Message ID | 20230303213851.2090365-1-joel@joelfernandes.org (mailing list archive)
---|---
State | New, archived
Series | [v3] rcu: Add a minimum time for marking boot as completed
On Fri, Mar 03, 2023 at 09:38:51PM +0000, Joel Fernandes (Google) wrote: > On many systems, a great deal of boot (in userspace) happens after the > kernel thinks the boot has completed. It is difficult to determine if > the system has really booted from the kernel side. Some features like > lazy-RCU can risk slowing down boot time if, say, a callback has been > added that the boot synchronously depends on. Further, expedited callbacks > can get unexpedited far earlier than they should be, thus slowing down > boot (as shown in the data below). > > For these reasons, this commit adds a config option > 'CONFIG_RCU_BOOT_END_DELAY' and a boot parameter rcupdate.rcu_boot_end_delay. > Userspace can also mark RCU's view of the system as booted, by writing the > time in milliseconds to: /sys/module/rcupdate/parameters/rcu_boot_end_delay > Or even just writing a value of 0 to this sysfs node. > However, under no circumstance will the boot be allowed to end earlier > than just before init is launched. > > The default value of CONFIG_RCU_BOOT_END_DELAY is chosen as 15s. This > suits ChromeOS and also a PREEMPT_RT system below very well, which need > no config or parameter changes, and just a simple application of this patch. A > system designer can also choose a specific value here to keep RCU from marking > boot completion. As noted earlier, RCU's perspective of the system as booted > will not be marked until at least rcu_boot_end_delay milliseconds have passed > or an update is made via writing a small value (or 0) in milliseconds to: > /sys/module/rcupdate/parameters/rcu_boot_end_delay. > > One side effect of this patch is that there is a risk that a real-time workload > launched just after the kernel boots will suffer interruptions due to expedited > RCU, which previously ended just before init was launched. 
However, to mitigate > such an issue (however unlikely), the user should either tune > CONFIG_RCU_BOOT_END_DELAY to a smaller value than 15 seconds or write a value > of 0 to /sys/module/rcupdate/parameters/rcu_boot_end_delay, once userspace > boots, and before launching the real-time workload. Much better, thank you! > Qiuxu also noted impressive boot-time improvements with earlier version > of patch. An excerpt from the data he shared: > > 1) Testing environment: > OS : CentOS Stream 8 (non-RT OS) > Kernel : v6.2 > Machine : Intel Cascade Lake server (2 sockets, each with 44 logical threads) > Qemu args : -cpu host -enable-kvm, -smp 88,threads=2,sockets=2, … > > 2) OS boot time definition: > The time from the start of the kernel boot to the shell command line > prompt is shown from the console. [ Different people may have > different OS boot time definitions. ] > > 3) Measurement method (very rough method): > A timer in the kernel periodically prints the boot time every 100ms. > As soon as the shell command line prompt is shown from the console, > we record the boot time printed by the timer, then the printed boot > time is the OS boot time. > > 4) Measured OS boot time (in seconds) > a) Measured 10 times w/o this patch: > 8.7s, 8.4s, 8.6s, 8.2s, 9.0s, 8.7s, 8.8s, 9.3s, 8.8s, 8.3s > The average OS boot time was: ~8.7s > > b) Measure 10 times w/ this patch: > 8.5s, 8.2s, 7.6s, 8.2s, 8.7s, 8.2s, 7.8s, 8.2s, 9.3s, 8.4s > The average OS boot time was: ~8.3s. Unfortunately, given that a's average is within one standard deviation of b's average, this is most definitely not statistically significant. Especially given only ten measurements for each case -- you need *at* *least* 24, preferably more. Especially in this case, where you don't really know what the underlying distribution is. But we can apply the binomial distribution instead of the usual normal distribution. 
First, let's sort and take the medians: a: 8.2 8.3 8.4 8.6 8.7 8.7 8.8 8.8 9.0 9.3 Median: 8.7 b: 7.6 7.8 8.2 8.2 8.2 8.2 8.4 8.5 8.7 9.3 Median: 8.2 8/10 of a's data points are greater than 0.1 more than b's median and 8/10 of b's data points are less than 0.1 less than a's median. What are the odds that this happens by random chance? This is given by sum_{i=0}^{2} (0.5^10 * binomial(10,i)), which is about 0.055. This is not quite 95% confidence, so not hugely convincing, but it is at least close. Note that this is the confidence that (b) is 100ms faster than (a), not just that (b) is faster than (a). Not sure that this really carries its weight, but in contrast to the usual statistics based on the normal distribution, it does suggest at least a little improvement. On the other hand, anyone who has carefully studied nonparametric statistics probably jumped out of the boat several paragraphs ago. ;-) A few more questions interspersed below. Thanx, Paul > Tested-by: Qiuxu Zhuo <qiuxu.zhuo@intel.com> > Signed-off-by: Joel Fernandes (Google) <joel@joelfernandes.org> > --- > v1->v2: > Update some comments and description. > v2->v3: > Add sysfs param, and update with Test data. > > .../admin-guide/kernel-parameters.txt | 12 ++++ > cc_list | 8 +++ > kernel/rcu/Kconfig | 19 ++++++ > kernel/rcu/update.c | 68 ++++++++++++++++++- > 4 files changed, 106 insertions(+), 1 deletion(-) > create mode 100644 cc_list > > diff --git a/Documentation/admin-guide/kernel-parameters.txt b/Documentation/admin-guide/kernel-parameters.txt > index 2429b5e3184b..611de90d9c13 100644 > --- a/Documentation/admin-guide/kernel-parameters.txt > +++ b/Documentation/admin-guide/kernel-parameters.txt > @@ -5085,6 +5085,18 @@ > rcutorture.verbose= [KNL] > Enable additional printk() statements. 
> > + rcupdate.rcu_boot_end_delay= [KNL] > + Minimum time in milliseconds that must elapse > + before the boot sequence can be marked complete > + from RCU's perspective, after which RCU's behavior > + becomes more relaxed. The default value is also > + configurable via CONFIG_RCU_BOOT_END_DELAY. > + Userspace can also mark the boot as completed > + sooner by writing the time in milliseconds, say once > + userspace considers the system as booted, to: > + /sys/module/rcupdate/parameters/rcu_boot_end_delay > + Or even just writing a value of 0 to this sysfs node. Can userspace also extend the time in this manner? I am not too worried either way, but it would be good to make this clear. If userspace writes a non-zero value, is that from the current time or from boot? > + > rcupdate.rcu_cpu_stall_ftrace_dump= [KNL] > Dump ftrace buffer after reporting RCU CPU > stall warning. > diff --git a/kernel/rcu/Kconfig b/kernel/rcu/Kconfig > index 9071182b1284..4b5ffa36cbaf 100644 > --- a/kernel/rcu/Kconfig > +++ b/kernel/rcu/Kconfig > @@ -217,6 +217,25 @@ config RCU_BOOST_DELAY > > Accept the default if unsure. > > +config RCU_BOOT_END_DELAY > + int "Minimum time before RCU may consider in-kernel boot as completed" > + range 0 120000 > + default 15000 > + help > + Default value of the minimum time in milliseconds that must elapse > + before the boot sequence can be marked complete from RCU's perspective, > + after which RCU's behavior becomes more relaxed. > + Userspace can also mark the boot as completed sooner than this default > + by writing the time in milliseconds, say once userspace considers > + the system as booted, to: /sys/module/rcupdate/parameters/rcu_boot_end_delay. > + Or even just writing a value of 0 to this sysfs node. > + > + The actual delay for RCU's view of the system to be marked as booted can be > + higher than this value if the kernel takes a long time to initialize but it > + will never be smaller than this value. > + > + Accept the default if unsure. 
> + > config RCU_EXP_KTHREAD > bool "Perform RCU expedited work in a real-time kthread" > depends on RCU_BOOST && RCU_EXPERT > diff --git a/kernel/rcu/update.c b/kernel/rcu/update.c > index 19bf6fa3ee6a..93138c92136e 100644 > --- a/kernel/rcu/update.c > +++ b/kernel/rcu/update.c > @@ -224,18 +224,84 @@ void rcu_unexpedite_gp(void) > } > EXPORT_SYMBOL_GPL(rcu_unexpedite_gp); > > +/* > + * Minimum time in milliseconds until RCU can consider in-kernel boot as > + * completed. This can also be tuned at runtime to end the boot earlier, by > + * userspace init code writing the time in milliseconds (even 0) to: > + * /sys/module/rcupdate/parameters/rcu_boot_end_delay > + */ > +static int rcu_boot_end_delay = CONFIG_RCU_BOOT_END_DELAY; > + > static bool rcu_boot_ended __read_mostly; > +static bool rcu_boot_end_called __read_mostly; > +static DEFINE_MUTEX(rcu_boot_end_lock); > + > +static int param_set_rcu_boot_end(const char *val, const struct kernel_param *kp) > +{ > + uint end_ms; > + int ret = kstrtouint(val, 0, &end_ms); > + > + if (ret) > + return ret; > + WRITE_ONCE(*(uint *)kp->arg, end_ms); Doesn't this write to rcu_boot_end_delay outside of the lock? > + > + /* > + * rcu_end_inkernel_boot() should be called at least once during init > + * before we can allow param changes to end the boot. > + */ > + mutex_lock(&rcu_boot_end_lock); > + rcu_boot_end_delay = end_ms; > + if (!rcu_boot_ended && rcu_boot_end_called) { > + mutex_unlock(&rcu_boot_end_lock); > + rcu_end_inkernel_boot(); Temporarily dropping rcu_boot_end_lock looks like an accident waiting to happen. > + } > + mutex_unlock(&rcu_boot_end_lock); And dropping it twice does not seem good, either. Or am I missing some subtle control-flow trick? 
> + return ret; > +} > + > +static const struct kernel_param_ops rcu_boot_end_ops = { > + .set = param_set_rcu_boot_end, > + .get = param_get_uint, > +}; > +module_param_cb(rcu_boot_end_delay, &rcu_boot_end_ops, &rcu_boot_end_delay, 0644); > > /* > - * Inform RCU of the end of the in-kernel boot sequence. > + * Inform RCU of the end of the in-kernel boot sequence. The boot sequence will > + * not be marked ended until at least rcu_boot_end_delay milliseconds have passed. > */ > +void rcu_end_inkernel_boot(void); > +static void rcu_boot_end_work_fn(struct work_struct *work) > +{ > + rcu_end_inkernel_boot(); > +} > +static DECLARE_DELAYED_WORK(rcu_boot_end_work, rcu_boot_end_work_fn); > + > void rcu_end_inkernel_boot(void) > { > + mutex_lock(&rcu_boot_end_lock); > + rcu_boot_end_called = true; > + > + if (rcu_boot_ended) > + return; > + > + if (rcu_boot_end_delay) { > + u64 boot_ms = div_u64(ktime_get_boot_fast_ns(), 1000000UL); > + > + if (boot_ms < rcu_boot_end_delay) { Isn't it necessary to cancel a previously scheduled work to make sure that the new value overrides the old one? Mightn't this be simpler if the user was only permitted to write zero, thus just saying "stop immediately"? If people really need the ability to extend or shorten the time, a patch can be produced at that point. And then a non-zero write to the file would become legal. > + schedule_delayed_work(&rcu_boot_end_work, > + rcu_boot_end_delay - boot_ms); > + mutex_unlock(&rcu_boot_end_lock); > + return; > + } > + } > + > + cancel_delayed_work(&rcu_boot_end_work); > rcu_unexpedite_gp(); > rcu_async_relax(); > if (rcu_normal_after_boot) > WRITE_ONCE(rcu_normal, 1); > rcu_boot_ended = true; > + mutex_unlock(&rcu_boot_end_lock); > } > > /* > -- > 2.40.0.rc0.216.gc4246ad0f0-goog
Hi Paul, On Fri, Mar 03, 2023 at 05:02:51PM -0800, Paul E. McKenney wrote: [..] > > Qiuxu also noted impressive boot-time improvements with earlier version > > of patch. An excerpt from the data he shared: > > > > 1) Testing environment: > > OS : CentOS Stream 8 (non-RT OS) > > Kernel : v6.2 > > Machine : Intel Cascade Lake server (2 sockets, each with 44 logical threads) > > Qemu args : -cpu host -enable-kvm, -smp 88,threads=2,sockets=2, … > > > > 2) OS boot time definition: > > The time from the start of the kernel boot to the shell command line > > prompt is shown from the console. [ Different people may have > > different OS boot time definitions. ] > > > > 3) Measurement method (very rough method): > > A timer in the kernel periodically prints the boot time every 100ms. > > As soon as the shell command line prompt is shown from the console, > > we record the boot time printed by the timer, then the printed boot > > time is the OS boot time. > > > > 4) Measured OS boot time (in seconds) > > a) Measured 10 times w/o this patch: > > 8.7s, 8.4s, 8.6s, 8.2s, 9.0s, 8.7s, 8.8s, 9.3s, 8.8s, 8.3s > > The average OS boot time was: ~8.7s > > > > b) Measure 10 times w/ this patch: > > 8.5s, 8.2s, 7.6s, 8.2s, 8.7s, 8.2s, 7.8s, 8.2s, 9.3s, 8.4s > > The average OS boot time was: ~8.3s. > > Unfortunately, given that a's average is within one standard deviation > of b's average, this is most definitely not statistically significant. > Especially given only ten measurements for each case -- you need *at* > *least* 24, preferably more. Especially in this case, where you don't > really know what the underlying distribution is. > > But we can apply the binomial distribution instead of the usual > normal distribution. 
First, let's sort and take the medians: > > a: 8.2 8.3 8.4 8.6 8.7 8.7 8.8 8.8 9.0 9.3 Median: 8.7 > b: 7.6 7.8 8.2 8.2 8.2 8.2 8.4 8.5 8.7 9.3 Median: 8.2 > > 8/10 of a's data points are greater than 0.1 more than b's median > and 8/10 of b's data points are less than 0.1 less than a's median. > What are the odds that this happens by random chance? > > This is given by sum_0^2 (0.5^10 * binomial(10,i)), which is about 0.055. > This is not quite 95% confidence, so not hugely convincing, but it is at > least close. Not that this is the confidence that (b) is 100ms faster > than (a), not just that (b) is faster than (a). > > Not sure that this really carries its weight, but in contrast to the > usual statistics based on the normal distribution, it does suggest at > least a little improvement. On the other hand, anyone who has carefully > studied nonparametric statistics probably jumped out of the boat several > paragraphs ago. ;-) Thanks for the analysis, I did feel the samples were few. I am happy to update it with more data if Qiuxu can collect more samples and provide. [..] > > --- a/Documentation/admin-guide/kernel-parameters.txt > > +++ b/Documentation/admin-guide/kernel-parameters.txt > > @@ -5085,6 +5085,18 @@ > > rcutorture.verbose= [KNL] > > Enable additional printk() statements. > > > > + rcupdate.rcu_boot_end_delay= [KNL] > > + Minimum time in milliseconds that must elapse > > + before the boot sequence can be marked complete > > + from RCU's perspective, after which RCU's behavior > > + becomes more relaxed. The default value is also > > + configurable via CONFIG_RCU_BOOT_END_DELAY. > > + Userspace can also mark the boot as completed > > + sooner by writing the time in milliseconds, say once > > + userspace considers the system as booted, to: > > + /sys/module/rcupdate/parameters/rcu_boot_end_delay > > + Or even just writing a value of 0 to this sysfs node. > > Can userspace also extend the time in this manner? 
I am not too worried > either way, but it would be good to make this clear. Yes, it can be extended because once the default timer fires, it will schedule a new timer to account for that. Thanks, I'll clarify in the above docs. > If userspace writes a non-zero value, is that from the current time or > from boot? Good point, it is from the start of boot always, I fixed it. [..] > > diff --git a/kernel/rcu/update.c b/kernel/rcu/update.c > > index 19bf6fa3ee6a..93138c92136e 100644 > > --- a/kernel/rcu/update.c > > +++ b/kernel/rcu/update.c > > @@ -224,18 +224,84 @@ void rcu_unexpedite_gp(void) > > } > > EXPORT_SYMBOL_GPL(rcu_unexpedite_gp); > > > > +/* > > + * Minimum time in milliseconds until RCU can consider in-kernel boot as > > + * completed. This can also be tuned at runtime to end the boot earlier, by > > + * userspace init code writing the time in milliseconds (even 0) to: > > + * /sys/module/rcupdate/parameters/rcu_boot_end_delay > > + */ > > +static int rcu_boot_end_delay = CONFIG_RCU_BOOT_END_DELAY; > > + > > static bool rcu_boot_ended __read_mostly; > > +static bool rcu_boot_end_called __read_mostly; > > +static DEFINE_MUTEX(rcu_boot_end_lock); > > + > > +static int param_set_rcu_boot_end(const char *val, const struct kernel_param *kp) > > +{ > > + uint end_ms; > > + int ret = kstrtouint(val, 0, &end_ms); > > + > > + if (ret) > > + return ret; > > + WRITE_ONCE(*(uint *)kp->arg, end_ms); > > Doesn't this write to rcu_boot_end_delay outside of the lock? True, but actually I realize I don't even need to do it because I overwrite it in the next step ;-). So I'll just remove it. > > + > > + /* > > + * rcu_end_inkernel_boot() should be called at least once during init > > + * before we can allow param changes to end the boot. 
> > + */ > > + mutex_lock(&rcu_boot_end_lock); > > + rcu_boot_end_delay = end_ms; > > + if (!rcu_boot_ended && rcu_boot_end_called) { > > + mutex_unlock(&rcu_boot_end_lock); > > + rcu_end_inkernel_boot(); > > Temporarily dropping rcu_boot_end_lock looks like an accident waiting > to happen. > > > + } > > + mutex_unlock(&rcu_boot_end_lock); > > And dropping it twice does not seem good, either. Or am I missing some > subtle control-flow trick? You are quite right, sorry to miss it. To prevent this sort of issue happening again, I moved the locking to the caller which also simplifies the code a bit and prevents such traps. > > + return ret; > > +} > > + > > +static const struct kernel_param_ops rcu_boot_end_ops = { > > + .set = param_set_rcu_boot_end, > > + .get = param_get_uint, > > +}; > > +module_param_cb(rcu_boot_end_delay, &rcu_boot_end_ops, &rcu_boot_end_delay, 0644); > > > > /* > > - * Inform RCU of the end of the in-kernel boot sequence. > > + * Inform RCU of the end of the in-kernel boot sequence. The boot sequence will > > + * not be marked ended until at least rcu_boot_end_delay milliseconds have passed. > > */ > > +void rcu_end_inkernel_boot(void); > > +static void rcu_boot_end_work_fn(struct work_struct *work) > > +{ > > + rcu_end_inkernel_boot(); > > +} > > +static DECLARE_DELAYED_WORK(rcu_boot_end_work, rcu_boot_end_work_fn); > > + > > void rcu_end_inkernel_boot(void) > > { > > + mutex_lock(&rcu_boot_end_lock); > > + rcu_boot_end_called = true; > > + > > + if (rcu_boot_ended) > > + return; > > + > > + if (rcu_boot_end_delay) { > > + u64 boot_ms = div_u64(ktime_get_boot_fast_ns(), 1000000UL); > > + > > + if (boot_ms < rcu_boot_end_delay) { > > Isn't it necessary to cancel a previously scheduled work to make sure > that the new value overrides the old one? No it is not necessary, as we can keep the older timer and once it fires, its callback will call rcu_end_inkernel_boot() which will queue another timer to extend the delay further. 
As long as 'rcu_boot_end_delay' is updated, it will work fine. You can see that in test #3 below. Actually this part of the code is equivalent to what I had in my first patch, so it is not any new code I am adding. > Mightn't this be simpler if the user was only permitted to write zero, > thus just saying "stop immediately"? If people really need the ability > to extend or shorten the time, a patch can be produced at that point. > And then a non-zero write to the file would become legal. I prefer to keep it this way: with this method, I not only get a tunable rcu_boot_end_delay via the boot parameter (as in my first patch), I also don't need to add a separate sysfs entry and can just reuse the 'rcu_boot_end_delay' parameter, which I also had in my first patch. Adding yet another sysfs parameter would actually complicate it even more and add more lines of code. I tested different scenarios and it works fine. Though I unfortunately missed that mutex locking issue, I did verify by manual testing that the different test cases work as expected. Here are some printks from simple testing in Qemu: 1. End the boot early, CONFIG is set to 120 seconds: ================================================== [ 1.614968] rcu_boot_end_delay = 120000 [ 1.617630] schedule delayed work joel Boot took 1.57 seconds root@(none):/# cat /sys/module/rcupdate/parameters/rcu_boot_end_delay 120000 root@(none):/# root@(none):/# root@(none):/# echo 0 > /sys/module/rcupdate/parameters/rcu_boot_end_delay [ 10.108394] param called joel [ 10.110520] sys calling boot ended [ 10.112730] rcu_boot_end_delay = 0 [ 10.115017] boot ended joel ----------------------------------------------- 2. End the boot passing in rcupdate.rcu_boot_end_delay as 10s. 
This should override the CONFIG of 120 seconds: ================================================== [ 1.700090] rcu_boot_end_delay = 10000 [ 1.702628] schedule delayed work joel Boot took 1.64 seconds root@(none):/# [ 10.414008] rcu_boot_end_delay = 10000 [ 10.416670] boot ended joel ----------------------------------------------- 3. Do the same thing as #2, but extend the boot via sysfs to be longer than 10 seconds: ================================================== [ 0.060025] param called joel [ 0.060026] param called too early joel [ 1.663905] rcu_boot_end_delay = 10000 [ 1.667051] schedule delayed work joel Boot took 1.61 seconds root@(none):/# root@(none):/# echo 20000 > /sys/module/rcupdate/parameters/rcu_boot_end_delay [ 6.932517] param called joel [ 6.934637] sys calling boot ended [ 6.936845] rcu_boot_end_delay = 20000 [ 6.939291] schedule delayed work joel root@(none):/# [ 10.389366] rcu_boot_end_delay = 20000 [ 10.392047] schedule delayed work joel [ 20.117416] rcu_boot_end_delay = 20000 [ 20.120073] boot ended joel ----------------------------------------------- The debug patch is here: https://git.kernel.org/pub/scm/linux/kernel/git/jfern/linux.git/commit/?h=rcu/lazy/postboot Appended is the updated v4 patch, tested as shown above; more testing is in progress. thanks, - Joel ---8<----------------------- From: "Joel Fernandes (Google)" <joel@joelfernandes.org> Subject: [PATCH v4] rcu: Add a minimum time for marking boot as completed On many systems, a great deal of boot (in userspace) happens after the kernel thinks the boot has completed. It is difficult to determine if the system has really booted from the kernel side. Some features like lazy-RCU can risk slowing down boot time if, say, a callback has been added that the boot synchronously depends on. Further, expedited callbacks can get unexpedited far earlier than they should be, thus slowing down boot (as shown in the data below). 
For these reasons, this commit adds a config option 'CONFIG_RCU_BOOT_END_DELAY' and a boot parameter rcupdate.rcu_boot_end_delay. Userspace can also mark RCU's view of the system as booted, by writing the time in milliseconds to: /sys/module/rcupdate/parameters/rcu_boot_end_delay Or even just writing a value of 0 to this sysfs node. However, under no circumstance will the boot be allowed to end earlier than just before init is launched. The default value of CONFIG_RCU_BOOT_END_DELAY is chosen as 15s. This suits ChromeOS and also a PREEMPT_RT system below very well, which need no config or parameter changes, and just a simple application of this patch. A system designer can also choose a specific value here to keep RCU from marking boot completion. As noted earlier, RCU's perspective of the system as booted will not be marked until at least rcu_boot_end_delay milliseconds have passed or an update is made via writing a small value (or 0) in milliseconds to: /sys/module/rcupdate/parameters/rcu_boot_end_delay. One side effect of this patch is that there is a risk that a real-time workload launched just after the kernel boots will suffer interruptions due to expedited RCU, which previously ended just before init was launched. However, to mitigate such an issue (however unlikely), the user should either tune CONFIG_RCU_BOOT_END_DELAY to a smaller value than 15 seconds or write a value of 0 to /sys/module/rcupdate/parameters/rcu_boot_end_delay, once userspace boots, and before launching the real-time workload. Qiuxu also noted impressive boot-time improvements with an earlier version of the patch. An excerpt from the data he shared: 1) Testing environment: OS : CentOS Stream 8 (non-RT OS) Kernel : v6.2 Machine : Intel Cascade Lake server (2 sockets, each with 44 logical threads) Qemu args : -cpu host -enable-kvm, -smp 88,threads=2,sockets=2, … 2) OS boot time definition: The time from the start of the kernel boot to when the shell command line prompt is shown on the console. 
[ Different people may have different OS boot time definitions. ] 3) Measurement method (very rough method): A timer in the kernel periodically prints the boot time every 100ms. As soon as the shell command line prompt is shown on the console, we record the boot time printed by the timer, then the printed boot time is the OS boot time. 4) Measured OS boot time (in seconds) a) Measured 10 times w/o this patch: 8.7s, 8.4s, 8.6s, 8.2s, 9.0s, 8.7s, 8.8s, 9.3s, 8.8s, 8.3s The average OS boot time was: ~8.7s b) Measured 10 times w/ this patch: 8.5s, 8.2s, 7.6s, 8.2s, 8.7s, 8.2s, 7.8s, 8.2s, 9.3s, 8.4s The average OS boot time was: ~8.3s. option-prefix PATCH v4 option-start Tested-by: Qiuxu Zhuo <qiuxu.zhuo@intel.com> Signed-off-by: Joel Fernandes (Google) <joel@joelfernandes.org> diff-note-start v1->v2: Update some comments and description. v2->v3: Add sysfs param, and update with Test data. v3->v4: Fix locking bug found by Paul, make code more robust by refactoring locking code. Doc updates. --- .../admin-guide/kernel-parameters.txt | 15 ++++ cc_list | 8 ++ kernel/rcu/Kconfig | 21 ++++++ kernel/rcu/update.c | 74 ++++++++++++++++++- 4 files changed, 116 insertions(+), 2 deletions(-) create mode 100644 cc_list diff --git a/Documentation/admin-guide/kernel-parameters.txt b/Documentation/admin-guide/kernel-parameters.txt index 2429b5e3184b..878c2780f5db 100644 --- a/Documentation/admin-guide/kernel-parameters.txt +++ b/Documentation/admin-guide/kernel-parameters.txt @@ -5085,6 +5085,21 @@ rcutorture.verbose= [KNL] Enable additional printk() statements. + rcupdate.rcu_boot_end_delay= [KNL] + Minimum time in milliseconds from the start of boot + that must elapse before the boot sequence can be marked + complete from RCU's perspective, after which RCU's + behavior becomes more relaxed. The default value is also + configurable via CONFIG_RCU_BOOT_END_DELAY. 
+ Userspace can also mark the boot as completed + sooner by writing the time in milliseconds, say once + userspace considers the system as booted, to: + /sys/module/rcupdate/parameters/rcu_boot_end_delay + Or even just writing a value of 0 to this sysfs node. + The sysfs node can also be used to extend the delay + to be larger than the default, assuming the marking + of boot complete has not yet occurred. + rcupdate.rcu_cpu_stall_ftrace_dump= [KNL] Dump ftrace buffer after reporting RCU CPU stall warning. diff --git a/cc_list b/cc_list new file mode 100644 index 000000000000..7daed4877f5a --- /dev/null +++ b/cc_list @@ -0,0 +1,8 @@ +Frederic Weisbecker <frederic@kernel.org> +Joel Fernandes <joel@joelfernandes.org> +Lai Jiangshan <jiangshanlai@gmail.com> +linux-doc@vger.kernel.org +linux-kernel@vger.kernel.org +"Paul E. McKenney" <paulmck@kernel.org> +rcu@vger.kernel.org +urezki@gmail.com diff --git a/kernel/rcu/Kconfig b/kernel/rcu/Kconfig index 9071182b1284..97f68120d1c0 100644 --- a/kernel/rcu/Kconfig +++ b/kernel/rcu/Kconfig @@ -217,6 +217,27 @@ config RCU_BOOST_DELAY Accept the default if unsure. +config RCU_BOOT_END_DELAY + int "Minimum time before RCU may consider in-kernel boot as completed" + range 0 120000 + default 15000 + help + Default value of the minimum time in milliseconds from the start of boot + that must elapse before the boot sequence can be marked complete from RCU's + perspective, after which RCU's behavior becomes more relaxed. + Userspace can also mark the boot as completed sooner than this default + by writing the time in milliseconds, say once userspace considers + the system as booted, to: /sys/module/rcupdate/parameters/rcu_boot_end_delay. + Or even just writing a value of 0 to this sysfs node. The sysfs node can + also be used to extend the delay to be larger than the default, assuming + the marking of boot completion has not yet occurred. 
+ + The actual delay for RCU's view of the system to be marked as booted can be + higher than this value if the kernel takes a long time to initialize but it + will never be smaller than this value. + + Accept the default if unsure. + config RCU_EXP_KTHREAD bool "Perform RCU expedited work in a real-time kthread" depends on RCU_BOOST && RCU_EXPERT diff --git a/kernel/rcu/update.c b/kernel/rcu/update.c index 19bf6fa3ee6a..18ed3c15e6b5 100644 --- a/kernel/rcu/update.c +++ b/kernel/rcu/update.c @@ -224,13 +224,50 @@ void rcu_unexpedite_gp(void) } EXPORT_SYMBOL_GPL(rcu_unexpedite_gp); +/* + * Minimum time in milliseconds from the start of boot until RCU can consider + * in-kernel boot as completed. This can also be tuned at runtime to end the + * boot earlier, by userspace init code writing the time in milliseconds (even + * 0) to: /sys/module/rcupdate/parameters/rcu_boot_end_delay. The sysfs node + * can also be used to extend the delay to be larger than the default, assuming + * the marking of boot complete has not yet occurred. + */ +static int rcu_boot_end_delay = CONFIG_RCU_BOOT_END_DELAY; + static bool rcu_boot_ended __read_mostly; +static bool rcu_boot_end_called __read_mostly; +static DEFINE_MUTEX(rcu_boot_end_lock); /* - * Inform RCU of the end of the in-kernel boot sequence. + * Inform RCU of the end of the in-kernel boot sequence. The boot sequence will + * not be marked ended until at least rcu_boot_end_delay milliseconds have passed. */ -void rcu_end_inkernel_boot(void) +void rcu_end_inkernel_boot(void); +static void rcu_boot_end_work_fn(struct work_struct *work) +{ + rcu_end_inkernel_boot(); +} +static DECLARE_DELAYED_WORK(rcu_boot_end_work, rcu_boot_end_work_fn); + +/* Must be called with rcu_boot_end_lock held. 
*/ +static void rcu_end_inkernel_boot_locked(void) { + rcu_boot_end_called = true; + + if (rcu_boot_ended) + return; + + if (rcu_boot_end_delay) { + u64 boot_ms = div_u64(ktime_get_boot_fast_ns(), 1000000UL); + + if (boot_ms < rcu_boot_end_delay) { + schedule_delayed_work(&rcu_boot_end_work, + rcu_boot_end_delay - boot_ms); + return; + } + } + + cancel_delayed_work(&rcu_boot_end_work); rcu_unexpedite_gp(); rcu_async_relax(); if (rcu_normal_after_boot) @@ -238,6 +275,39 @@ void rcu_end_inkernel_boot(void) rcu_boot_ended = true; } +void rcu_end_inkernel_boot(void) +{ + mutex_lock(&rcu_boot_end_lock); + rcu_end_inkernel_boot_locked(); + mutex_unlock(&rcu_boot_end_lock); +} + +static int param_set_rcu_boot_end(const char *val, const struct kernel_param *kp) +{ + uint end_ms; + int ret = kstrtouint(val, 0, &end_ms); + + if (ret) + return ret; + /* + * rcu_end_inkernel_boot() should be called at least once during init + * before we can allow param changes to end the boot. + */ + mutex_lock(&rcu_boot_end_lock); + rcu_boot_end_delay = end_ms; + if (!rcu_boot_ended && rcu_boot_end_called) { + rcu_end_inkernel_boot_locked(); + } + mutex_unlock(&rcu_boot_end_lock); + return ret; +} + +static const struct kernel_param_ops rcu_boot_end_ops = { + .set = param_set_rcu_boot_end, + .get = param_get_uint, +}; +module_param_cb(rcu_boot_end_delay, &rcu_boot_end_ops, &rcu_boot_end_delay, 0644); + /* * Let rcutorture know when it is OK to turn it up to eleven. */
On Fri, Mar 03, 2023 at 09:38:51PM +0000, Joel Fernandes (Google) wrote: > On many systems, a great deal of boot (in userspace) happens after the > kernel thinks the boot has completed. It is difficult to determine if > the system has really booted from the kernel side. Some features like > lazy-RCU can risk slowing down boot time if, say, a callback has been > added that the boot synchronously depends on. Further expedited callbacks > can get unexpedited way earlier than it should be, thus slowing down > boot (as shown in the data below). > > For these reasons, this commit adds a config option > 'CONFIG_RCU_BOOT_END_DELAY' and a boot parameter rcupdate.boot_end_delay. > Userspace can also make RCU's view of the system as booted, by writing the > time in milliseconds to: /sys/module/rcupdate/parameters/rcu_boot_end_delay > Or even just writing a value of 0 to this sysfs node. > However, under no circumstance will the boot be allowed to end earlier > than just before init is launched. > > The default value of CONFIG_RCU_BOOT_END_DELAY is chosen as 15s. This > suites ChromeOS and also a PREEMPT_RT system below very well, which need > no config or parameter changes, and just a simple application of this patch. A > system designer can also choose a specific value here to keep RCU from marking > boot completion. As noted earlier, RCU's perspective of the system as booted > will not be marker until at least rcu_boot_end_delay milliseconds have passed > or an update is made via writing a small value (or 0) in milliseconds to: > /sys/module/rcupdate/parameters/rcu_boot_end_delay. > > One side-effect of this patch is, there is a risk that a real-time workload > launched just after the kernel boots will suffer interruptions due to expedited > RCU, which previous ended just before init was launched. 
However, to mitigate
> such an issue (however unlikely), the user should either tune
> CONFIG_RCU_BOOT_END_DELAY to a smaller value than 15 seconds or write a value
> of 0 to /sys/module/rcupdate/parameters/rcu_boot_end_delay, once userspace
> boots, and before launching the real-time workload.
>
> Qiuxu also noted impressive boot-time improvements with an earlier version
> of this patch. An excerpt from the data he shared:
>
> 1) Testing environment:
>    OS        : CentOS Stream 8 (non-RT OS)
>    Kernel    : v6.2
>    Machine   : Intel Cascade Lake server (2 sockets, each with 44 logical threads)
>    Qemu args : -cpu host -enable-kvm, -smp 88,threads=2,sockets=2, …
>
> 2) OS boot time definition:
>    The time from the start of the kernel boot until the shell command line
>    prompt is shown on the console. [ Different people may have
>    different OS boot time definitions. ]
>
> 3) Measurement method (very rough method):
>    A timer in the kernel periodically prints the boot time every 100ms.
>    As soon as the shell command line prompt is shown on the console,
>    we record the boot time printed by the timer; that printed boot
>    time is the OS boot time.
>
> 4) Measured OS boot time (in seconds)
>    a) Measured 10 times w/o this patch:
>       8.7s, 8.4s, 8.6s, 8.2s, 9.0s, 8.7s, 8.8s, 9.3s, 8.8s, 8.3s
>       The average OS boot time was: ~8.7s
>
>    b) Measured 10 times w/ this patch:
>       8.5s, 8.2s, 7.6s, 8.2s, 8.7s, 8.2s, 7.8s, 8.2s, 9.3s, 8.4s
>       The average OS boot time was: ~8.3s.
>
> Tested-by: Qiuxu Zhuo <qiuxu.zhuo@intel.com>
> Signed-off-by: Joel Fernandes (Google) <joel@joelfernandes.org>
> ---
> v1->v2:
>    Update some comments and description.
> v2->v3:
>    Add sysfs param, and update with test data.
>
> .../admin-guide/kernel-parameters.txt |  12 ++++
> cc_list                               |   8 +++
> kernel/rcu/Kconfig                    |  19 ++++++
> kernel/rcu/update.c                   |  68 ++++++++++++++++++-
> 4 files changed, 106 insertions(+), 1 deletion(-)
> create mode 100644 cc_list
>
> diff --git a/Documentation/admin-guide/kernel-parameters.txt b/Documentation/admin-guide/kernel-parameters.txt
> index 2429b5e3184b..611de90d9c13 100644
> --- a/Documentation/admin-guide/kernel-parameters.txt
> +++ b/Documentation/admin-guide/kernel-parameters.txt
> @@ -5085,6 +5085,18 @@
>  	rcutorture.verbose= [KNL]
>  			Enable additional printk() statements.
>  
> +	rcupdate.rcu_boot_end_delay= [KNL]
> +			Minimum time in milliseconds that must elapse
> +			before the boot sequence can be marked complete
> +			from RCU's perspective, after which RCU's behavior
> +			becomes more relaxed. The default value is also
> +			configurable via CONFIG_RCU_BOOT_END_DELAY.
> +			Userspace can also mark the boot as completed
> +			sooner by writing the time in milliseconds, say once
> +			userspace considers the system as booted, to:
> +			/sys/module/rcupdate/parameters/rcu_boot_end_delay
> +			Or even just writing a value of 0 to this sysfs node.
> +
>  	rcupdate.rcu_cpu_stall_ftrace_dump= [KNL]
>  			Dump ftrace buffer after reporting RCU CPU
>  			stall warning.
> diff --git a/kernel/rcu/Kconfig b/kernel/rcu/Kconfig
> index 9071182b1284..4b5ffa36cbaf 100644
> --- a/kernel/rcu/Kconfig
> +++ b/kernel/rcu/Kconfig
> @@ -217,6 +217,25 @@ config RCU_BOOST_DELAY
>  
>  	  Accept the default if unsure.
>  
> +config RCU_BOOT_END_DELAY
> +	int "Minimum time before RCU may consider in-kernel boot as completed"
> +	range 0 120000
> +	default 15000
> +	help
> +	  Default value of the minimum time in milliseconds that must elapse
> +	  before the boot sequence can be marked complete from RCU's perspective,
> +	  after which RCU's behavior becomes more relaxed.
> +	  Userspace can also mark the boot as completed sooner than this default
> +	  by writing the time in milliseconds, say once userspace considers
> +	  the system as booted, to: /sys/module/rcupdate/parameters/rcu_boot_end_delay.
> +	  Or even just writing a value of 0 to this sysfs node.
> +
> +	  The actual delay for RCU's view of the system to be marked as booted can be
> +	  higher than this value if the kernel takes a long time to initialize but it
> +	  will never be smaller than this value.
> +
> +	  Accept the default if unsure.
> +
>  config RCU_EXP_KTHREAD
>  	bool "Perform RCU expedited work in a real-time kthread"
>  	depends on RCU_BOOST && RCU_EXPERT
> diff --git a/kernel/rcu/update.c b/kernel/rcu/update.c
> index 19bf6fa3ee6a..93138c92136e 100644
> --- a/kernel/rcu/update.c
> +++ b/kernel/rcu/update.c
> @@ -224,18 +224,84 @@ void rcu_unexpedite_gp(void)
>  }
>  EXPORT_SYMBOL_GPL(rcu_unexpedite_gp);
>  
> +/*
> + * Minimum time in milliseconds until RCU can consider in-kernel boot as
> + * completed. This can also be tuned at runtime to end the boot earlier, by
> + * userspace init code writing the time in milliseconds (even 0) to:
> + * /sys/module/rcupdate/parameters/rcu_boot_end_delay
> + */
> +static int rcu_boot_end_delay = CONFIG_RCU_BOOT_END_DELAY;
> +
>  static bool rcu_boot_ended __read_mostly;
> +static bool rcu_boot_end_called __read_mostly;
> +static DEFINE_MUTEX(rcu_boot_end_lock);
> +
> +static int param_set_rcu_boot_end(const char *val, const struct kernel_param *kp)
> +{
> +	uint end_ms;
> +	int ret = kstrtouint(val, 0, &end_ms);
> +
> +	if (ret)
> +		return ret;
> +	WRITE_ONCE(*(uint *)kp->arg, end_ms);
> +
> +	/*
> +	 * rcu_end_inkernel_boot() should be called at least once during init
> +	 * before we can allow param changes to end the boot.
> +	 */
> +	mutex_lock(&rcu_boot_end_lock);
> +	rcu_boot_end_delay = end_ms;
> +	if (!rcu_boot_ended && rcu_boot_end_called) {
> +		mutex_unlock(&rcu_boot_end_lock);
> +		rcu_end_inkernel_boot();
> +	}
> +	mutex_unlock(&rcu_boot_end_lock);
> +	return ret;
> +}
> +
> +static const struct kernel_param_ops rcu_boot_end_ops = {
> +	.set = param_set_rcu_boot_end,
> +	.get = param_get_uint,
> +};
> +module_param_cb(rcu_boot_end_delay, &rcu_boot_end_ops, &rcu_boot_end_delay, 0644);
>  
>  /*
> - * Inform RCU of the end of the in-kernel boot sequence.
> + * Inform RCU of the end of the in-kernel boot sequence. The boot sequence will
> + * not be marked ended until at least rcu_boot_end_delay milliseconds have passed.
>   */
> +void rcu_end_inkernel_boot(void);
> +static void rcu_boot_end_work_fn(struct work_struct *work)
> +{
> +	rcu_end_inkernel_boot();
> +}
> +static DECLARE_DELAYED_WORK(rcu_boot_end_work, rcu_boot_end_work_fn);
> +
>  void rcu_end_inkernel_boot(void)
>  {
> +	mutex_lock(&rcu_boot_end_lock);
> +	rcu_boot_end_called = true;
> +
> +	if (rcu_boot_ended)
> +		return;
> +
> +	if (rcu_boot_end_delay) {
> +		u64 boot_ms = div_u64(ktime_get_boot_fast_ns(), 1000000UL);
> +
> +		if (boot_ms < rcu_boot_end_delay) {
> +			schedule_delayed_work(&rcu_boot_end_work,
> +					      rcu_boot_end_delay - boot_ms);

<snip>
urezki@pc638:~/data/raid0/coding/linux-rcu.git$ git diff
diff --git a/kernel/rcu/update.c b/kernel/rcu/update.c
index 93138c92136e..93f426f0f4ec 100644
--- a/kernel/rcu/update.c
+++ b/kernel/rcu/update.c
@@ -289,7 +289,7 @@ void rcu_end_inkernel_boot(void)
 
 		if (boot_ms < rcu_boot_end_delay) {
 			schedule_delayed_work(&rcu_boot_end_work,
-					      rcu_boot_end_delay - boot_ms);
+					      msecs_to_jiffies(rcu_boot_end_delay - boot_ms));
 			mutex_unlock(&rcu_boot_end_lock);
 			return;
 		}
urezki@pc638:~/data/raid0/coding/linux-rcu.git$
<snip>

I think you need to apply the above patch. I am not sure; maybe Paul
has already mentioned it. But just in case.

--
Uladzislau Rezki
> On Mar 5, 2023, at 6:39 AM, Uladzislau Rezki <urezki@gmail.com> wrote:
>
> On Fri, Mar 03, 2023 at 09:38:51PM +0000, Joel Fernandes (Google) wrote:
>> [ . . . ]
>
> [ . . . ]
>
> I think you need to apply the above patch.
> I am not sure; maybe Paul
> has already mentioned it. But just in case.

Ah, the reason my testing did not catch it is because for HZ=1000, msecs
and jiffies are the same.

Great eyes and thank you Vlad, I’ll make the fix and repost it.

 - Joel

>
> --
> Uladzislau Rezki
On Sun, Mar 05, 2023 at 12:39:01PM +0100, Uladzislau Rezki wrote:
> On Fri, Mar 03, 2023 at 09:38:51PM +0000, Joel Fernandes (Google) wrote:
> > [ . . . ]
>
> [ . . . ]
>
> I think you need to apply the above patch.
> I am not sure; maybe Paul
> has already mentioned it. But just in case.

No, I did miss that one, so thank you very much for spotting it!

							Thanx, Paul
> From: Paul E. McKenney <paulmck@kernel.org>
> [...]
> > Qiuxu also noted impressive boot-time improvements with an earlier
> > version of this patch. An excerpt from the data he shared:
> >
> > [ . . . ]
> >
> > 4) Measured OS boot time (in seconds)
> >    a) Measured 10 times w/o this patch:
> >       8.7s, 8.4s, 8.6s, 8.2s, 9.0s, 8.7s, 8.8s, 9.3s, 8.8s, 8.3s
> >       The average OS boot time was: ~8.7s
> >
> >    b) Measured 10 times w/ this patch:
> >       8.5s, 8.2s, 7.6s, 8.2s, 8.7s, 8.2s, 7.8s, 8.2s, 9.3s, 8.4s
> >       The average OS boot time was: ~8.3s.
>
> Unfortunately, given that a's average is within one standard deviation of b's
> average, this is most definitely not statistically significant.
> Especially given only ten measurements for each case -- you need *at*
> *least* 24, preferably more. Especially in this case, where you don't really
> know what the underlying distribution is.

Thank you so much Paul for the detailed comments on the measured data.

I'm curious how you figured out the number 24 that we at *least* need.
This can guide me on whether the number of samples is enough for
future testing ;-).

I did another 48 measurements (2x of 24) for each case
(w/o and w/ Joel's v2 patch) as below.
All the testing configurations for the new testing are the same as before.

a) Measured 48 times w/o v2 patch (in seconds):
   8.4, 8.8, 9.2, 9.0, 8.3, 9.6, 8.8, 9.4,
   8.7, 9.2, 8.3, 9.4, 8.4, 9.6, 8.5, 8.8,
   8.8, 8.9, 9.3, 9.2, 8.6, 9.7, 9.2, 8.8,
   8.7, 9.0, 9.1, 9.5, 8.6, 8.9, 9.1, 8.6,
   8.2, 9.1, 8.8, 9.2, 9.1, 8.9, 8.4, 9.0,
   9.8, 9.8, 8.7, 8.8, 9.1, 9.5, 9.5, 8.7
   The average OS boot time was: ~9.0s

b) Measured 48 times w/ v2 patch (in seconds):
   7.7, 8.6, 8.1, 7.8, 8.2, 8.2, 8.8, 8.2,
   9.8, 8.0, 9.2, 8.8, 9.2, 8.5, 8.4, 9.2,
   8.5, 8.3, 8.1, 8.3, 8.6, 7.9, 8.3, 8.3,
   8.6, 8.9, 8.0, 8.5, 8.4, 8.6, 8.7, 8.0,
   8.8, 8.8, 9.1, 7.9, 9.7, 7.9, 8.2, 7.8,
   8.1, 8.5, 8.6, 8.4, 9.2, 8.6, 9.6, 8.3
   The average OS boot time was: ~8.5s

@Joel Fernandes (Google), you may replace my old data with the above
new data in your commit message.

> But we can apply the binomial distribution instead of the usual normal
> distribution. First, let's sort and take the medians:
>
> a: 8.2 8.3 8.4 8.6 8.7 8.7 8.8 8.8 9.0 9.3  Median: 8.7
> b: 7.6 7.8 8.2 8.2 8.2 8.2 8.4 8.5 8.7 9.3  Median: 8.2
>
> 8/10 of a's data points are greater than 0.1 more than b's median and 8/10
> of b's data points are less than 0.1 less than a's median.
> What are the odds that this happens by random chance?
>
> This is given by sum_0^2 (0.5^10 * binomial(10,i)), which is about 0.055.

What's the meaning of 0.5 here? Was it the probability (we assume?)
that each time b's data point failed (or didn't satisfy) "less than 0.1
less than a's median"?

> This is not quite 95% confidence, so not hugely convincing, but it is at least
> close. Note that this is the confidence that (b) is 100ms faster than (a), not
> just that (b) is faster than (a).
>
> Not sure that this really carries its weight, but in contrast to the usual
> statistics based on the normal distribution, it does suggest at least a little
> improvement.
> On the other hand, anyone who has carefully studied
> nonparametric statistics probably jumped out of the boat several paragraphs
> ago. ;-)
>
> A few more questions interspersed below.
>
> 							Thanx, Paul
>
> > Tested-by: Qiuxu Zhuo <qiuxu.zhuo@intel.com>
> > Signed-off-by: Joel Fernandes (Google) <joel@joelfernandes.org>
> From: Joel Fernandes <joel@joelfernandes.org>
> [...]
> > I think you need to apply above patch. I am not sure maybe Paul has
> > already mentioned about it. But just in case.
>
> Ah, the reason my testing did not catch it is because for HZ=1000, msecs and
> jiffies are the same.

So was my system :-)

CONFIG_HZ_1000=y
CONFIG_HZ=1000

-Qiuxu

> Great eyes and thank you Vlad, I’ll make the fix and repost it.
>
> - Joel
On Mon, Mar 06, 2023 at 08:24:44AM +0000, Zhuo, Qiuxu wrote:
> > From: Paul E. McKenney <paulmck@kernel.org>
> > [...]
>
> [ . . . ]
>
> Thank you so much Paul for the detailed comments on the measured data.
>
> I'm curious how you figured out the number 24 that we at *least* need.
> This can guide me on whether the number of samples is enough for
> future testing ;-).
It is a rough rule of thumb. For more details and accuracy, study up on the Student's t-test and related statistical tests. Of course, this all assumes that the data fits a normal distribution. > I did another 48 measurements (2x of 24) for each case > (w/o and w/ Joel's v2 patch) as below. > All the testing configurations for the new testing > are the same as before. > > a) Measured 48 times w/o v2 patch (in seconds): > 8.4, 8.8, 9.2, 9.0, 8.3, 9.6, 8.8, 9.4, > 8.7, 9.2, 8.3, 9.4, 8.4, 9.6, 8.5, 8.8, > 8.8, 8.9, 9.3, 9.2, 8.6, 9.7, 9.2, 8.8, > 8.7, 9.0, 9.1, 9.5, 8.6, 8.9, 9.1, 8.6, > 8.2, 9.1, 8.8, 9.2, 9.1, 8.9, 8.4, 9.0, > 9.8, 9.8, 8.7, 8.8, 9.1, 9.5, 9.5, 8.7 > The average OS boot time was: ~9.0s The range is 8.2 through 9.8. > b) Measure 48 times w/ v2 patch (in seconds): > 7.7, 8.6, 8.1, 7.8, 8.2, 8.2, 8.8, 8.2, > 9.8, 8.0, 9.2, 8.8, 9.2, 8.5, 8.4, 9.2, > 8.5, 8.3, 8.1, 8.3, 8.6, 7.9, 8.3, 8.3, > 8.6, 8.9, 8.0, 8.5, 8.4, 8.6, 8.7, 8.0, > 8.8, 8.8, 9.1, 7.9, 9.7, 7.9, 8.2, 7.8, > 8.1, 8.5, 8.6, 8.4, 9.2, 8.6, 9.6, 8.3, > The average OS boot time was: ~8.5s The range is 7.7 through 9.8. There is again significant overlap, so it is again unclear that you have a statistically significant difference. So could you please calculate the standard deviations? > @Joel Fernandes (Google), you may replace my old data with the above > new data in your commit message. > > > But we can apply the binomial distribution instead of the usual normal > > distribution. First, let's sort and take the medians: > > > > a: 8.2 8.3 8.4 8.6 8.7 8.7 8.8 8.8 9.0 9.3 Median: 8.7 > > b: 7.6 7.8 8.2 8.2 8.2 8.2 8.4 8.5 8.7 9.3 Median: 8.2 > > > > 8/10 of a's data points are greater than 0.1 more than b's median and 8/10 > > of b's data points are less than 0.1 less than a's median. > > What are the odds that this happens by random chance? > > > > This is given by sum_0^2 (0.5^10 * binomial(10,i)), which is about 0.055. > > What's the meaning of 0.5 here? Was it the probability (we assume?) 
that > each time b's data point failed (or didn't satisfy) "less than 0.1 less than > a's median"? The meaning of 0.5 is the probability of a given data point being on one side or the other of the corresponding distribution's median. This of course assumes that the median of the measured data matches that of the corresponding distribution, though the fact that the median is also a mode of both of the old data sets gives some hope. The meaning of the 0.1 is the smallest difference that the data could measure. I could have instead chosen 0.0 and asked if there was likely some (perhaps tiny) difference, but instead, I chose to ask if there was likely some small but meaningful difference. It is better to choose the desired difference before measuring the data. Why don't you try applying this approach to the new data? You will need the general binomial formula. Thanx, Paul > > This is not quite 95% confidence, so not hugely convincing, but it is at least > > close. Not that this is the confidence that (b) is 100ms faster than (a), not > > just that (b) is faster than (a). > > > > Not sure that this really carries its weight, but in contrast to the usual > > statistics based on the normal distribution, it does suggest at least a little > > improvement. On the other hand, anyone who has carefully studied > > nonparametric statistics probably jumped out of the boat several paragraphs > > ago. ;-) > > > > A few more questions interspersed below. > > > > Thanx, Paul > > > > > Tested-by: Qiuxu Zhuo <qiuxu.zhuo@intel.com> > > > Signed-off-by: Joel Fernandes (Google) <joel@joelfernandes.org> >
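Paul's "about 0.055" tail probability can be checked mechanically. Below is a sketch (ours, not from the thread): the ten samples are stored in tenths of a second so the 0.1s margin compares exactly in integer arithmetic.

```python
from math import comb

# Original ten boot-time samples from the commit message, in tenths
# of a second (8.7s -> 87) so the 0.1s margin is exact.
a = [87, 84, 86, 82, 90, 87, 88, 93, 88, 83]  # w/o patch
b = [85, 82, 76, 82, 87, 82, 78, 82, 93, 84]  # w/ patch

def median(xs):
    s = sorted(xs)
    mid = len(s) // 2
    return (s[mid - 1] + s[mid]) / 2 if len(s) % 2 == 0 else s[mid]

med_a, med_b = median(a), median(b)   # 87 and 82, i.e. 8.7s and 8.2s

# Sign test: 8/10 of a's points exceed b's median by more than the
# 0.1s margin, so 2 points "fail". Under the null hypothesis each
# point falls on either side of the median with p = 0.5, giving the
# tail probability P(X <= 2) for X ~ Binomial(10, 0.5).
failures = sum(1 for x in a if x <= med_b + 1)
p_tail = sum(comb(10, i) * 0.5**10 for i in range(failures + 1))
print(round(p_tail, 4))   # 0.0547, Paul's "about 0.055"
```

The same sign-test machinery applies unchanged to the later 48-sample data, only with n = 48 in the binomial.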
> From: Paul E. McKenney <paulmck@kernel.org> > [...] > > > > Thank you so much Paul for the detailed comments on the measured data. > > > > I'm curious how did you figure out the number 24 that we at *least* need. > > This can guide me on whether the number of samples is enough for > > future testing ;-). > > It is a rough rule of thumb. For more details and accuracy, study up on the > Student's t-test and related statistical tests. > > Of course, this all assumes that the data fits a normal distribution. Thanks for this extra information. Good to know the Student's t-test. > > I did another 48 measurements (2x of 24) for each case (w/o and w/ > > Joel's v2 patch) as below. > > All the testing configurations for the new testing are the same as > > before. > > > > a) Measured 48 times w/o v2 patch (in seconds): > > 8.4, 8.8, 9.2, 9.0, 8.3, 9.6, 8.8, 9.4, > > 8.7, 9.2, 8.3, 9.4, 8.4, 9.6, 8.5, 8.8, > > 8.8, 8.9, 9.3, 9.2, 8.6, 9.7, 9.2, 8.8, > > 8.7, 9.0, 9.1, 9.5, 8.6, 8.9, 9.1, 8.6, > > 8.2, 9.1, 8.8, 9.2, 9.1, 8.9, 8.4, 9.0, > > 9.8, 9.8, 8.7, 8.8, 9.1, 9.5, 9.5, 8.7 > > The average OS boot time was: ~9.0s > > The range is 8.2 through 9.8. > > > b) Measure 48 times w/ v2 patch (in seconds): > > 7.7, 8.6, 8.1, 7.8, 8.2, 8.2, 8.8, 8.2, > > 9.8, 8.0, 9.2, 8.8, 9.2, 8.5, 8.4, 9.2, > > 8.5, 8.3, 8.1, 8.3, 8.6, 7.9, 8.3, 8.3, > > 8.6, 8.9, 8.0, 8.5, 8.4, 8.6, 8.7, 8.0, > > 8.8, 8.8, 9.1, 7.9, 9.7, 7.9, 8.2, 7.8, > > 8.1, 8.5, 8.6, 8.4, 9.2, 8.6, 9.6, 8.3, > > The average OS boot time was: ~8.5s > > The range is 7.7 through 9.8. > > There is again significant overlap, so it is again unclear that you have a > statistically significant difference. So could you please calculate the standard > deviations? a's standard deviation is ~0.4. b's standard deviation is ~0.5. a's average of 9.0 is at the upper bound of b's one-standard-deviation range [8.0, 9.0]. So, the measurements should be statistically significant to some degree. 
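The quoted ~0.4 and ~0.5 standard deviations can also be reproduced offline with Python's statistics module rather than an online calculator — a quick sketch over the 48-sample data quoted above:

```python
import statistics

# The 48 boot-time samples (seconds) quoted above.
a = [8.4, 8.8, 9.2, 9.0, 8.3, 9.6, 8.8, 9.4,
     8.7, 9.2, 8.3, 9.4, 8.4, 9.6, 8.5, 8.8,
     8.8, 8.9, 9.3, 9.2, 8.6, 9.7, 9.2, 8.8,
     8.7, 9.0, 9.1, 9.5, 8.6, 8.9, 9.1, 8.6,
     8.2, 9.1, 8.8, 9.2, 9.1, 8.9, 8.4, 9.0,
     9.8, 9.8, 8.7, 8.8, 9.1, 9.5, 9.5, 8.7]  # w/o v2 patch
b = [7.7, 8.6, 8.1, 7.8, 8.2, 8.2, 8.8, 8.2,
     9.8, 8.0, 9.2, 8.8, 9.2, 8.5, 8.4, 9.2,
     8.5, 8.3, 8.1, 8.3, 8.6, 7.9, 8.3, 8.3,
     8.6, 8.9, 8.0, 8.5, 8.4, 8.6, 8.7, 8.0,
     8.8, 8.8, 9.1, 7.9, 9.7, 7.9, 8.2, 7.8,
     8.1, 8.5, 8.6, 8.4, 9.2, 8.6, 9.6, 8.3]  # w/ v2 patch

# statistics.stdev() is the sample (n-1) standard deviation.
mean_a, sd_a = statistics.mean(a), statistics.stdev(a)
mean_b, sd_b = statistics.mean(b), statistics.stdev(b)
print(f"a: {mean_a:.1f} +/- {sd_a:.1f}s, b: {mean_b:.1f} +/- {sd_b:.1f}s")
# a: 9.0 +/- 0.4s, b: 8.5 +/- 0.5s
```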
The calculated standard deviations are via: https://www.gigacalculator.com/calculators/standard-deviation-calculator.php > > @Joel Fernandes (Google), you may replace my old data with the above > > new data in your commit message. > > > > > But we can apply the binomial distribution instead of the usual > > > normal distribution. First, let's sort and take the medians: > > > > > > a: 8.2 8.3 8.4 8.6 8.7 8.7 8.8 8.8 9.0 9.3 Median: 8.7 > > > b: 7.6 7.8 8.2 8.2 8.2 8.2 8.4 8.5 8.7 9.3 Median: 8.2 > > > > > > 8/10 of a's data points are greater than 0.1 more than b's median > > > and 8/10 of b's data points are less than 0.1 less than a's median. > > > What are the odds that this happens by random chance? > > > > > > This is given by sum_0^2 (0.5^10 * binomial(10,i)), which is about 0.055. > > > > What's the meaning of 0.5 here? Was it the probability (we assume?) > > that each time b's data point failed (or didn't satisfy) "less than > > 0.1 less than a's median"? > > The meaning of 0.5 is the probability of a given data point being on one side > or the other of the corresponding distribution's median. This of course > assumes that the median of the measured data matches that of the > corresponding distribution, though the fact that the median is also a mode of > both of the old data sets gives some hope. Thanks for the detailed comments on the meaning of 0.5 here. :-) > The meaning of the 0.1 is the smallest difference that the data could measure. > I could have instead chosen 0.0 and asked if there was likely some (perhaps > tiny) difference, but instead, I chose to ask if there was likely some small but > meaningful difference. It is better to choose the desired difference before > measuring the data. Thanks for the detailed comments on the meaning of 0.1 here. :-) > Why don't you try applying this approach to the new data? You will need the > general binomial formula. Thank you Paul for the suggestion. 
I just tried it, but I am not sure whether my analysis was correct ... Analysis 1: a's median is 8.9. 35/48 of b's data points are less than 0.1 less than a's median. For a binomial distribution with p=0.5, P(X >= 35) = 0.1%. So, we have strong confidence that b is 100ms faster than a. Analysis 2: a's median - 0.4 = 8.9 - 0.4 = 8.5. 24/48 of b's data points are less than 0.4 less than a's median. The probability that one of a's data points is less than 8.5 is p = 7/48 = 0.1458. For a binomial distribution with p=0.1458, P(X >= 24) = 0.0%. So, it looks like we have confidence that b is 400ms faster than a. The cumulative binomial probabilities P(X) were calculated via: https://www.gigacalculator.com/calculators/binomial-probability-calculator.php I apologize if this analysis/discussion bored some of you. ;-) -Qiuxu > [...]
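Analysis 1 above can be reproduced without an online calculator. A Python sketch (ours) of the same cumulative binomial sign test, with the 48 "w/ v2 patch" samples in tenths of a second so the 0.1s margin is exact:

```python
from math import comb

# The 48 "w/ v2 patch" samples quoted earlier, in tenths of a second.
b = [77, 86, 81, 78, 82, 82, 88, 82,
     98, 80, 92, 88, 92, 85, 84, 92,
     85, 83, 81, 83, 86, 79, 83, 83,
     86, 89, 80, 85, 84, 86, 87, 80,
     88, 88, 91, 79, 97, 79, 82, 78,
     81, 85, 86, 84, 92, 86, 96, 83]

med_a = 89     # a's median, 8.9s, as computed in the thread
margin = 1     # the 0.1s margin

# 35 of the 48 points lie more than 0.1s below a's median.
hits = sum(1 for x in b if x < med_a - margin)

# P(X >= hits) for X ~ Binomial(48, 0.5): the chance of seeing at
# least this many points clear the margin by luck alone.
n = 48
p_tail = sum(comb(n, k) * 0.5**n for k in range(hits, n + 1))
print(hits, p_tail)   # 35 hits, p ~ 0.001, i.e. the thread's "0.1%"
```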
On Fri, Mar 03, 2023 at 09:38:51PM +0000, Joel Fernandes (Google) wrote: > On many systems, a great deal of boot (in userspace) happens after the > kernel thinks the boot has completed. It is difficult to determine if > the system has really booted from the kernel side. Some features like > lazy-RCU can risk slowing down boot time if, say, a callback has been > added that the boot synchronously depends on. Further expedited callbacks > can get unexpedited way earlier than it should be, thus slowing down > boot (as shown in the data below). > > For these reasons, this commit adds a config option > 'CONFIG_RCU_BOOT_END_DELAY' and a boot parameter rcupdate.boot_end_delay. > Userspace can also make RCU's view of the system as booted, by writing the > time in milliseconds to: /sys/module/rcupdate/parameters/rcu_boot_end_delay > Or even just writing a value of 0 to this sysfs node. > However, under no circumstance will the boot be allowed to end earlier > than just before init is launched. > > The default value of CONFIG_RCU_BOOT_END_DELAY is chosen as 15s. This > suites ChromeOS and also a PREEMPT_RT system below very well, which need > no config or parameter changes, and just a simple application of this patch. A > system designer can also choose a specific value here to keep RCU from marking > boot completion. As noted earlier, RCU's perspective of the system as booted > will not be marker until at least rcu_boot_end_delay milliseconds have passed > or an update is made via writing a small value (or 0) in milliseconds to: > /sys/module/rcupdate/parameters/rcu_boot_end_delay. > > One side-effect of this patch is, there is a risk that a real-time workload > launched just after the kernel boots will suffer interruptions due to expedited > RCU, which previous ended just before init was launched. 
However, to mitigate > such an issue (however unlikely), the user should either tune > CONFIG_RCU_BOOT_END_DELAY to a smaller value than 15 seconds or write a value > of 0 to /sys/module/rcupdate/parameters/rcu_boot_end_delay, once userspace > boots, and before launching the real-time workload. > > Qiuxu also noted impressive boot-time improvements with earlier version > of patch. An excerpt from the data he shared: > > 1) Testing environment: > OS : CentOS Stream 8 (non-RT OS) > Kernel : v6.2 > Machine : Intel Cascade Lake server (2 sockets, each with 44 logical threads) > Qemu args : -cpu host -enable-kvm, -smp 88,threads=2,sockets=2, … > > 2) OS boot time definition: > The time from the start of the kernel boot to the shell command line > prompt is shown from the console. [ Different people may have > different OS boot time definitions. ] > > 3) Measurement method (very rough method): > A timer in the kernel periodically prints the boot time every 100ms. > As soon as the shell command line prompt is shown from the console, > we record the boot time printed by the timer, then the printed boot > time is the OS boot time. > > 4) Measured OS boot time (in seconds) > a) Measured 10 times w/o this patch: > 8.7s, 8.4s, 8.6s, 8.2s, 9.0s, 8.7s, 8.8s, 9.3s, 8.8s, 8.3s > The average OS boot time was: ~8.7s > > b) Measure 10 times w/ this patch: > 8.5s, 8.2s, 7.6s, 8.2s, 8.7s, 8.2s, 7.8s, 8.2s, 9.3s, 8.4s > The average OS boot time was: ~8.3s. > > Tested-by: Qiuxu Zhuo <qiuxu.zhuo@intel.com> > Signed-off-by: Joel Fernandes (Google) <joel@joelfernandes.org> I still don't really like that: 1) It feels like we are curing a symptom for which we don't know the cause. Which RCU write side caller is the source of this slow boot? Some tracepoints reporting the wait duration within synchronize_rcu() calls between the end of the kernel boot and the end of userspace boot may be helpful. 
2) The kernel boot was already covered before this patch so this is about userspace code calling into the kernel. Is that piece of code also called after the boot? In that case are we missing a conversion from synchronize_rcu() to synchronize_rcu_expedited() somewhere? Because then the problem is more general than just boot. This needs to be analyzed first and if it happens that the issue really needs to be fixed with telling the kernel that userspace has completed booting, eg: because the problem is not in a few callsites that need conversion to expedited but instead in the accumulation of lots of calls that should stay as is: 3) This arbitrary timeout looks dangerous to me as latency sensitive code may run right after the boot. Either you choose a value that is too low and you miss the optimization or the value is too high and you may break things. 4) This should be fixed the way you did: a) a kernel parameter like you did b) The init process (systemd?) tells the kernel when it judges that userspace has completed booting. c) Make these interfaces more generic, maybe that information will be useful outside RCU. For example the kernel parameter should be "user_booted_reported" and the sysfs (should be sysctl?): kernel.user_booted = 1 d) But yuck, this means we must know if the init process supports that... For these reasons, let's make sure we know exactly what is going on first. Thanks.
On Tue, Mar 07, 2023 at 02:01:54PM +0100, Frederic Weisbecker wrote: > On Fri, Mar 03, 2023 at 09:38:51PM +0000, Joel Fernandes (Google) wrote: > > On many systems, a great deal of boot (in userspace) happens after the > > kernel thinks the boot has completed. It is difficult to determine if > > the system has really booted from the kernel side. Some features like > > lazy-RCU can risk slowing down boot time if, say, a callback has been > > added that the boot synchronously depends on. Further expedited callbacks > > can get unexpedited way earlier than it should be, thus slowing down > > boot (as shown in the data below). > > > > For these reasons, this commit adds a config option > > 'CONFIG_RCU_BOOT_END_DELAY' and a boot parameter rcupdate.boot_end_delay. > > Userspace can also make RCU's view of the system as booted, by writing the > > time in milliseconds to: /sys/module/rcupdate/parameters/rcu_boot_end_delay > > Or even just writing a value of 0 to this sysfs node. > > However, under no circumstance will the boot be allowed to end earlier > > than just before init is launched. > > > > The default value of CONFIG_RCU_BOOT_END_DELAY is chosen as 15s. This > > suites ChromeOS and also a PREEMPT_RT system below very well, which need > > no config or parameter changes, and just a simple application of this patch. A > > system designer can also choose a specific value here to keep RCU from marking > > boot completion. As noted earlier, RCU's perspective of the system as booted > > will not be marker until at least rcu_boot_end_delay milliseconds have passed > > or an update is made via writing a small value (or 0) in milliseconds to: > > /sys/module/rcupdate/parameters/rcu_boot_end_delay. > > > > One side-effect of this patch is, there is a risk that a real-time workload > > launched just after the kernel boots will suffer interruptions due to expedited > > RCU, which previous ended just before init was launched. 
However, to mitigate > > such an issue (however unlikely), the user should either tune > > CONFIG_RCU_BOOT_END_DELAY to a smaller value than 15 seconds or write a value > > of 0 to /sys/module/rcupdate/parameters/rcu_boot_end_delay, once userspace > > boots, and before launching the real-time workload. > > > > Qiuxu also noted impressive boot-time improvements with earlier version > > of patch. An excerpt from the data he shared: > > > > 1) Testing environment: > > OS : CentOS Stream 8 (non-RT OS) > > Kernel : v6.2 > > Machine : Intel Cascade Lake server (2 sockets, each with 44 logical threads) > > Qemu args : -cpu host -enable-kvm, -smp 88,threads=2,sockets=2, … > > > > 2) OS boot time definition: > > The time from the start of the kernel boot to the shell command line > > prompt is shown from the console. [ Different people may have > > different OS boot time definitions. ] > > > > 3) Measurement method (very rough method): > > A timer in the kernel periodically prints the boot time every 100ms. > > As soon as the shell command line prompt is shown from the console, > > we record the boot time printed by the timer, then the printed boot > > time is the OS boot time. > > > > 4) Measured OS boot time (in seconds) > > a) Measured 10 times w/o this patch: > > 8.7s, 8.4s, 8.6s, 8.2s, 9.0s, 8.7s, 8.8s, 9.3s, 8.8s, 8.3s > > The average OS boot time was: ~8.7s > > > > b) Measure 10 times w/ this patch: > > 8.5s, 8.2s, 7.6s, 8.2s, 8.7s, 8.2s, 7.8s, 8.2s, 9.3s, 8.4s > > The average OS boot time was: ~8.3s. > > > > Tested-by: Qiuxu Zhuo <qiuxu.zhuo@intel.com> > > Signed-off-by: Joel Fernandes (Google) <joel@joelfernandes.org> > > I still don't really like that: > > 1) It feels like we are curing a symptom for which we don't know the cause. > Which RCU write side caller is the source of this slow boot? Some tracepoints > reporting the wait duration within synchronize_rcu() calls between the end of > the kernel boot and the end of userspace boot may be helpful. 
> > 2) The kernel boot was already covered before this patch so this is about > userspace code calling into the kernel. Is that piece of code also called > after the boot? In that case are we missing a conversion from > synchronize_rcu() to synchronize_rcu_expedited() somewhere? Because then > the problem is more general than just boot. > > This needs to be analyzed first and if it happens that the issue really > needs to be fixed with telling the kernel that userspace has completed > booting, eg: because the problem is not in a few callsites that need conversion > to expedited but instead in the accumulation of lots of calls that should stay > as is: > > 3) This arbitrary timeout looks dangerous to me as latency sensitive code > may run right after the boot. Either you choose a value that is too low > and you miss the optimization or the value is too high and you may break > things. > > 4) This should be fixed the way you did: > a) a kernel parameter like you did > b) The init process (systemd?) tells the kernel when it judges that userspace > has completed booting. > c) Make these interfaces more generic, maybe that information will be useful > outside RCU. For example the kernel parameter should be > "user_booted_reported" and the sysfs (should be sysctl?): > kernel.user_booted = 1 > d) But yuck, this means we must know if the init process supports that... > > For these reasons, let's make sure we know exactly what is going on first. > > Thanks. Just add some notes and thoughts. There is a rcupdate.rcu_expedited=1 parameter that can be used during the boot. 
For example on our devices to speedup a boot we boot the kernel with rcu_expedited: XQ-DQ54:/ # cat /proc/cmdline stack_depot_disable=on kasan.stacktrace=off kvm-arm.mode=protected cgroup_disable=pressure console=ttyMSM0,115200n8 loglevel=6 kpti=0 log_buf_len=256K kernel.panic_on_rcu_stall=1 service_locator.enable=1 msm_rtb.filter=0x237 rcupdate.rcu_expedited=1 rcu_nocbs=0-7 ftrace_dump_on_oops swiotlb=noforce loop.max_part=7 fw_devlink.strict=1 allow_mismatched_32bit_el0 cpufreq.default_governor=performance printk.console_no_auto_verbose=1 kasan=off sysctl.kernel.sched_pelt_multiplier=4 can.stats_timer=0 pcie_ports=compat irqaffinity=0-2 disable_dma32=on no-steal-acc cgroup.memory=nokmem,nosocket video=vfb:640x400,bpp=32,memsize=3072000 page_owner=on stack_depot_disable=off printk.console_no_auto_verbose=0 nosoftlockup bootconfig buildvariant=userdebug msm_drm.dsi_display0=somc,1_panel: rootwait ro init=/init qcom_geni_serial.con_enabled=0 oembootloader.startup=0x00000001 oembootloader.warmboot=0x00000000 oembootloader.securityflags=0x00000001 XQ-DQ54:/ # then a user space can decides if it is needed or not: <snip> rcu_expedited rcu_normal XQ-DQ54:/ # ls -al /sys/kernel/rcu_* -rw-r--r-- 1 root root 4096 2023-02-16 09:27 /sys/kernel/rcu_expedited -rw-r--r-- 1 root root 4096 2023-02-16 09:27 /sys/kernel/rcu_normal XQ-DQ54:/ # <snip> for lazy we can add "rcu_cb_lazy" parameter and boot the kernel with true or false. So we can follow and be aligned with rcu_expedited and rcu_normal parameters. -- Uladzislau Rezki
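The hand-off described above — boot with rcupdate.rcu_expedited=1, then let userspace decide — can be scripted. A minimal sketch (the helper name is ours, not from the thread; writing these files requires root on a real system):

```python
# Sketch of the userspace side of the hand-off: once init decides boot
# is complete, flip RCU back to normal grace periods via the sysfs
# knobs shown above.
def write_rcu_knob(path: str, value: int) -> None:
    """Write one of the /sys/kernel/rcu_* knobs."""
    with open(path, "w") as f:
        f.write(f"{value}\n")

# Once userspace considers boot complete:
# write_rcu_knob("/sys/kernel/rcu_expedited", 0)
# write_rcu_knob("/sys/kernel/rcu_normal", 1)
```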
On Tue, Mar 7, 2023 at 8:01 AM Frederic Weisbecker <frederic@kernel.org> wrote: > > On Fri, Mar 03, 2023 at 09:38:51PM +0000, Joel Fernandes (Google) wrote: > > On many systems, a great deal of boot (in userspace) happens after the > > kernel thinks the boot has completed. It is difficult to determine if > > the system has really booted from the kernel side. Some features like > > lazy-RCU can risk slowing down boot time if, say, a callback has been > > added that the boot synchronously depends on. Further expedited callbacks > > can get unexpedited way earlier than it should be, thus slowing down > > boot (as shown in the data below). > > > > For these reasons, this commit adds a config option > > 'CONFIG_RCU_BOOT_END_DELAY' and a boot parameter rcupdate.boot_end_delay. > > Userspace can also make RCU's view of the system as booted, by writing the > > time in milliseconds to: /sys/module/rcupdate/parameters/rcu_boot_end_delay > > Or even just writing a value of 0 to this sysfs node. > > However, under no circumstance will the boot be allowed to end earlier > > than just before init is launched. > > > > The default value of CONFIG_RCU_BOOT_END_DELAY is chosen as 15s. This > > suites ChromeOS and also a PREEMPT_RT system below very well, which need > > no config or parameter changes, and just a simple application of this patch. A > > system designer can also choose a specific value here to keep RCU from marking > > boot completion. As noted earlier, RCU's perspective of the system as booted > > will not be marker until at least rcu_boot_end_delay milliseconds have passed > > or an update is made via writing a small value (or 0) in milliseconds to: > > /sys/module/rcupdate/parameters/rcu_boot_end_delay. > > > > One side-effect of this patch is, there is a risk that a real-time workload > > launched just after the kernel boots will suffer interruptions due to expedited > > RCU, which previous ended just before init was launched. 
However, to mitigate > > such an issue (however unlikely), the user should either tune > > CONFIG_RCU_BOOT_END_DELAY to a smaller value than 15 seconds or write a value > > of 0 to /sys/module/rcupdate/parameters/rcu_boot_end_delay, once userspace > > boots, and before launching the real-time workload. > > > > Qiuxu also noted impressive boot-time improvements with earlier version > > of patch. An excerpt from the data he shared: > > > > 1) Testing environment: > > OS : CentOS Stream 8 (non-RT OS) > > Kernel : v6.2 > > Machine : Intel Cascade Lake server (2 sockets, each with 44 logical threads) > > Qemu args : -cpu host -enable-kvm, -smp 88,threads=2,sockets=2, … > > > > 2) OS boot time definition: > > The time from the start of the kernel boot to the shell command line > > prompt is shown from the console. [ Different people may have > > different OS boot time definitions. ] > > > > 3) Measurement method (very rough method): > > A timer in the kernel periodically prints the boot time every 100ms. > > As soon as the shell command line prompt is shown from the console, > > we record the boot time printed by the timer, then the printed boot > > time is the OS boot time. > > > > 4) Measured OS boot time (in seconds) > > a) Measured 10 times w/o this patch: > > 8.7s, 8.4s, 8.6s, 8.2s, 9.0s, 8.7s, 8.8s, 9.3s, 8.8s, 8.3s > > The average OS boot time was: ~8.7s > > > > b) Measure 10 times w/ this patch: > > 8.5s, 8.2s, 7.6s, 8.2s, 8.7s, 8.2s, 7.8s, 8.2s, 9.3s, 8.4s > > The average OS boot time was: ~8.3s. > > > > Tested-by: Qiuxu Zhuo <qiuxu.zhuo@intel.com> > > Signed-off-by: Joel Fernandes (Google) <joel@joelfernandes.org> > > I still don't really like that: > > 1) It feels like we are curing a symptom for which we don't know the cause. > Which RCU write side caller is the source of this slow boot? Some tracepoints > reporting the wait duration within synchronize_rcu() calls between the end of > the kernel boot and the end of userspace boot may be helpful. 
Just to clarify (and I feel we discussed this recently) -- there is no callback I am aware of right now causing a slow boot. The reason for doing this is to ensure we don't have such issues in the future; it is a protection. Note the repeated call-outs to the scsi callback and also the rcu_barrier() issue previously fixed. Further, we already see slight improvements in boot times with disabling lazy during boot (it's not much but it's there). Yes, we should fix issues instead of hiding them - but we also would like to improve the user experience -- just like we disable lazy and expedited during suspend. So what is the problem that you really have with this patch even with data showing improvements? I actually wanted a mechanism like this from the beginning and was trying to get Intel to write the patch, but I ended up writing it. > 2) The kernel boot was already covered before this patch so this is about > userspace code calling into the kernel. Is that piece of code also called > after the boot? In that case are we missing a conversion from > synchronize_rcu() to synchronize_rcu_expedited() somewhere? Because then > the problem is more general than just boot. > > This needs to be analyzed first and if it happens that the issue really > needs to be fixed with telling the kernel that userspace has completed > booting, eg: because the problem is not in a few callsites that need conversion > to expedited but instead in the accumulation of lots of calls that should stay > as is: There is no such callback I am aware of that needs such a conversion, and I don't think that will help give any guarantees because there is nothing preventing someone from adding a callback that synchronously slows boot. The approach here is to put a protection. However, I will do some more investigations into what else may be slowing things as I do hold a lot of weight for your words! :) > > 3) This arbitrary timeout looks dangerous to me as latency sensitive code > may run right after the boot. 
Either you choose a value that is too low > and you miss the optimization or the value is too high and you may break > things. So someone is presenting a timing sensitive workload within 15 seconds of boot? Please provide some evidence of that. The only evidence right now is on the plus side even for the RT system. > 4) This should be fixed the way you did: > a) a kernel parameter like you did > b) The init process (systemd?) tells the kernel when it judges that userspace > has completed booting. > c) Make these interfaces more generic, maybe that information will be useful > outside RCU. For example the kernel parameter should be > "user_booted_reported" and the sysfs (should be sysctl?): > kernel.user_booted = 1 > d) But yuck, this means we must know if the init process supports that... > > For these reasons, let's make sure we know exactly what is going on first. I can investigate this more and get back to you. One of the challenges is getting boot tracing working properly. Systems do weird things like turning off tracing during boot and/or clearing trace buffers. - Joel
On Tue, Mar 7, 2023 at 8:40 AM Uladzislau Rezki <urezki@gmail.com> wrote: > > On Tue, Mar 07, 2023 at 02:01:54PM +0100, Frederic Weisbecker wrote: > > On Fri, Mar 03, 2023 at 09:38:51PM +0000, Joel Fernandes (Google) wrote: > > > On many systems, a great deal of boot (in userspace) happens after the > > > kernel thinks the boot has completed. It is difficult to determine if > > > the system has really booted from the kernel side. Some features like > > > lazy-RCU can risk slowing down boot time if, say, a callback has been > > > added that the boot synchronously depends on. Further expedited callbacks > > > can get unexpedited way earlier than it should be, thus slowing down > > > boot (as shown in the data below). > > > > > > For these reasons, this commit adds a config option > > > 'CONFIG_RCU_BOOT_END_DELAY' and a boot parameter rcupdate.boot_end_delay. > > > Userspace can also make RCU's view of the system as booted, by writing the > > > time in milliseconds to: /sys/module/rcupdate/parameters/rcu_boot_end_delay > > > Or even just writing a value of 0 to this sysfs node. > > > However, under no circumstance will the boot be allowed to end earlier > > > than just before init is launched. > > > > > > The default value of CONFIG_RCU_BOOT_END_DELAY is chosen as 15s. This > > > suites ChromeOS and also a PREEMPT_RT system below very well, which need > > > no config or parameter changes, and just a simple application of this patch. A > > > system designer can also choose a specific value here to keep RCU from marking > > > boot completion. As noted earlier, RCU's perspective of the system as booted > > > will not be marker until at least rcu_boot_end_delay milliseconds have passed > > > or an update is made via writing a small value (or 0) in milliseconds to: > > > /sys/module/rcupdate/parameters/rcu_boot_end_delay. 
> > > > > > One side-effect of this patch is, there is a risk that a real-time workload > > > launched just after the kernel boots will suffer interruptions due to expedited > > > RCU, which previous ended just before init was launched. However, to mitigate > > > such an issue (however unlikely), the user should either tune > > > CONFIG_RCU_BOOT_END_DELAY to a smaller value than 15 seconds or write a value > > > of 0 to /sys/module/rcupdate/parameters/rcu_boot_end_delay, once userspace > > > boots, and before launching the real-time workload. > > > > > > Qiuxu also noted impressive boot-time improvements with earlier version > > > of patch. An excerpt from the data he shared: > > > > > > 1) Testing environment: > > > OS : CentOS Stream 8 (non-RT OS) > > > Kernel : v6.2 > > > Machine : Intel Cascade Lake server (2 sockets, each with 44 logical threads) > > > Qemu args : -cpu host -enable-kvm, -smp 88,threads=2,sockets=2, … > > > > > > 2) OS boot time definition: > > > The time from the start of the kernel boot to the shell command line > > > prompt is shown from the console. [ Different people may have > > > different OS boot time definitions. ] > > > > > > 3) Measurement method (very rough method): > > > A timer in the kernel periodically prints the boot time every 100ms. > > > As soon as the shell command line prompt is shown from the console, > > > we record the boot time printed by the timer, then the printed boot > > > time is the OS boot time. > > > > > > 4) Measured OS boot time (in seconds) > > > a) Measured 10 times w/o this patch: > > > 8.7s, 8.4s, 8.6s, 8.2s, 9.0s, 8.7s, 8.8s, 9.3s, 8.8s, 8.3s > > > The average OS boot time was: ~8.7s > > > > > > b) Measure 10 times w/ this patch: > > > 8.5s, 8.2s, 7.6s, 8.2s, 8.7s, 8.2s, 7.8s, 8.2s, 9.3s, 8.4s > > > The average OS boot time was: ~8.3s. 
> > > > > > Tested-by: Qiuxu Zhuo <qiuxu.zhuo@intel.com> > > > Signed-off-by: Joel Fernandes (Google) <joel@joelfernandes.org> > > > > I still don't really like that: > > > > 1) It feels like we are curing a symptom for which we don't know the cause. > > Which RCU write side caller is the source of this slow boot? Some tracepoints > > reporting the wait duration within synchronize_rcu() calls between the end of > > the kernel boot and the end of userspace boot may be helpful. > > > > 2) The kernel boot was already covered before this patch so this is about > > userspace code calling into the kernel. Is that piece of code also called > > after the boot? In that case are we missing a conversion from > > synchronize_rcu() to synchronize_rcu_expedited() somewhere? Because then > > the problem is more general than just boot. > > > > This needs to be analyzed first and if it happens that the issue really > > needs to be fixed with telling the kernel that userspace has completed > > booting, eg: because the problem is not in a few callsites that need conversion > > to expedited but instead in the accumulation of lots of calls that should stay > > as is: > > > > 3) This arbitrary timeout looks dangerous to me as latency sensitive code > > may run right after the boot. Either you choose a value that is too low > > and you miss the optimization or the value is too high and you may break > > things. > > > > 4) This should be fixed the way you did: > > a) a kernel parameter like you did > > b) The init process (systemd?) tells the kernel when it judges that userspace > > has completed booting. > > c) Make these interfaces more generic, maybe that information will be useful > > outside RCU. For example the kernel parameter should be > > "user_booted_reported" and the sysfs (should be sysctl?): > > kernel.user_booted = 1 > > d) But yuck, this means we must know if the init process supports that... > > > > For these reasons, let's make sure we know exactly what is going on first. 
> > > > Thanks. > Just to add some notes and thoughts. There is a rcupdate.rcu_expedited=1 > parameter that can be used during the boot. For example on our devices > to speed up boot we boot the kernel with rcu_expedited: > > XQ-DQ54:/ # cat /proc/cmdline > stack_depot_disable=on kasan.stacktrace=off kvm-arm.mode=protected cgroup_disable=pressure console=ttyMSM0,115200n8 loglevel=6 kpti=0 log_buf_len=256K kernel.panic_on_rcu_stall=1 service_locator.enable=1 msm_rtb.filter=0x237 rcupdate.rcu_expedited=1 rcu_nocbs=0-7 ftrace_dump_on_oops swiotlb=noforce loop.max_part=7 fw_devlink.strict=1 allow_mismatched_32bit_el0 cpufreq.default_governor=performance printk.console_no_auto_verbose=1 kasan=off sysctl.kernel.sched_pelt_multiplier=4 can.stats_timer=0 pcie_ports=compat irqaffinity=0-2 disable_dma32=on no-steal-acc cgroup.memory=nokmem,nosocket video=vfb:640x400,bpp=32,memsize=3072000 page_owner=on stack_depot_disable=off printk.console_no_auto_verbose=0 nosoftlockup bootconfig buildvariant=userdebug msm_drm.dsi_display0=somc,1_panel: rootwait ro init=/init qcom_geni_serial.con_enabled=0 oembootloader.startup=0x00000001 oembootloader.warmboot=0x00000000 oembootloader.securityflags=0x00000001 > XQ-DQ54:/ # > > then user space can decide whether it is needed or not: > > <snip> > rcu_expedited rcu_normal > XQ-DQ54:/ # ls -al /sys/kernel/rcu_* > -rw-r--r-- 1 root root 4096 2023-02-16 09:27 /sys/kernel/rcu_expedited > -rw-r--r-- 1 root root 4096 2023-02-16 09:27 /sys/kernel/rcu_normal > XQ-DQ54:/ # > <snip> > > for lazy we can add a "rcu_cb_lazy" parameter and boot the kernel with > true or false. So we can follow and be aligned with the rcu_expedited and > rcu_normal parameters. Speaking of aligning, there is also the automated rcu_normal_after_boot boot option, correct? I prefer the automated option of doing this. 
So the approach here is not really unprecedented and is much more robust than relying on userspace too much (I am ok with adding your suggestion *on top* of the automated toggle, but I probably would not have ChromeOS use it if the automated way exists). Or did I miss something? thanks, - Joel
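[Editor's note: for concreteness, the sequence Uladzislau describes can be sketched as a configuration fragment. This only restates the knobs already quoted above — the rcupdate.rcu_expedited=1 boot parameter and the writable /sys/kernel/rcu_expedited and /sys/kernel/rcu_normal nodes; the init-script placement is a hypothetical example, not something prescribed by the thread.]

```sh
# Kernel command line: force expedited grace periods for the whole boot.
#   ... rcupdate.rcu_expedited=1 ...

# Later, e.g. from an init script, once userspace decides boot has completed:
echo 0 > /sys/kernel/rcu_expedited   # stop forcing expedited grace periods
echo 1 > /sys/kernel/rcu_normal      # optionally force normal ones from now on
```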
On Tue, Mar 07, 2023 at 07:49:49AM +0000, Zhuo, Qiuxu wrote: > > From: Paul E. McKenney <paulmck@kernel.org> > > [...] > > > > > > Thank you so much Paul for the detailed comments on the measured data. > > > > > > I'm curious how you figured out the number 24 that we at *least* need. > > > This can guide me on whether the number of samples is enough for > > > future testing ;-). > > > > It is a rough rule of thumb. For more details and accuracy, study up on the > > Student's t-test and related statistical tests. > > > > Of course, this all assumes that the data fits a normal distribution. > > Thanks for this extra information. Good to know the Student's t-test. > > > > I did another 48 measurements (2x of 24) for each case (w/o and w/ > > > Joel's v2 patch) as below. > > > All the testing configurations for the new testing are the same as > > > before. > > > > > > a) Measured 48 times w/o v2 patch (in seconds): > > > 8.4, 8.8, 9.2, 9.0, 8.3, 9.6, 8.8, 9.4, > > > 8.7, 9.2, 8.3, 9.4, 8.4, 9.6, 8.5, 8.8, > > > 8.8, 8.9, 9.3, 9.2, 8.6, 9.7, 9.2, 8.8, > > > 8.7, 9.0, 9.1, 9.5, 8.6, 8.9, 9.1, 8.6, > > > 8.2, 9.1, 8.8, 9.2, 9.1, 8.9, 8.4, 9.0, > > > 9.8, 9.8, 8.7, 8.8, 9.1, 9.5, 9.5, 8.7 > > > The average OS boot time was: ~9.0s > > > > The range is 8.2 through 9.8. > > > > > b) Measured 48 times w/ v2 patch (in seconds): > > > 7.7, 8.6, 8.1, 7.8, 8.2, 8.2, 8.8, 8.2, > > > 9.8, 8.0, 9.2, 8.8, 9.2, 8.5, 8.4, 9.2, > > > 8.5, 8.3, 8.1, 8.3, 8.6, 7.9, 8.3, 8.3, > > > 8.6, 8.9, 8.0, 8.5, 8.4, 8.6, 8.7, 8.0, > > > 8.8, 8.8, 9.1, 7.9, 9.7, 7.9, 8.2, 7.8, > > > 8.1, 8.5, 8.6, 8.4, 9.2, 8.6, 9.6, 8.3 > > > The average OS boot time was: ~8.5s > > > > The range is 7.7 through 9.8. > > > > There is again significant overlap, so it is again unclear that you have a > > statistically significant difference. So could you please calculate the standard > > deviations? > > a's standard deviation is ~0.4. > b's standard deviation is ~0.5. 
> > a's average 9.0 is at the upper bound of b's one-standard-deviation interval [8.0, 9.0]. > So, the measurements should be statistically significant to some degree. That single standard deviation means that you have 68% confidence that the difference is real. This is not far above the 50% level of random noise. 95% is the lowest level that is normally considered to be statistically significant. > The calculated standard deviations are via: > https://www.gigacalculator.com/calculators/standard-deviation-calculator.php Fair enough. Formulas are readily available as well, and most spreadsheets support standard deviation. > > > @Joel Fernandes (Google), you may replace my old data with the above > > > new data in your commit message. > > > > > > > But we can apply the binomial distribution instead of the usual > > > > normal distribution. First, let's sort and take the medians: > > > > > > > > a: 8.2 8.3 8.4 8.6 8.7 8.7 8.8 8.8 9.0 9.3 Median: 8.7 > > > > b: 7.6 7.8 8.2 8.2 8.2 8.2 8.4 8.5 8.7 9.3 Median: 8.2 > > > > > > > > 8/10 of a's data points are greater than 0.1 more than b's median > > > > and 8/10 of b's data points are less than 0.1 less than a's median. > > > > What are the odds that this happens by random chance? > > > > > > > > This is given by sum_{i=0}^{2} (0.5^10 * binomial(10,i)), which is about 0.055. > > > > > > What's the meaning of 0.5 here? Was it the probability (we assume?) > > > that each time b's data point failed (or didn't satisfy) "less than > > > 0.1 less than a's median"? > > > > The meaning of 0.5 is the probability of a given data point being on one side > > or the other of the corresponding distribution's median. This of course > > assumes that the median of the measured data matches that of the > > corresponding distribution, though the fact that the median is also a mode of > > both of the old data sets gives some hope. > > Thanks for the detailed comments on the meaning of 0.5 here. 
:-) > > > The meaning of the 0.1 is the smallest difference that the data could measure. > > I could have instead chosen 0.0 and asked if there was likely some (perhaps > > tiny) difference, but instead, I chose to ask if there was likely some small but > > meaningful difference. It is better to choose the desired difference before > > measuring the data. > > Thanks for the detailed comments on the meaning of 0.1 here. :-) > > > Why don't you try applying this approach to the new data? You will need the > > general binomial formula. > > Thank you Paul for the suggestion. > I just tried it, but I am not sure whether my analysis was correct ... > > Analysis 1: > a's median is 8.9. I get 8.95, which is the average of the 24th and 25th members of a in numerical order. > 35/48 b's data points are less than 0.1 less than a's median. > For a's binomial distribution P(X >= 35) = 0.1%, where p=0.5. > So, we have strong confidence that b is 100ms faster than a. I of course get quite a bit stronger confidence, but your 99.9% is good enough. And I get even stronger confidence going in the other direction. However, the fact that a's median varies from 8.7 in the old experiment to 8.95 in this experiment does give some pause. These are after all supposedly drawn from the same distribution. Or did you use a different machine or different OS version or some such in the two sets of measurements? Different time of day and thus different ambient temperature, thus different CPU clock frequency? Assuming identical test setups, let's try the old value of 8.7 from old a to new b. There are 14 elements in new b greater than 8.6, for a probability of 0.17%, or about 98.3% significance. This is still OK. In contrast, the median of the old b is 8.2, which gives extreme confidence. So let's be conservative and use the large-set median. In real life, additional procedures would be needed to estimate the confidence in the median, which turns out to be nontrivial. 
When I apply this sort of technique, I usually have all data from each sample being on one side of the median of the other, which simplifies things. ;-) The easiest way to estimate bounds on the median is to "bootstrap", but that works best if you have 1000 samples and can randomly draw 1000 sub-samples each of size 10 from the larger sample and compute the median of each. You can sort these medians and obtain a cumulative distribution. But you have to have an extremely good reason to collect data from 1000 boots, and I don't believe we have that good of a reason. > Analysis 2: > a's median - 0.4 = 8.9 - 0.4 = 8.5. > 24/48 b's data points are less than 0.4 less than a's median. > The probability that a's data points are less than 8.5 is p = 7/48 = 0.1458 This is only 85.4% significant, so... > For a's binomial distribution P(X >= 24) = 0.0%, where p=0.1458. > So, looks like we have confidence that b is 400ms faster than a. ...we really cannot say anything about 400ms faster. Again, you need 95% and preferably 99% to really make any sort of claim. You probably need quite a few more samples to say much about 200ms, let alone 400ms. Plus, you really should select the speedup and only then take the measurements. Otherwise, you end up fitting noise. However, assuming identical test setups, you really can calculate the median from the full data set. 
Instead, I might check to see if there are random events adding noise to the boot duration, eliminate them, and hopefully get data that is easier to analyze. But I am good with the 98.3% confidence in a 100ms improvement. So if Joel wishes to make this point, he should feel free to take both of your datasets and use the computation with the worse mean. Thanx, Paul
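[Editor's note: the standard deviations and the binomial tail probability discussed in this exchange can be reproduced with a short stdlib-only script. This is a sketch for readers following along; the data are the 48-sample measurements quoted above, and the 35-of-48 count comes from Qiuxu's "Analysis 1".]

```python
import math
from statistics import mean, stdev

# 48 boot times (seconds) without the patch (sample "a") and with it ("b"),
# as posted by Qiuxu earlier in the thread.
a = [8.4, 8.8, 9.2, 9.0, 8.3, 9.6, 8.8, 9.4,
     8.7, 9.2, 8.3, 9.4, 8.4, 9.6, 8.5, 8.8,
     8.8, 8.9, 9.3, 9.2, 8.6, 9.7, 9.2, 8.8,
     8.7, 9.0, 9.1, 9.5, 8.6, 8.9, 9.1, 8.6,
     8.2, 9.1, 8.8, 9.2, 9.1, 8.9, 8.4, 9.0,
     9.8, 9.8, 8.7, 8.8, 9.1, 9.5, 9.5, 8.7]
b = [7.7, 8.6, 8.1, 7.8, 8.2, 8.2, 8.8, 8.2,
     9.8, 8.0, 9.2, 8.8, 9.2, 8.5, 8.4, 9.2,
     8.5, 8.3, 8.1, 8.3, 8.6, 7.9, 8.3, 8.3,
     8.6, 8.9, 8.0, 8.5, 8.4, 8.6, 8.7, 8.0,
     8.8, 8.8, 9.1, 7.9, 9.7, 7.9, 8.2, 7.8,
     8.1, 8.5, 8.6, 8.4, 9.2, 8.6, 9.6, 8.3]

# Sample means and (sample) standard deviations, ~9.0/0.4 and ~8.5/0.5.
print(f"a: mean={mean(a):.1f} sd={stdev(a):.1f}")
print(f"b: mean={mean(b):.1f} sd={stdev(b):.1f}")

def binom_tail(n, k, p=0.5):
    """P(X >= k) for X ~ Binomial(n, p)."""
    return sum(math.comb(n, i) * p**i * (1 - p)**(n - i)
               for i in range(k, n + 1))

# 35 of the 48 b samples fall more than 0.1 below a's median; under the
# null hypothesis each lands below the median with probability 0.5, so
# the one-sided tail probability is ~0.1% (Qiuxu's "Analysis 1").
print(f"P(X >= 35 | n=48, p=0.5) = {binom_tail(48, 35):.4f}")
```

Running the same computation against the ten-sample medians reproduces Paul's sum_{i=0}^{2} binomial term as well.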
On Tue, Mar 07, 2023 at 08:41:17AM -0500, Joel Fernandes wrote: > On Tue, Mar 7, 2023 at 8:01 AM Frederic Weisbecker <frederic@kernel.org> wrote: > > > > On Fri, Mar 03, 2023 at 09:38:51PM +0000, Joel Fernandes (Google) wrote: > > > On many systems, a great deal of boot (in userspace) happens after the > > > kernel thinks the boot has completed. It is difficult to determine if > > > the system has really booted from the kernel side. Some features like > > > lazy-RCU can risk slowing down boot time if, say, a callback has been > > > added that the boot synchronously depends on. Further, expedited callbacks > > > can get unexpedited way earlier than they should, thus slowing down > > > boot (as shown in the data below). > > > > > > For these reasons, this commit adds a config option > > > 'CONFIG_RCU_BOOT_END_DELAY' and a boot parameter rcupdate.boot_end_delay. > > > Userspace can also mark RCU's view of the system as booted by writing the > > > time in milliseconds to: /sys/module/rcupdate/parameters/rcu_boot_end_delay > > > Or even just writing a value of 0 to this sysfs node. > > > However, under no circumstance will the boot be allowed to end earlier > > > than just before init is launched. > > > > > > The default value of CONFIG_RCU_BOOT_END_DELAY is chosen as 15s. This > > > suits ChromeOS and also a PREEMPT_RT system below very well, which need > > > no config or parameter changes, and just a simple application of this patch. A > > > system designer can also choose a specific value here to keep RCU from marking > > > boot completion. As noted earlier, RCU's perspective of the system as booted > > > will not be marked until at least rcu_boot_end_delay milliseconds have passed > > > or an update is made via writing a small value (or 0) in milliseconds to: > > > /sys/module/rcupdate/parameters/rcu_boot_end_delay. 
> > > > > > One side-effect of this patch is that there is a risk that a real-time workload > > > launched just after the kernel boots will suffer interruptions due to expedited > > > RCU, which previously ended just before init was launched. However, to mitigate > > > such an issue (however unlikely), the user should either tune > > > CONFIG_RCU_BOOT_END_DELAY to a smaller value than 15 seconds or write a value > > > of 0 to /sys/module/rcupdate/parameters/rcu_boot_end_delay, once userspace > > > boots, and before launching the real-time workload. > > > > > > Qiuxu also noted impressive boot-time improvements with an earlier version > > > of the patch. An excerpt from the data he shared: > > > > > > 1) Testing environment: > > > OS : CentOS Stream 8 (non-RT OS) > > > Kernel : v6.2 > > > Machine : Intel Cascade Lake server (2 sockets, each with 44 logical threads) > > > Qemu args : -cpu host -enable-kvm, -smp 88,threads=2,sockets=2, … > > > > > > 2) OS boot time definition: > > > The time from the start of the kernel boot until the shell command-line > > > prompt is shown on the console. [ Different people may have > > > different OS boot time definitions. ] > > > > > > 3) Measurement method (very rough method): > > > A timer in the kernel periodically prints the boot time every 100ms. > > > As soon as the shell command line prompt is shown from the console, > > > we record the boot time printed by the timer, then the printed boot > > > time is the OS boot time. > > > > > > 4) Measured OS boot time (in seconds) > > > a) Measured 10 times w/o this patch: > > > 8.7s, 8.4s, 8.6s, 8.2s, 9.0s, 8.7s, 8.8s, 9.3s, 8.8s, 8.3s > > > The average OS boot time was: ~8.7s > > > > > > b) Measured 10 times w/ this patch: > > > 8.5s, 8.2s, 7.6s, 8.2s, 8.7s, 8.2s, 7.8s, 8.2s, 9.3s, 8.4s > > > The average OS boot time was: ~8.3s. 
> > > > > > Tested-by: Qiuxu Zhuo <qiuxu.zhuo@intel.com> > > > Signed-off-by: Joel Fernandes (Google) <joel@joelfernandes.org> > > > > I still don't really like that: > > > > 1) It feels like we are curing a symptom for which we don't know the cause. > > Which RCU write side caller is the source of this slow boot? Some tracepoints > > reporting the wait duration within synchronize_rcu() calls between the end of > > the kernel boot and the end of userspace boot may be helpful. > > Just to clarify (and I feel we discussed this recently) -- there is no > callback I am aware of right now causing a slow boot. The reason for > doing this is so we don't have such issues in the future; it is a > protection. Note the repeated callouts to the scsi callback and also > the rcu_barrier() issue previously fixed. Further, we already see > slight improvements in boot times with disabling lazy during boot (it's > not much but it's there). Yes, we should fix issues instead of hiding > them - but we also would like to improve the user experience -- just > like we disable lazy and expedited during suspend. > > So what is the problem that you really have with this patch even with > data showing improvements? I actually wanted a mechanism like this > from the beginning and was trying to get Intel to write the patch, but > I ended up writing it. Let's put it another way: kernel boot is mostly code that won't execute > again. User boot (or rather the kernel part of it) OTOH is code that is > subject to be repeated again. > > A lot of the kernel boot code is __init code that will execute only once. > And there it makes sense to force hurry and expedited because we may easily > miss something and after all this all happens only once, also there is no > interference with userspace, etc... > > User boot OTOH uses common kernel code: syscalls, signals, files, etc... And that > code will be called also after the boot. 
> > So if there is something slowing down user boot, there are some good chances > that this thing slows down userspace in general. > > Therefore we need to know exactly what's going on because the problem may be > bigger than what you observe on boot. > > > > > 2) The kernel boot was already covered before this patch so this is about > > userspace code calling into the kernel. Is that piece of code also called > > after the boot? In that case are we missing a conversion from > > synchronize_rcu() to synchronize_rcu_expedited() somewhere? Because then > > the problem is more general than just boot. > > > > This needs to be analyzed first and if it happens that the issue really > > needs to be fixed with telling the kernel that userspace has completed > > booting, eg: because the problem is not in a few callsites that need conversion > > to expedited but instead in the accumulation of lots of calls that should stay > > as is: > > There is no such callback I am aware of that needs such a conversion > and I don't think that will help give any guarantees because there is > no preventing someone from adding a callback that synchronously slows > boot. The approach here is to put a protection. However, I will do > some more investigations into what else may be slowing things as I do > give a lot of weight to your words! :) Kernel boot is already handled and userspace boot cannot add a new RCU callback. > > > > > 3) This arbitrary timeout looks dangerous to me as latency sensitive code > > may run right after the boot. Either you choose a value that is too low > > and you miss the optimization or the value is too high and you may break > > things. > > So someone is presenting a timing sensitive workload within 15 seconds > of boot? Please provide some evidence of that. I have no idea, there are billions of computers running out there, it's a disaster... > The only evidence right now is on the plus side even for the RT system. 
Right, it's improving the boot of an RT system; that doesn't mean it's not breaking the post-boot behavior of others. > > > 4) This should be fixed the way you did: > > a) a kernel parameter like you did > > b) The init process (systemd?) tells the kernel when it judges that userspace > > has completed booting. > > c) Make these interfaces more generic, maybe that information will be useful > > outside RCU. For example the kernel parameter should be > > "user_booted_reported" and the sysfs (should be sysctl?): > > kernel.user_booted = 1 > > d) But yuck, this means we must know if the init process supports that... > > > > For these reasons, let's make sure we know exactly what is going on first. > > I can investigate this more and get back to you. > > One of the challenges is getting boot tracing working properly. > Systems do weird things like turning off tracing during boot and/or > clearing trace buffers. Just compare the average and total duration of all synchronize_rcu() calls (before and after forcing expedited) between launching init and userspace boot completion. Sure there will be noise but if a difference can be measured before and after your patch, then a difference might be measurable on tracing as well... Well of course tracing can induce subtle things... But let's at least try; we want to know what we are fixing here. Thanks.
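[Editor's note: once entry/exit timestamps for synchronize_rcu() have been captured between init launch and userspace boot completion (via tracepoints, trace_printk(), or similar), the average/total comparison Frederic asks for is a few lines of post-processing. This is only an illustrative sketch: the log format and the numbers below are invented for the example, not the output of any real tracer.]

```python
# Hypothetical trace excerpt: one "enter"/"exit" timestamp pair (seconds
# since boot) per synchronize_rcu() call observed during userspace boot.
trace = """\
synchronize_rcu enter=2.100 exit=2.135
synchronize_rcu enter=3.400 exit=3.412
synchronize_rcu enter=5.250 exit=5.330
"""

durations_ms = []
for line in trace.splitlines():
    # Each line is "synchronize_rcu key=value key=value".
    fields = dict(kv.split("=") for kv in line.split()[1:])
    durations_ms.append((float(fields["exit"]) - float(fields["enter"])) * 1000)

total = sum(durations_ms)
avg = total / len(durations_ms)
print(f"{len(durations_ms)} calls, total {total:.0f} ms, avg {avg:.0f} ms")
```

Running the same script over traces taken with and without the patch (and with and without forced expediting) would give the before/after comparison suggested above.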
On Tue, Mar 07, 2023 at 08:48:52AM -0500, Joel Fernandes wrote: > On Tue, Mar 7, 2023 at 8:40 AM Uladzislau Rezki <urezki@gmail.com> wrote: > > > > On Tue, Mar 07, 2023 at 02:01:54PM +0100, Frederic Weisbecker wrote: > > > On Fri, Mar 03, 2023 at 09:38:51PM +0000, Joel Fernandes (Google) wrote: > > > > On many systems, a great deal of boot (in userspace) happens after the > > > > kernel thinks the boot has completed. It is difficult to determine if > > > > the system has really booted from the kernel side. Some features like > > > > lazy-RCU can risk slowing down boot time if, say, a callback has been > > > > added that the boot synchronously depends on. Further expedited callbacks > > > > can get unexpedited way earlier than it should be, thus slowing down > > > > boot (as shown in the data below). > > > > > > > > For these reasons, this commit adds a config option > > > > 'CONFIG_RCU_BOOT_END_DELAY' and a boot parameter rcupdate.boot_end_delay. > > > > Userspace can also make RCU's view of the system as booted, by writing the > > > > time in milliseconds to: /sys/module/rcupdate/parameters/rcu_boot_end_delay > > > > Or even just writing a value of 0 to this sysfs node. > > > > However, under no circumstance will the boot be allowed to end earlier > > > > than just before init is launched. > > > > > > > > The default value of CONFIG_RCU_BOOT_END_DELAY is chosen as 15s. This > > > > suites ChromeOS and also a PREEMPT_RT system below very well, which need > > > > no config or parameter changes, and just a simple application of this patch. A > > > > system designer can also choose a specific value here to keep RCU from marking > > > > boot completion. As noted earlier, RCU's perspective of the system as booted > > > > will not be marker until at least rcu_boot_end_delay milliseconds have passed > > > > or an update is made via writing a small value (or 0) in milliseconds to: > > > > /sys/module/rcupdate/parameters/rcu_boot_end_delay. 
> > > > > > > > One side-effect of this patch is, there is a risk that a real-time workload > > > > launched just after the kernel boots will suffer interruptions due to expedited > > > > RCU, which previous ended just before init was launched. However, to mitigate > > > > such an issue (however unlikely), the user should either tune > > > > CONFIG_RCU_BOOT_END_DELAY to a smaller value than 15 seconds or write a value > > > > of 0 to /sys/module/rcupdate/parameters/rcu_boot_end_delay, once userspace > > > > boots, and before launching the real-time workload. > > > > > > > > Qiuxu also noted impressive boot-time improvements with earlier version > > > > of patch. An excerpt from the data he shared: > > > > > > > > 1) Testing environment: > > > > OS : CentOS Stream 8 (non-RT OS) > > > > Kernel : v6.2 > > > > Machine : Intel Cascade Lake server (2 sockets, each with 44 logical threads) > > > > Qemu args : -cpu host -enable-kvm, -smp 88,threads=2,sockets=2, … > > > > > > > > 2) OS boot time definition: > > > > The time from the start of the kernel boot to the shell command line > > > > prompt is shown from the console. [ Different people may have > > > > different OS boot time definitions. ] > > > > > > > > 3) Measurement method (very rough method): > > > > A timer in the kernel periodically prints the boot time every 100ms. > > > > As soon as the shell command line prompt is shown from the console, > > > > we record the boot time printed by the timer, then the printed boot > > > > time is the OS boot time. > > > > > > > > 4) Measured OS boot time (in seconds) > > > > a) Measured 10 times w/o this patch: > > > > 8.7s, 8.4s, 8.6s, 8.2s, 9.0s, 8.7s, 8.8s, 9.3s, 8.8s, 8.3s > > > > The average OS boot time was: ~8.7s > > > > > > > > b) Measure 10 times w/ this patch: > > > > 8.5s, 8.2s, 7.6s, 8.2s, 8.7s, 8.2s, 7.8s, 8.2s, 9.3s, 8.4s > > > > The average OS boot time was: ~8.3s. 
> > > > > > > > Tested-by: Qiuxu Zhuo <qiuxu.zhuo@intel.com> > > > > Signed-off-by: Joel Fernandes (Google) <joel@joelfernandes.org> > > > > > > I still don't really like that: > > > > > > 1) It feels like we are curing a symptom for which we don't know the cause. > > > Which RCU write side caller is the source of this slow boot? Some tracepoints > > > reporting the wait duration within synchronize_rcu() calls between the end of > > > the kernel boot and the end of userspace boot may be helpful. > > > > > > 2) The kernel boot was already covered before this patch so this is about > > > userspace code calling into the kernel. Is that piece of code also called > > > after the boot? In that case are we missing a conversion from > > > synchronize_rcu() to synchronize_rcu_expedited() somewhere? Because then > > > the problem is more general than just boot. > > > > > > This needs to be analyzed first and if it happens that the issue really > > > needs to be fixed with telling the kernel that userspace has completed > > > booting, eg: because the problem is not in a few callsites that need conversion > > > to expedited but instead in the accumulation of lots of calls that should stay > > > as is: > > > > > > 3) This arbitrary timeout looks dangerous to me as latency sensitive code > > > may run right after the boot. Either you choose a value that is too low > > > and you miss the optimization or the value is too high and you may break > > > things. > > > > > > 4) This should be fixed the way you did: > > > a) a kernel parameter like you did > > > b) The init process (systemd?) tells the kernel when it judges that userspace > > > has completed booting. > > > c) Make these interfaces more generic, maybe that information will be useful > > > outside RCU. For example the kernel parameter should be > > > "user_booted_reported" and the sysfs (should be sysctl?): > > > kernel.user_booted = 1 > > > d) But yuck, this means we must know if the init process supports that... 
> > > > > > For these reasons, let's make sure we know exactly what is going on first. > > > > > > Thanks. > > Just add some notes and thoughts. There is a rcupdate.rcu_expedited=1 > > parameter that can be used during the boot. For example on our devices > > to speedup a boot we boot the kernel with rcu_expedited: > > > > XQ-DQ54:/ # cat /proc/cmdline > > stack_depot_disable=on kasan.stacktrace=off kvm-arm.mode=protected cgroup_disable=pressure console=ttyMSM0,115200n8 loglevel=6 kpti=0 log_buf_len=256K kernel.panic_on_rcu_stall=1 service_locator.enable=1 msm_rtb.filter=0x237 rcupdate.rcu_expedited=1 rcu_nocbs=0-7 ftrace_dump_on_oops swiotlb=noforce loop.max_part=7 fw_devlink.strict=1 allow_mismatched_32bit_el0 cpufreq.default_governor=performance printk.console_no_auto_verbose=1 kasan=off sysctl.kernel.sched_pelt_multiplier=4 can.stats_timer=0 pcie_ports=compat irqaffinity=0-2 disable_dma32=on no-steal-acc cgroup.memory=nokmem,nosocket video=vfb:640x400,bpp=32,memsize=3072000 page_owner=on stack_depot_disable=off printk.console_no_auto_verbose=0 nosoftlockup bootconfig buildvariant=userdebug msm_drm.dsi_display0=somc,1_panel: rootwait ro init=/init qcom_geni_serial.con_enabled=0 oembootloader.startup=0x00000001 oembootloader.warmboot=0x00000000 oembootloader.securityflags=0x00000001 > > XQ-DQ54:/ # > > > > then a user space can decides if it is needed or not: > > > > <snip> > > rcu_expedited rcu_normal > > XQ-DQ54:/ # ls -al /sys/kernel/rcu_* > > -rw-r--r-- 1 root root 4096 2023-02-16 09:27 /sys/kernel/rcu_expedited > > -rw-r--r-- 1 root root 4096 2023-02-16 09:27 /sys/kernel/rcu_normal > > XQ-DQ54:/ # > > <snip> > > > > for lazy we can add "rcu_cb_lazy" parameter and boot the kernel with > > true or false. So we can follow and be aligned with rcu_expedited and > > rcu_normal parameters. > > Speaking of aligning, there is also the automated > rcu_normal_after_boot boot option correct? I prefer the automated > option of doing this. 
So the approach here is not really unprecedented > and is much more robust than relying on userspace too much (I am ok > with adding your suggestion *on top* of the automated toggle, but I > probably would not have ChromeOS use it if the automated way exists). > Or did I miss something? See this commit: 3705b88db0d7cc ("rcu: Add a module parameter to force use of expedited RCU primitives") Antti provided this commit precisely in order to allow Android devices to expedite the boot process and to shut off the expediting at a time of Android userspace's choosing. So Android has been making this work for about ten years, which strikes me as an adequate proof of concept. ;-) Of course, Android has a rather tightly controlled userspace, as do real-time embedded systems (I sure hope, anyway!). Which is why your timeout-based fallback/backup makes a lot of sense. And why someone might want an aggressive indication when that timeout-based backup is needed. Thanx, Paul
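[Editor's note: for completeness, the hand-off under the patch being discussed mirrors the Android-era knob Paul cites: a timeout-based default plus an optional early write from userspace. A configuration sketch only, using the parameter named in the commit message; exact semantics are defined by the patch itself, and where the write happens is up to the init system.]

```sh
# Boot parameter form: cap RCU's boot window (milliseconds; default 15000).
#   ... rcupdate.boot_end_delay=15000 ...

# Userspace form: declare boot finished now. Per the commit message, boot
# still cannot be marked ended before init is launched.
echo 0 > /sys/module/rcupdate/parameters/rcu_boot_end_delay
```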
On Tue, Mar 7, 2023 at 12:19 PM Frederic Weisbecker <frederic@kernel.org> wrote: > > On Tue, Mar 07, 2023 at 08:41:17AM -0500, Joel Fernandes wrote: > > On Tue, Mar 7, 2023 at 8:01 AM Frederic Weisbecker <frederic@kernel.org> wrote: > > > > > > On Fri, Mar 03, 2023 at 09:38:51PM +0000, Joel Fernandes (Google) wrote: > > > > On many systems, a great deal of boot (in userspace) happens after the > > > > kernel thinks the boot has completed. It is difficult to determine if > > > > the system has really booted from the kernel side. Some features like > > > > lazy-RCU can risk slowing down boot time if, say, a callback has been > > > > added that the boot synchronously depends on. Further expedited callbacks > > > > can get unexpedited way earlier than it should be, thus slowing down > > > > boot (as shown in the data below). > > > > > > > > For these reasons, this commit adds a config option > > > > 'CONFIG_RCU_BOOT_END_DELAY' and a boot parameter rcupdate.boot_end_delay. > > > > Userspace can also make RCU's view of the system as booted, by writing the > > > > time in milliseconds to: /sys/module/rcupdate/parameters/rcu_boot_end_delay > > > > Or even just writing a value of 0 to this sysfs node. > > > > However, under no circumstance will the boot be allowed to end earlier > > > > than just before init is launched. > > > > > > > > The default value of CONFIG_RCU_BOOT_END_DELAY is chosen as 15s. This > > > > suites ChromeOS and also a PREEMPT_RT system below very well, which need > > > > no config or parameter changes, and just a simple application of this patch. A > > > > system designer can also choose a specific value here to keep RCU from marking > > > > boot completion. As noted earlier, RCU's perspective of the system as booted > > > > will not be marker until at least rcu_boot_end_delay milliseconds have passed > > > > or an update is made via writing a small value (or 0) in milliseconds to: > > > > /sys/module/rcupdate/parameters/rcu_boot_end_delay. 
> > > > > > > > One side-effect of this patch is, there is a risk that a real-time workload > > > > launched just after the kernel boots will suffer interruptions due to expedited > > > > RCU, which previous ended just before init was launched. However, to mitigate > > > > such an issue (however unlikely), the user should either tune > > > > CONFIG_RCU_BOOT_END_DELAY to a smaller value than 15 seconds or write a value > > > > of 0 to /sys/module/rcupdate/parameters/rcu_boot_end_delay, once userspace > > > > boots, and before launching the real-time workload. > > > > > > > > Qiuxu also noted impressive boot-time improvements with earlier version > > > > of patch. An excerpt from the data he shared: > > > > > > > > 1) Testing environment: > > > > OS : CentOS Stream 8 (non-RT OS) > > > > Kernel : v6.2 > > > > Machine : Intel Cascade Lake server (2 sockets, each with 44 logical threads) > > > > Qemu args : -cpu host -enable-kvm, -smp 88,threads=2,sockets=2, … > > > > > > > > 2) OS boot time definition: > > > > The time from the start of the kernel boot to the shell command line > > > > prompt is shown from the console. [ Different people may have > > > > different OS boot time definitions. ] > > > > > > > > 3) Measurement method (very rough method): > > > > A timer in the kernel periodically prints the boot time every 100ms. > > > > As soon as the shell command line prompt is shown from the console, > > > > we record the boot time printed by the timer, then the printed boot > > > > time is the OS boot time. > > > > > > > > 4) Measured OS boot time (in seconds) > > > > a) Measured 10 times w/o this patch: > > > > 8.7s, 8.4s, 8.6s, 8.2s, 9.0s, 8.7s, 8.8s, 9.3s, 8.8s, 8.3s > > > > The average OS boot time was: ~8.7s > > > > > > > > b) Measure 10 times w/ this patch: > > > > 8.5s, 8.2s, 7.6s, 8.2s, 8.7s, 8.2s, 7.8s, 8.2s, 9.3s, 8.4s > > > > The average OS boot time was: ~8.3s. 
> > > > > > > > Tested-by: Qiuxu Zhuo <qiuxu.zhuo@intel.com> > > > > Signed-off-by: Joel Fernandes (Google) <joel@joelfernandes.org> > > > > > > I still don't really like that: > > > > > > 1) It feels like we are curing a symptom for which we don't know the cause. > > > Which RCU write side caller is the source of this slow boot? Some tracepoints > > > reporting the wait duration within synchronize_rcu() calls between the end of > > > the kernel boot and the end of userspace boot may be helpful. > > > > Just to clarify (and I feel we discussed this recently) -- there is no > > callback I am aware of right now causing a slow boot. The reason for > > doing this is we don't have such issues in the future; so it is a > > protection. Note the repeated call outs to the scsi callback and also > > the rcu_barrier() issue previously fixed. Further, we already see > > slight improvements in boot times with disabling lazy during boot (its > > not much but its there). Yes, we should fix issues instead of hiding > > them - but we also would like to improve the user experience -- just > > like we disable lazy and expedited during suspend. > > > > So what is the problem that you really have with this patch even with > > data showing improvements? I actually wanted a mechanism like this > > from the beginning and was trying to get Intel to write the patch, but > > I ended up writing it. > > Let's put it another way: kernel boot is mostly code that won't execute > again. User boot (or rather the kernel part of it) OTOH is code that is > subject to be repeated again. > > A lot of the kernel boot code is __init code that will execute only once. > And there it makes sense to force hurry and expedited because we may easily > miss something and after all this all happens only once, also there is no > interference with userspace, etc... > > User boot OTOH use common kernel code: syscalls, signal, files, etc... And that > code will be called also after the boot. 
>
> So if there is something slowing down user boot, there is a good chance
> that this thing slows down userspace in general.
>
> Therefore we need to know exactly what's going on because the problem may be
> bigger than what you observe on boot.

These are good points. It motivates me to dig further, as otherwise we may be
setting ourselves up for longer-term problems for shorter-term gains. I am
thinking I'll finish my debugobjects patch soon, which adds metadata to
callbacks and exposes the details via debugfs, and provide it to Qiuxu and
the ChromeOS folks to run and study the boot time.

> > > 2) The kernel boot was already covered before this patch so this is about
> > > userspace code calling into the kernel. Is that piece of code also called
> > > after the boot? In that case are we missing a conversion from
> > > synchronize_rcu() to synchronize_rcu_expedited() somewhere? Because then
> > > the problem is more general than just boot.
> > >
> > > This needs to be analyzed first, and if it happens that the issue really
> > > needs to be fixed with telling the kernel that userspace has completed
> > > booting, eg: because the problem is not in a few callsites that need conversion
> > > to expedited but instead in the accumulation of lots of calls that should stay
> > > as is:
> >
> > There is no such callback I am aware of that needs such a conversion,
> > and I don't think that will help give any guarantees because there is
> > no preventing someone from adding a callback that synchronously slows
> > boot. The approach here is to put in a protection. However, I will do
> > some more investigation into what else may be slowing things, as I do
> > put a lot of weight on your words! :)
>
> Kernel boot is already handled, and userspace boot cannot add a new RCU callback.

Right, so that is in line with your point about userspace slowing down even
after boot, if I am not mistaken.
> > > 3) This arbitrary timeout looks dangerous to me, as latency-sensitive code
> > > may run right after the boot. Either you choose a value that is too low
> > > and you miss the optimization, or the value is too high and you may break
> > > things.
> >
> > So someone is presenting a timing-sensitive workload within 15 seconds
> > of boot? Please provide some evidence of that.
>
> I have no idea, there are billions of computers running out there, it's a disaster...

Haha... Linux's success sounds like a nice problem to have. ;-)

> > The only evidence right now is on the plus side, even for the RT system.
>
> Right, it's improving the boot of an RT system; that doesn't mean it's not
> breaking the post-boot behavior of others.

True. However, I still feel a protection would make sense in general in the
future, after we finish these investigations.

> > > 4) This should be fixed the way you did:
> > > a) a kernel parameter like you did
> > > b) The init process (systemd?) tells the kernel when it judges that userspace
> > > has completed booting.
> > > c) Make these interfaces more generic, maybe that information will be useful
> > > outside RCU. For example the kernel parameter should be
> > > "user_booted_reported" and the sysfs (should be sysctl?):
> > > kernel.user_booted = 1
> > > d) But yuck, this means we must know if the init process supports that...
> > >
> > > For these reasons, let's make sure we know exactly what is going on first.
> >
> > I can investigate this more and get back to you.
> >
> > One of the challenges is getting boot tracing working properly.
> > Systems do weird things like turning off tracing during boot and/or
> > clearing trace buffers.
>
> Just compare the average and total duration of all synchronize_rcu() calls
> (before and after forcing expedited) between launching init and userspace boot
> completion. Sure there will be noise, but if a difference can be measured before
> and after your patch, then a difference might be measurable in tracing as well...
> Well of course tracing can induce subtle things... But let's try at
> least, we want to know what we are fixing here.

You mean using the function graph tracer? For the synchronize_rcu() stuff,
I'll have to defer to Qiuxu to try tracing synchronize_rcu() on his
PREEMPT_RT system, since I don't have access to that system. I can try to
provide a patch that will make tracing that easier, but that will probably
take a few days as I'm traveling...

On ChromeOS we are seeing slight improvements with this patch (though it is
not clear whether they are statistically significant), so I have to dig
deeper into what is going on there.

 - Joel
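For reference, the kind of per-call measurement discussed here can be sketched by configuring the stock function_graph tracer; this is a hypothetical recipe (not from the thread), assuming tracefs is mounted at /sys/kernel/tracing and CONFIG_FUNCTION_GRAPH_TRACER=y:

```shell
# Configure ftrace to record the duration of every synchronize_rcu()
# call between init launch and userspace boot completion.
cd /sys/kernel/tracing
echo 0 > tracing_on
echo synchronize_rcu > set_graph_function   # graph only this function
echo function_graph > current_tracer
echo 1 > tracing_on
# ... let userspace boot finish, then:
echo 0 > tracing_on
grep synchronize_rcu trace                  # per-call durations in the output
```

As Joel notes above, some init systems reset the tracer or clear trace buffers during boot, so the toggle may need to be wired into early userspace rather than run by hand.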
On Tue, Mar 07, 2023 at 09:33:13AM -0800, Paul E. McKenney wrote:
> On Tue, Mar 07, 2023 at 08:48:52AM -0500, Joel Fernandes wrote:
> > On Tue, Mar 7, 2023 at 8:40 AM Uladzislau Rezki <urezki@gmail.com> wrote:
> > >
> > > On Tue, Mar 07, 2023 at 02:01:54PM +0100, Frederic Weisbecker wrote:
> > > > On Fri, Mar 03, 2023 at 09:38:51PM +0000, Joel Fernandes (Google) wrote:
> > > > > [ . . . ]
> > > >
> > > > [ . . . ]
> > > >
> > > > For these reasons, let's make sure we know exactly what is going on first.
> > > >
> > > > Thanks.
> > >
> > > Just add some notes and thoughts. There is a rcupdate.rcu_expedited=1
> > > parameter that can be used during the boot. For example on our devices,
> > > to speed up boot, we boot the kernel with rcu_expedited:
> > >
> > > XQ-DQ54:/ # cat /proc/cmdline
> > > XQ-DQ54:/ #
> > >
> > > then userspace can decide if it is needed or not:
> > >
> > > <snip>
> > > rcu_expedited rcu_normal
> > > XQ-DQ54:/ # ls -al /sys/kernel/rcu_*
> > > -rw-r--r-- 1 root root 4096 2023-02-16 09:27 /sys/kernel/rcu_expedited
> > > -rw-r--r-- 1 root root 4096 2023-02-16 09:27 /sys/kernel/rcu_normal
> > > XQ-DQ54:/ #
> > > <snip>
> > >
> > > for lazy we can add a "rcu_cb_lazy" parameter and boot the kernel with
> > > true or false. So we can follow and be aligned with the rcu_expedited
> > > and rcu_normal parameters.
> >
> > Speaking of aligning, there is also the automated
> > rcu_normal_after_boot boot option, correct? I prefer the automated
> > option of doing this. So the approach here is not really unprecedented
> > and is much more robust than relying on userspace too much (I am OK
> > with adding your suggestion *on top* of the automated toggle, but I
> > probably would not have ChromeOS use it if the automated way exists).
> > Or did I miss something?
>
> See this commit:
>
> 3705b88db0d7cc ("rcu: Add a module parameter to force use of expedited RCU primitives")
>
> Antti provided this commit precisely in order to allow Android devices
> to expedite the boot process and to shut off the expediting at a time of
> Android userspace's choosing. So Android has been making this work for
> about ten years, which strikes me as an adequate proof of concept. ;-)

Thanks for the pointer. That's true. Looking at the Android sources, I find
that Android Mediatek devices, at least, are setting rcu_expedited to 1 at a
late stage of their userspace boot (which is weird; it should be set to 1 as
early as possible), and interestingly I cannot find them resetting it back
to 0! Maybe they set rcu_normal to 1? But I cannot find that either. Vlad? :P

> Of course, Android has a rather tightly controlled userspace, as do
> real-time embedded systems (I sure hope, anyway!). Which is why your
> timeout-based fallback/backup makes a lot of sense. And why someone might
> want an aggressive indication when that timeout-based backup is needed.

Or someone designs a system but is unaware of RCU behavior during boot. ;-)

thanks,

 - Joel
On Tue, Mar 07, 2023 at 06:54:43PM +0000, Joel Fernandes wrote:
> On Tue, Mar 07, 2023 at 09:33:13AM -0800, Paul E. McKenney wrote:
> > On Tue, Mar 07, 2023 at 08:48:52AM -0500, Joel Fernandes wrote:
> > > On Tue, Mar 7, 2023 at 8:40 AM Uladzislau Rezki <urezki@gmail.com> wrote:
> > > >
> > > > On Tue, Mar 07, 2023 at 02:01:54PM +0100, Frederic Weisbecker wrote:
> > > > > On Fri, Mar 03, 2023 at 09:38:51PM +0000, Joel Fernandes (Google) wrote:
> > > > > > [ . . . ]
> > >
> > > Speaking of aligning, there is also the automated
> > > rcu_normal_after_boot boot option, correct? I prefer the automated
> > > option of doing this. So the approach here is not really unprecedented
> > > and is much more robust than relying on userspace too much (I am OK
> > > with adding your suggestion *on top* of the automated toggle, but I
> > > probably would not have ChromeOS use it if the automated way exists).
> > > Or did I miss something?
> >
> > See this commit:
> >
> > 3705b88db0d7cc ("rcu: Add a module parameter to force use of expedited RCU primitives")
> >
> > Antti provided this commit precisely in order to allow Android devices
> > to expedite the boot process and to shut off the expediting at a time of
> > Android userspace's choosing. So Android has been making this work for
> > about ten years, which strikes me as an adequate proof of concept. ;-)
>
> Thanks for the pointer. That's true. Looking at Android sources, I find that
> Android Mediatek devices at least are setting rcu_expedited to 1 at a late
> stage of their userspace boot (which is weird, it should be set to 1 as early
> as possible), and interestingly I cannot find them resetting it back to 0!
> Maybe they set rcu_normal to 1? But I cannot find that either. Vlad? :P

Interesting. Though this is consistent with Antti's commit log, where
he talks about expediting grace periods but not unexpediting them.

> > Of course, Android has a rather tightly controlled userspace, as do
> > real-time embedded systems (I sure hope, anyway!). Which is why your
> > timeout-based fallback/backup makes a lot of sense. And why someone might
> > want an aggressive indication when that timeout-based backup is needed.
>
> Or someone designs a system but is unaware of RCU behavior during boot. ;-)

RCU is just doing what they told it to! ;-)

							Thanx, Paul
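The Android-style sequence Paul describes (expedite during boot, unexpedite from userspace) can be sketched as a configuration fragment. The sysfs files are the real /sys/kernel/rcu_expedited and /sys/kernel/rcu_normal shown earlier in the thread; where exactly the late-boot step runs (init script, systemd unit, init.rc action) is a per-system choice and is hypothetical here:

```shell
# Kernel command line: start with expedited grace periods.
#     rcupdate.rcu_expedited=1

# Late in userspace boot (e.g. from a final init script), switch back:
echo 0 > /sys/kernel/rcu_expedited   # stop forcing expedited grace periods
echo 1 > /sys/kernel/rcu_normal      # optionally force normal grace periods
```

Alternatively, rcupdate.rcu_normal_after_boot=1 on the command line performs the switch-off automatically when the kernel considers boot complete, which is the automated behavior the proposed rcu_boot_end_delay parameter extends.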
On Tue, Mar 07, 2023 at 11:27:26AM -0800, Paul E. McKenney wrote:
> On Tue, Mar 07, 2023 at 06:54:43PM +0000, Joel Fernandes wrote:
> > On Tue, Mar 07, 2023 at 09:33:13AM -0800, Paul E. McKenney wrote:
> > > [ . . . ]
> >
> > Thanks for the pointer. That's true. Looking at Android sources, I find that
> > Android Mediatek devices at least are setting rcu_expedited to 1 at a late
> > stage of their userspace boot (which is weird, it should be set to 1 as early
> > as possible), and interestingly I cannot find them resetting it back to 0!
> > Maybe they set rcu_normal to 1? But I cannot find that either. Vlad? :P
>
> Interesting. Though this is consistent with Antti's commit log, where
> he talks about expediting grace periods but not unexpediting them.

Do you think we need to unexpedite it? :))))

--
Uladzislau Rezki
On Tue, Mar 07, 2023 at 08:48:52AM -0500, Joel Fernandes wrote: > On Tue, Mar 7, 2023 at 8:40 AM Uladzislau Rezki <urezki@gmail.com> wrote: > > > > On Tue, Mar 07, 2023 at 02:01:54PM +0100, Frederic Weisbecker wrote: > > > On Fri, Mar 03, 2023 at 09:38:51PM +0000, Joel Fernandes (Google) wrote: > > > > On many systems, a great deal of boot (in userspace) happens after the > > > > kernel thinks the boot has completed. It is difficult to determine if > > > > the system has really booted from the kernel side. Some features like > > > > lazy-RCU can risk slowing down boot time if, say, a callback has been > > > > added that the boot synchronously depends on. Further expedited callbacks > > > > can get unexpedited way earlier than it should be, thus slowing down > > > > boot (as shown in the data below). > > > > > > > > For these reasons, this commit adds a config option > > > > 'CONFIG_RCU_BOOT_END_DELAY' and a boot parameter rcupdate.boot_end_delay. > > > > Userspace can also make RCU's view of the system as booted, by writing the > > > > time in milliseconds to: /sys/module/rcupdate/parameters/rcu_boot_end_delay > > > > Or even just writing a value of 0 to this sysfs node. > > > > However, under no circumstance will the boot be allowed to end earlier > > > > than just before init is launched. > > > > > > > > The default value of CONFIG_RCU_BOOT_END_DELAY is chosen as 15s. This > > > > suites ChromeOS and also a PREEMPT_RT system below very well, which need > > > > no config or parameter changes, and just a simple application of this patch. A > > > > system designer can also choose a specific value here to keep RCU from marking > > > > boot completion. As noted earlier, RCU's perspective of the system as booted > > > > will not be marker until at least rcu_boot_end_delay milliseconds have passed > > > > or an update is made via writing a small value (or 0) in milliseconds to: > > > > /sys/module/rcupdate/parameters/rcu_boot_end_delay. 
> > > > > > > > One side-effect of this patch is that there is a risk that a real-time workload > > > > launched just after the kernel boots will suffer interruptions due to expedited > > > > RCU, which previously ended just before init was launched. However, to mitigate > > > > such an issue (however unlikely), the user should either tune > > > > CONFIG_RCU_BOOT_END_DELAY to a smaller value than 15 seconds or write a value > > > > of 0 to /sys/module/rcupdate/parameters/rcu_boot_end_delay, once userspace > > > > boots, and before launching the real-time workload. > > > > > > > > Qiuxu also noted impressive boot-time improvements with an earlier version > > > > of this patch. An excerpt from the data he shared: > > > > > > > > 1) Testing environment: > > > > OS : CentOS Stream 8 (non-RT OS) > > > > Kernel : v6.2 > > > > Machine : Intel Cascade Lake server (2 sockets, each with 44 logical threads) > > > > Qemu args : -cpu host -enable-kvm, -smp 88,threads=2,sockets=2, … > > > > > > > > 2) OS boot time definition: > > > > The time from the start of the kernel boot to when the shell command line > > > > prompt is shown on the console. [ Different people may have > > > > different OS boot time definitions. ] > > > > > > > > 3) Measurement method (very rough method): > > > > A timer in the kernel periodically prints the boot time every 100ms. > > > > As soon as the shell command line prompt is shown on the console, > > > > we record the boot time printed by the timer, then the printed boot > > > > time is the OS boot time. > > > > > > > > 4) Measured OS boot time (in seconds) > > > > a) Measured 10 times w/o this patch: > > > > 8.7s, 8.4s, 8.6s, 8.2s, 9.0s, 8.7s, 8.8s, 9.3s, 8.8s, 8.3s > > > > The average OS boot time was: ~8.7s > > > > > > > > b) Measured 10 times w/ this patch: > > > > 8.5s, 8.2s, 7.6s, 8.2s, 8.7s, 8.2s, 7.8s, 8.2s, 9.3s, 8.4s > > > > The average OS boot time was: ~8.3s.
> > > > > > > > Tested-by: Qiuxu Zhuo <qiuxu.zhuo@intel.com> > > > > Signed-off-by: Joel Fernandes (Google) <joel@joelfernandes.org> > > > > > > I still don't really like that: > > > > > > 1) It feels like we are curing a symptom for which we don't know the cause. > > > Which RCU write side caller is the source of this slow boot? Some tracepoints > > > reporting the wait duration within synchronize_rcu() calls between the end of > > > the kernel boot and the end of userspace boot may be helpful. > > > > > > 2) The kernel boot was already covered before this patch so this is about > > > userspace code calling into the kernel. Is that piece of code also called > > > after the boot? In that case are we missing a conversion from > > > synchronize_rcu() to synchronize_rcu_expedited() somewhere? Because then > > > the problem is more general than just boot. > > > > > > This needs to be analyzed first and if it happens that the issue really > > > needs to be fixed with telling the kernel that userspace has completed > > > booting, eg: because the problem is not in a few callsites that need conversion > > > to expedited but instead in the accumulation of lots of calls that should stay > > > as is: > > > > > > 3) This arbitrary timeout looks dangerous to me as latency sensitive code > > > may run right after the boot. Either you choose a value that is too low > > > and you miss the optimization or the value is too high and you may break > > > things. > > > > > > 4) This should be fixed the way you did: > > > a) a kernel parameter like you did > > > b) The init process (systemd?) tells the kernel when it judges that userspace > > > has completed booting. > > > c) Make these interfaces more generic, maybe that information will be useful > > > outside RCU. For example the kernel parameter should be > > > "user_booted_reported" and the sysfs (should be sysctl?): > > > kernel.user_booted = 1 > > > d) But yuck, this means we must know if the init process supports that... 
> > > > > > For these reasons, let's make sure we know exactly what is going on first. > > > > > > Thanks. > > Just add some notes and thoughts. There is a rcupdate.rcu_expedited=1 > > parameter that can be used during the boot. For example on our devices > > to speed up boot we boot the kernel with rcu_expedited: > > > > XQ-DQ54:/ # cat /proc/cmdline > > stack_depot_disable=on kasan.stacktrace=off kvm-arm.mode=protected cgroup_disable=pressure console=ttyMSM0,115200n8 loglevel=6 kpti=0 log_buf_len=256K kernel.panic_on_rcu_stall=1 service_locator.enable=1 msm_rtb.filter=0x237 rcupdate.rcu_expedited=1 rcu_nocbs=0-7 ftrace_dump_on_oops swiotlb=noforce loop.max_part=7 fw_devlink.strict=1 allow_mismatched_32bit_el0 cpufreq.default_governor=performance printk.console_no_auto_verbose=1 kasan=off sysctl.kernel.sched_pelt_multiplier=4 can.stats_timer=0 pcie_ports=compat irqaffinity=0-2 disable_dma32=on no-steal-acc cgroup.memory=nokmem,nosocket video=vfb:640x400,bpp=32,memsize=3072000 page_owner=on stack_depot_disable=off printk.console_no_auto_verbose=0 nosoftlockup bootconfig buildvariant=userdebug msm_drm.dsi_display0=somc,1_panel: rootwait ro init=/init qcom_geni_serial.con_enabled=0 oembootloader.startup=0x00000001 oembootloader.warmboot=0x00000000 oembootloader.securityflags=0x00000001 > > XQ-DQ54:/ # > > > > then user space can decide if it is needed or not: > > > > <snip> > > rcu_expedited rcu_normal > > XQ-DQ54:/ # ls -al /sys/kernel/rcu_* > > -rw-r--r-- 1 root root 4096 2023-02-16 09:27 /sys/kernel/rcu_expedited > > -rw-r--r-- 1 root root 4096 2023-02-16 09:27 /sys/kernel/rcu_normal > > XQ-DQ54:/ # > > <snip> > > > > for lazy we can add "rcu_cb_lazy" parameter and boot the kernel with > > true or false. So we can follow and be aligned with rcu_expedited and > > rcu_normal parameters. > > Speaking of aligning, there is also the automated > rcu_normal_after_boot boot option correct? I prefer the automated > option of doing this.
So the approach here is not really unprecedented > and is much more robust than relying on userspace too much (I am ok > with adding your suggestion *on top* of the automated toggle, but I > probably would not have ChromeOS use it if the automated way exists). > Or did I miss something? > According to the name of the rcu_end_inkernel_boot() function and the place where it is invoked, we can conclude that it marks the end of kernel boot, and that it happens before running an "init" process. With your patch we change that behavior. The boot-end marking occurs not right after the kernel is up and running but rather after a 15-second timeout, which at least does not correspond to the function name. Apart from that, the expected behavior might be different, for example for some test-suites or smoke tests, etc. Another thought about "automated boot complete" is that we do not know from kernel space when it really completes for user space, because from kernel space we are done and we cannot detect it. In this case user space is the right candidate to say when it is ready. For example, for Android, boot complete happens when the home screen appears. For Chrome OS I think there is something similar. There must be a boot complete event in its init scripts or something similar. These are just my thoughts. I do not really mind, but I also do not see a high need for having it. -- Uladzislau Rezki
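The userspace-driven scheme Uladzislau describes (init declaring boot complete once, say, the home screen appears) can be sketched as a small hook script. This is purely illustrative and not part of the patch: /sys/kernel/rcu_normal exists in mainline, while the rcu_boot_end_delay node is only the interface proposed in this thread; the DRY_RUN guard is a hypothetical addition so the sketch prints its actions instead of touching a live system.

```shell
#!/bin/sh
# Hypothetical "boot complete" hook, run by init once userspace decides
# boot is done. DRY_RUN=1 (the default here) only prints what would be
# written; drop it on a real system.
DRY_RUN=${DRY_RUN:-1}

write_node() {  # $1 = value, $2 = sysfs node
    if [ "$DRY_RUN" = 1 ] || [ ! -w "$2" ]; then
        echo "would write $1 to $2"
    else
        echo "$1" > "$2"
    fi
}

# Mainline knob: force normal (non-expedited) grace periods from now on.
write_node 1 /sys/kernel/rcu_normal
# Knob proposed by this patch: declare the boot period over immediately.
write_node 0 /sys/module/rcupdate/parameters/rcu_boot_end_delay
```

On an Android or ChromeOS image this would hang off the existing boot-complete event in the init scripts rather than any timer.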
> On Mar 7, 2023, at 9:19 AM, Frederic Weisbecker <frederic@kernel.org> wrote: > > On Tue, Mar 07, 2023 at 08:41:17AM -0500, Joel Fernandes wrote: >>> On Tue, Mar 7, 2023 at 8:01 AM Frederic Weisbecker <frederic@kernel.org> wrote: >>> >>> On Fri, Mar 03, 2023 at 09:38:51PM +0000, Joel Fernandes (Google) wrote: >>>> On many systems, a great deal of boot (in userspace) happens after the >>>> kernel thinks the boot has completed. It is difficult to determine if >>>> the system has really booted from the kernel side. Some features like >>>> lazy-RCU can risk slowing down boot time if, say, a callback has been >>>> added that the boot synchronously depends on. Further, expedited callbacks >>>> can get unexpedited way earlier than they should be, thus slowing down >>>> boot (as shown in the data below). >>>> >>>> For these reasons, this commit adds a config option >>>> 'CONFIG_RCU_BOOT_END_DELAY' and a boot parameter rcupdate.boot_end_delay. >>>> Userspace can also mark RCU's view of the system as booted, by writing the >>>> time in milliseconds to: /sys/module/rcupdate/parameters/rcu_boot_end_delay >>>> Or even just writing a value of 0 to this sysfs node. >>>> However, under no circumstance will the boot be allowed to end earlier >>>> than just before init is launched. >>>> >>>> The default value of CONFIG_RCU_BOOT_END_DELAY is chosen as 15s. This >>>> suits ChromeOS and also a PREEMPT_RT system below very well, which need >>>> no config or parameter changes, and just a simple application of this patch. A >>>> system designer can also choose a specific value here to keep RCU from marking >>>> boot completion. As noted earlier, RCU's perspective of the system as booted >>>> will not be marked until at least rcu_boot_end_delay milliseconds have passed >>>> or an update is made via writing a small value (or 0) in milliseconds to: >>>> /sys/module/rcupdate/parameters/rcu_boot_end_delay.
>>>> >>>> One side-effect of this patch is that there is a risk that a real-time workload >>>> launched just after the kernel boots will suffer interruptions due to expedited >>>> RCU, which previously ended just before init was launched. However, to mitigate >>>> such an issue (however unlikely), the user should either tune >>>> CONFIG_RCU_BOOT_END_DELAY to a smaller value than 15 seconds or write a value >>>> of 0 to /sys/module/rcupdate/parameters/rcu_boot_end_delay, once userspace >>>> boots, and before launching the real-time workload. >>>> >>>> Qiuxu also noted impressive boot-time improvements with an earlier version >>>> of this patch. An excerpt from the data he shared: >>>> >>>> 1) Testing environment: >>>> OS : CentOS Stream 8 (non-RT OS) >>>> Kernel : v6.2 >>>> Machine : Intel Cascade Lake server (2 sockets, each with 44 logical threads) >>>> Qemu args : -cpu host -enable-kvm, -smp 88,threads=2,sockets=2, … >>>> >>>> 2) OS boot time definition: >>>> The time from the start of the kernel boot to when the shell command line >>>> prompt is shown on the console. [ Different people may have >>>> different OS boot time definitions. ] >>>> >>>> 3) Measurement method (very rough method): >>>> A timer in the kernel periodically prints the boot time every 100ms. >>>> As soon as the shell command line prompt is shown on the console, >>>> we record the boot time printed by the timer, then the printed boot >>>> time is the OS boot time. >>>> >>>> 4) Measured OS boot time (in seconds) >>>> a) Measured 10 times w/o this patch: >>>> 8.7s, 8.4s, 8.6s, 8.2s, 9.0s, 8.7s, 8.8s, 9.3s, 8.8s, 8.3s >>>> The average OS boot time was: ~8.7s >>>> >>>> b) Measured 10 times w/ this patch: >>>> 8.5s, 8.2s, 7.6s, 8.2s, 8.7s, 8.2s, 7.8s, 8.2s, 9.3s, 8.4s >>>> The average OS boot time was: ~8.3s.
>>>> >>>> Tested-by: Qiuxu Zhuo <qiuxu.zhuo@intel.com> >>>> Signed-off-by: Joel Fernandes (Google) <joel@joelfernandes.org> >>> >>> I still don't really like that: >>> >>> 1) It feels like we are curing a symptom for which we don't know the cause. >>> Which RCU write side caller is the source of this slow boot? Some tracepoints >>> reporting the wait duration within synchronize_rcu() calls between the end of >>> the kernel boot and the end of userspace boot may be helpful. >> >> Just to clarify (and I feel we discussed this recently) -- there is no >> callback I am aware of right now causing a slow boot. The reason for >> doing this is so we don't have such issues in the future; it is a >> protection. Note the repeated call outs to the scsi callback and also >> the rcu_barrier() issue previously fixed. Further, we already see >> slight improvements in boot times with disabling lazy during boot (it's >> not much but it's there). Yes, we should fix issues instead of hiding >> them - but we also would like to improve the user experience -- just >> like we disable lazy and expedited during suspend. >> >> So what is the problem that you really have with this patch even with >> data showing improvements? I actually wanted a mechanism like this >> from the beginning and was trying to get Intel to write the patch, but >> I ended up writing it. > > Let's put it another way: kernel boot is mostly code that won't execute > again. User boot (or rather the kernel part of it) OTOH is code that is > subject to be repeated again. > > A lot of the kernel boot code is __init code that will execute only once. > And there it makes sense to force hurry and expedited because we may easily > miss something and after all this all happens only once, also there is no > interference with userspace, etc... > > User boot OTOH uses common kernel code: syscalls, signal, files, etc... And that > code will be called also after the boot.
> > So if there is something slowing down user boot, there are some good chances > that this thing slows down userspace in general. > > Therefore we need to know exactly what's going on because the problem may be > bigger than what you observe on boot. Just to add to previous reply: One thing to consider is that it is more of a performance improvement for booting in expedited mode to fall back to normal later, than a bug fix. Repeated synchronize_rcu() can easily add 100s of milliseconds and to remedy that, a conversion of the call from the normal API to the expedited API will not help. What is needed is for user mode to notify the kernel, or some kind of timed fallback like I did here. Both of these approaches have their pros and cons (and so IMO we should probably give an option to do both). Thanks, - Joel > >> >>> 2) The kernel boot was already covered before this patch so this is about >>> userspace code calling into the kernel. Is that piece of code also called >>> after the boot? In that case are we missing a conversion from >>> synchronize_rcu() to synchronize_rcu_expedited() somewhere? Because then >>> the problem is more general than just boot. >>> >>> This needs to be analyzed first and if it happens that the issue really >>> needs to be fixed with telling the kernel that userspace has completed >>> booting, eg: because the problem is not in a few callsites that need conversion >>> to expedited but instead in the accumulation of lots of calls that should stay >>> as is: >> >> There is no such callback I am aware of that needs such a conversion >> and I don't think that will help give any guarantees because there is >> no preventing someone from adding a callback that synchronously slows >> boot. The approach here is to put a protection. However, I will do >> some more investigations into what else may be slowing things as I do >> hold a lot of weight for your words! :) > > Kernel boot is already handled and userspace boot can not add a new RCU callback.
> >> >>> >>> 3) This arbitrary timeout looks dangerous to me as latency sensitive code >>> may run right after the boot. Either you choose a value that is too low >>> and you miss the optimization or the value is too high and you may break >>> things. >> >> So someone is presenting a timing sensitive workload within 15 seconds >> of boot? Please provide some evidence of that. > > I have no idea, there are billions of computers running out there, it's a disaster... > >> The only evidence right now is on the plus side even for the RT system. > > Right, it's improving the boot of an RT system, but that doesn't mean it's not breaking > post-boot behavior of others. > >> >>> 4) This should be fixed the way you did: >>> a) a kernel parameter like you did >>> b) The init process (systemd?) tells the kernel when it judges that userspace >>> has completed booting. >>> c) Make these interfaces more generic, maybe that information will be useful >>> outside RCU. For example the kernel parameter should be >>> "user_booted_reported" and the sysfs (should be sysctl?): >>> kernel.user_booted = 1 >>> d) But yuck, this means we must know if the init process supports that... >>> >>> For these reasons, let's make sure we know exactly what is going on first. >> >> I can investigate this more and get back to you. >> >> One of the challenges is getting boot tracing working properly. >> Systems do weird things like turning off tracing during boot and/or >> clearing trace buffers. > > Just compare the average and total duration of all synchronize_rcu() calls > (before and after forcing expedited) between launching init and userspace boot > completion. Sure there will be noise but if a difference can be measured before > and after your patch, then a difference might be measurable on tracing as > well... Well of course tracing can induce subtle things... But let's try at > least, we want to know what we are fixing here. > > Thanks. >
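One way to collect the per-call durations Frederic asks for is the function_graph tracer, which records how long each synchronize_rcu() call takes. This is a rough sketch, not from the patch: it assumes root, a tracefs mount at the usual path, and CONFIG_FUNCTION_GRAPH_TRACER=y; the DRY_RUN guard is a hypothetical addition so the steps are printed rather than executed.

```shell
#!/bin/sh
# Trace synchronize_rcu() durations over a window covering userspace boot,
# so averages with and without rcupdate.rcu_expedited=1 can be compared.
# DRY_RUN=1 (the default here) only prints the steps.
T=/sys/kernel/tracing
DRY_RUN=${DRY_RUN:-1}
run() { if [ "$DRY_RUN" = 1 ]; then echo "+ $*"; else eval "$*"; fi; }

run "echo synchronize_rcu > $T/set_graph_function"
run "echo funcgraph-abstime > $T/trace_options"   # absolute timestamps
run "echo function_graph > $T/current_tracer"
run "sleep 30"                                    # window for userspace boot work
run "echo nop > $T/current_tracer"
# Each call's wall-clock duration appears in the trace output.
run "grep synchronize_rcu $T/trace | tail -n 20"
```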
On Wed, Mar 08, 2023 at 10:41:19AM +0100, Uladzislau Rezki wrote: > On Tue, Mar 07, 2023 at 11:27:26AM -0800, Paul E. McKenney wrote: > > On Tue, Mar 07, 2023 at 06:54:43PM +0000, Joel Fernandes wrote: > > > On Tue, Mar 07, 2023 at 09:33:13AM -0800, Paul E. McKenney wrote: > > > > On Tue, Mar 07, 2023 at 08:48:52AM -0500, Joel Fernandes wrote: > > > > > On Tue, Mar 7, 2023 at 8:40 AM Uladzislau Rezki <urezki@gmail.com> wrote: > > > > > > > > > > > > On Tue, Mar 07, 2023 at 02:01:54PM +0100, Frederic Weisbecker wrote: > > > > > > > On Fri, Mar 03, 2023 at 09:38:51PM +0000, Joel Fernandes (Google) wrote: > > > > > > > > On many systems, a great deal of boot (in userspace) happens after the > > > > > > > > kernel thinks the boot has completed. It is difficult to determine if > > > > > > > > the system has really booted from the kernel side. Some features like > > > > > > > > lazy-RCU can risk slowing down boot time if, say, a callback has been > > > > > > > > added that the boot synchronously depends on. Further, expedited callbacks > > > > > > > > can get unexpedited way earlier than they should be, thus slowing down > > > > > > > > boot (as shown in the data below). > > > > > > > > > > > > > > > > For these reasons, this commit adds a config option > > > > > > > > 'CONFIG_RCU_BOOT_END_DELAY' and a boot parameter rcupdate.boot_end_delay. > > > > > > > > Userspace can also mark RCU's view of the system as booted, by writing the > > > > > > > > time in milliseconds to: /sys/module/rcupdate/parameters/rcu_boot_end_delay > > > > > > > > Or even just writing a value of 0 to this sysfs node. > > > > > > > > However, under no circumstance will the boot be allowed to end earlier > > > > > > > > than just before init is launched. > > > > > > > > > > > > > > > > The default value of CONFIG_RCU_BOOT_END_DELAY is chosen as 15s.
This > > > > > > > > suits ChromeOS and also a PREEMPT_RT system below very well, which need > > > > > > > > no config or parameter changes, and just a simple application of this patch. A > > > > > > > > system designer can also choose a specific value here to keep RCU from marking > > > > > > > > boot completion. As noted earlier, RCU's perspective of the system as booted > > > > > > > > will not be marked until at least rcu_boot_end_delay milliseconds have passed > > > > > > > > or an update is made via writing a small value (or 0) in milliseconds to: > > > > > > > > /sys/module/rcupdate/parameters/rcu_boot_end_delay. > > > > > > > > > > > > > > > > One side-effect of this patch is that there is a risk that a real-time workload > > > > > > > > launched just after the kernel boots will suffer interruptions due to expedited > > > > > > > > RCU, which previously ended just before init was launched. However, to mitigate > > > > > > > > such an issue (however unlikely), the user should either tune > > > > > > > > CONFIG_RCU_BOOT_END_DELAY to a smaller value than 15 seconds or write a value > > > > > > > > of 0 to /sys/module/rcupdate/parameters/rcu_boot_end_delay, once userspace > > > > > > > > boots, and before launching the real-time workload. > > > > > > > > > > > > > > > > Qiuxu also noted impressive boot-time improvements with an earlier version > > > > > > > > of this patch. An excerpt from the data he shared: > > > > > > > > > > > > > > > > 1) Testing environment: > > > > > > > > OS : CentOS Stream 8 (non-RT OS) > > > > > > > > Kernel : v6.2 > > > > > > > > Machine : Intel Cascade Lake server (2 sockets, each with 44 logical threads) > > > > > > > > Qemu args : -cpu host -enable-kvm, -smp 88,threads=2,sockets=2, … > > > > > > > > > > > > > > > > 2) OS boot time definition: > > > > > > > > The time from the start of the kernel boot to when the shell command line > > > > > > > > prompt is shown on the console.
[ Different people may have > > > > > > > > different OS boot time definitions. ] > > > > > > > > > > > > > > > > 3) Measurement method (very rough method): > > > > > > > > A timer in the kernel periodically prints the boot time every 100ms. > > > > > > > > As soon as the shell command line prompt is shown on the console, > > > > > > > > we record the boot time printed by the timer, then the printed boot > > > > > > > > time is the OS boot time. > > > > > > > > > > > > > > > > 4) Measured OS boot time (in seconds) > > > > > > > > a) Measured 10 times w/o this patch: > > > > > > > > 8.7s, 8.4s, 8.6s, 8.2s, 9.0s, 8.7s, 8.8s, 9.3s, 8.8s, 8.3s > > > > > > > > The average OS boot time was: ~8.7s > > > > > > > > > > > > > > > > b) Measured 10 times w/ this patch: > > > > > > > > 8.5s, 8.2s, 7.6s, 8.2s, 8.7s, 8.2s, 7.8s, 8.2s, 9.3s, 8.4s > > > > > > > > The average OS boot time was: ~8.3s. > > > > > > > > > > > > > > > > Tested-by: Qiuxu Zhuo <qiuxu.zhuo@intel.com> > > > > > > > > Signed-off-by: Joel Fernandes (Google) <joel@joelfernandes.org> > > > > > > > > > > > > > > I still don't really like that: > > > > > > > > > > > > > > 1) It feels like we are curing a symptom for which we don't know the cause. > > > > > > > Which RCU write side caller is the source of this slow boot? Some tracepoints > > > > > > > reporting the wait duration within synchronize_rcu() calls between the end of > > > > > > > the kernel boot and the end of userspace boot may be helpful. > > > > > > > > > > > > > > 2) The kernel boot was already covered before this patch so this is about > > > > > > > userspace code calling into the kernel. Is that piece of code also called > > > > > > > after the boot? In that case are we missing a conversion from > > > > > > > synchronize_rcu() to synchronize_rcu_expedited() somewhere? Because then > > > > > > > the problem is more general than just boot.
> > > > > > > > > > > > > > This needs to be analyzed first and if it happens that the issue really > > > > > > > needs to be fixed with telling the kernel that userspace has completed > > > > > > > booting, eg: because the problem is not in a few callsites that need conversion > > > > > > > to expedited but instead in the accumulation of lots of calls that should stay > > > > > > > as is: > > > > > > > > > > > > > > 3) This arbitrary timeout looks dangerous to me as latency sensitive code > > > > > > > may run right after the boot. Either you choose a value that is too low > > > > > > > and you miss the optimization or the value is too high and you may break > > > > > > > things. > > > > > > > > > > > > > > 4) This should be fixed the way you did: > > > > > > > a) a kernel parameter like you did > > > > > > > b) The init process (systemd?) tells the kernel when it judges that userspace > > > > > > > has completed booting. > > > > > > > c) Make these interfaces more generic, maybe that information will be useful > > > > > > > outside RCU. For example the kernel parameter should be > > > > > > > "user_booted_reported" and the sysfs (should be sysctl?): > > > > > > > kernel.user_booted = 1 > > > > > > > d) But yuck, this means we must know if the init process supports that... > > > > > > > > > > > > > > For these reasons, let's make sure we know exactly what is going on first. > > > > > > > > > > > > > > Thanks. > > > > > > Just add some notes and thoughts. There is a rcupdate.rcu_expedited=1 > > > > > > parameter that can be used during the boot. 
For example on our devices > > > > > > to speed up boot we boot the kernel with rcu_expedited: > > > > > > > > > > > > XQ-DQ54:/ # cat /proc/cmdline > > > > > > XQ-DQ54:/ # > > > > > > > > > > > > then user space can decide if it is needed or not: > > > > > > > > > > > > <snip> > > > > > > rcu_expedited rcu_normal > > > > > > XQ-DQ54:/ # ls -al /sys/kernel/rcu_* > > > > > > -rw-r--r-- 1 root root 4096 2023-02-16 09:27 /sys/kernel/rcu_expedited > > > > > > -rw-r--r-- 1 root root 4096 2023-02-16 09:27 /sys/kernel/rcu_normal > > > > > > XQ-DQ54:/ # > > > > > > <snip> > > > > > > > > > > > > for lazy we can add "rcu_cb_lazy" parameter and boot the kernel with > > > > > > true or false. So we can follow and be aligned with rcu_expedited and > > > > > > rcu_normal parameters. > > > > > > > > > > Speaking of aligning, there is also the automated > > > > > rcu_normal_after_boot boot option correct? I prefer the automated > > > > > option of doing this. So the approach here is not really unprecedented > > > > > and is much more robust than relying on userspace too much (I am ok > > > > > with adding your suggestion *on top* of the automated toggle, but I > > > > > probably would not have ChromeOS use it if the automated way exists). > > > > > Or did I miss something? > > > > > > > > See this commit: > > > > > > > > 3705b88db0d7cc ("rcu: Add a module parameter to force use of expedited RCU primitives") > > > > > > > > Antti provided this commit precisely in order to allow Android devices > > > > to expedite the boot process and to shut off the expediting at a time of > > > > Android userspace's choosing. So Android has been making this work for > > > > about ten years, which strikes me as an adequate proof of concept. ;-) > > > Thanks for the pointer. That's true.
Looking at Android sources, I find that > > > Android Mediatek devices at least are setting rcu_expedited to 1 at a late > > > stage of their userspace boot (which is weird, it should be set to 1 as early > > > as possible), and interestingly I cannot find them resetting it back to 0! > > > Maybe they set rcu_normal to 1? But I cannot find that either. Vlad? :P > > > > Interesting. Though this is consistent with Antti's commit log, where > > he talks about expediting grace periods but not unexpediting them. > > > Do you think we need to unexpedite it? :)))) Android runs on smallish systems, so quite possibly not! Thanx, Paul
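For reference, the decade-old "automated" combination discussed here needs no userspace support at all; both parameters exist in mainline and are documented in the kernel-parameters list:

```shell
# Append to the kernel command line: expedite RCU grace periods throughout
# kernel boot, then have the kernel itself switch back to normal grace
# periods as soon as init is spawned (no userspace involvement required):
#
#   rcupdate.rcu_expedited=1 rcupdate.rcu_normal_after_boot=1
```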
On Wed, Mar 08, 2023 at 05:52:50AM -0800, Joel Fernandes wrote: > Just to add to previous reply: > > One thing to consider is that it is more of a performance improvement for > booting in expedited mode to fall back to normal later, than a bug > fix. Repeated synchronize_rcu() can easily add 100s of milliseconds and to > remedy that, a conversion of the call from the normal API to the expedited API > will not help. 2 things to consider: 1) Is this about specific calls to synchronize_rcu() that repeat a lot and thus create such a measurable impact? If so, the specific callsites should be considered for a conversion. 2) Is it about lots of different calls to synchronize_rcu() that add up to a big noise? Then the solution is different. Again without proper analysis, what do we know? Thanks.
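Frederic's two cases can be told apart by recording a stack trace for every synchronize_rcu() call during the boot window and counting how often each caller shows up: a few callers with huge counts points at case 1 (convert those callsites), many distinct callers at case 2. A rough sketch under the same assumptions as the earlier tracing example (root, tracefs available); the DRY_RUN guard is hypothetical, and the final pipeline is only approximate since it counts every stack frame rather than just the immediate caller.

```shell
#!/bin/sh
# Record stacks for each synchronize_rcu() call, then build a rough
# caller histogram. DRY_RUN=1 (the default here) only prints the steps.
T=/sys/kernel/tracing
DRY_RUN=${DRY_RUN:-1}
run() { if [ "$DRY_RUN" = 1 ]; then echo "+ $*"; else eval "$*"; fi; }

run "echo synchronize_rcu > $T/set_ftrace_filter"
run "echo 1 > $T/options/func_stack_trace"
run "echo function > $T/current_tracer"
run "sleep 30"                         # window for userspace boot work
run "echo nop > $T/current_tracer"
run "echo 0 > $T/options/func_stack_trace"
# Stack frames are printed as ' => func+0x...'; count them per function.
run "awk '/=>/ {print \$2}' $T/trace | sort | uniq -c | sort -rn | head"
```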
> On Mar 8, 2023, at 7:01 AM, Frederic Weisbecker <frederic@kernel.org> wrote: > > On Wed, Mar 08, 2023 at 05:52:50AM -0800, Joel Fernandes wrote: >> Just to add to previous reply: >> >> One thing to consider is that it is more of a performance improvement for >> booting in expedited mode to fall back to normal later, than a bug >> fix. Repeated synchronize_rcu() can easily add 100s of milliseconds and to >> remedy that, a conversion of the call from the normal API to the expedited API >> will not help. > > 2 things to consider: > > 1) Is this about specific calls to synchronize_rcu() that repeat a lot > and thus create such a measurable impact? If so, the specific callsites should > be considered for a conversion. > > 2) Is it about lots of different calls to synchronize_rcu() that add up to a big > noise? Then the solution is different. > > Again without proper analysis, what do we know? Again, no one disputed that proper analysis is needed. That is obvious. I was just responding to your assumption that if boot is slow, user space will also be slow. That is not a good thing to conclude because there are many factors. Slowness at boot may be considered a bug, but slowness after boot may not be (say if the user cares more about power later). On my side I am planning to dig deeper into our boot process, but it will take time. I hope Qiuxu can do the boot analysis on his side. Thanks. > > Thanks.
On Wed, Mar 08, 2023 at 06:45:28AM -0800, Paul E. McKenney wrote: > On Wed, Mar 08, 2023 at 10:41:19AM +0100, Uladzislau Rezki wrote: > > On Tue, Mar 07, 2023 at 11:27:26AM -0800, Paul E. McKenney wrote: > > > On Tue, Mar 07, 2023 at 06:54:43PM +0000, Joel Fernandes wrote: > > > > On Tue, Mar 07, 2023 at 09:33:13AM -0800, Paul E. McKenney wrote: > > > > > On Tue, Mar 07, 2023 at 08:48:52AM -0500, Joel Fernandes wrote: > > > > > > On Tue, Mar 7, 2023 at 8:40 AM Uladzislau Rezki <urezki@gmail.com> wrote: > > > > > > > > > > > > > > On Tue, Mar 07, 2023 at 02:01:54PM +0100, Frederic Weisbecker wrote: > > > > > > > > On Fri, Mar 03, 2023 at 09:38:51PM +0000, Joel Fernandes (Google) wrote: > > > > > > > > > On many systems, a great deal of boot (in userspace) happens after the > > > > > > > > > kernel thinks the boot has completed. It is difficult to determine if > > > > > > > > > the system has really booted from the kernel side. Some features like > > > > > > > > > lazy-RCU can risk slowing down boot time if, say, a callback has been > > > > > > > > > added that the boot synchronously depends on. Further, expedited callbacks > > > > > > > > > can get unexpedited way earlier than they should be, thus slowing down > > > > > > > > > boot (as shown in the data below). > > > > > > > > > > > > > > > > > > For these reasons, this commit adds a config option > > > > > > > > > 'CONFIG_RCU_BOOT_END_DELAY' and a boot parameter rcupdate.boot_end_delay. > > > > > > > > > Userspace can also mark RCU's view of the system as booted, by writing the > > > > > > > > > time in milliseconds to: /sys/module/rcupdate/parameters/rcu_boot_end_delay > > > > > > > > > Or even just writing a value of 0 to this sysfs node. > > > > > > > > > However, under no circumstance will the boot be allowed to end earlier > > > > > > > > > than just before init is launched. > > > > > > > > > > > > > > > > > > The default value of CONFIG_RCU_BOOT_END_DELAY is chosen as 15s.
This > > > > > > > > > suits ChromeOS and also a PREEMPT_RT system below very well, which need > > > > > > > > > no config or parameter changes, and just a simple application of this patch. A > > > > > > > > > system designer can also choose a specific value here to keep RCU from marking > > > > > > > > > boot completion. As noted earlier, RCU's perspective of the system as booted > > > > > > > > > will not be marked until at least rcu_boot_end_delay milliseconds have passed > > > > > > > > > or an update is made via writing a small value (or 0) in milliseconds to: > > > > > > > > > /sys/module/rcupdate/parameters/rcu_boot_end_delay. > > > > > > > > > > > > > > > > > > One side-effect of this patch is that there is a risk that a real-time workload > > > > > > > > > launched just after the kernel boots will suffer interruptions due to expedited > > > > > > > > > RCU, which previously ended just before init was launched. However, to mitigate > > > > > > > > > such an issue (however unlikely), the user should either tune > > > > > > > > > CONFIG_RCU_BOOT_END_DELAY to a smaller value than 15 seconds or write a value > > > > > > > > > of 0 to /sys/module/rcupdate/parameters/rcu_boot_end_delay, once userspace > > > > > > > > > boots, and before launching the real-time workload. > > > > > > > > > > > > > > > > > > Qiuxu also noted impressive boot-time improvements with an earlier version > > > > > > > > > of this patch. An excerpt from the data he shared: > > > > > > > > > > > > > > > > > > 1) Testing environment: > > > > > > > > > OS : CentOS Stream 8 (non-RT OS) > > > > > > > > > Kernel : v6.2 > > > > > > > > > Machine : Intel Cascade Lake server (2 sockets, each with 44 logical threads) > > > > > > > > > Qemu args : -cpu host -enable-kvm, -smp 88,threads=2,sockets=2, … > > > > > > > > > > > > > > > > > > 2) OS boot time definition: > > > > > > > > > The time from the start of the kernel boot to when the shell command line > > > > > > > > > prompt is shown on the console.
[ Different people may have > > > > > > > > > different OS boot time definitions. ] > > > > > > > > > > > > > > > > > > 3) Measurement method (very rough method): > > > > > > > > > A timer in the kernel periodically prints the boot time every 100ms. > > > > > > > > > As soon as the shell command line prompt is shown on the console, > > > > > > > > > we record the boot time printed by the timer, then the printed boot > > > > > > > > > time is the OS boot time. > > > > > > > > > > > > > > > > > > 4) Measured OS boot time (in seconds) > > > > > > > > > a) Measured 10 times w/o this patch: > > > > > > > > > 8.7s, 8.4s, 8.6s, 8.2s, 9.0s, 8.7s, 8.8s, 9.3s, 8.8s, 8.3s > > > > > > > > > The average OS boot time was: ~8.7s > > > > > > > > > > > > > > > > > > b) Measured 10 times w/ this patch: > > > > > > > > > 8.5s, 8.2s, 7.6s, 8.2s, 8.7s, 8.2s, 7.8s, 8.2s, 9.3s, 8.4s > > > > > > > > > The average OS boot time was: ~8.3s. > > > > > > > > > > > > > > > > > > Tested-by: Qiuxu Zhuo <qiuxu.zhuo@intel.com> > > > > > > > > > Signed-off-by: Joel Fernandes (Google) <joel@joelfernandes.org> > > > > > > > > > > > > > > > > I still don't really like that: > > > > > > > > > > > > > > > > 1) It feels like we are curing a symptom for which we don't know the cause. > > > > > > > > Which RCU write side caller is the source of this slow boot? Some tracepoints > > > > > > > > reporting the wait duration within synchronize_rcu() calls between the end of > > > > > > > > the kernel boot and the end of userspace boot may be helpful. > > > > > > > > > > > > > > > > 2) The kernel boot was already covered before this patch so this is about > > > > > > > > userspace code calling into the kernel. Is that piece of code also called > > > > > > > > after the boot? In that case are we missing a conversion from > > > > > > > > synchronize_rcu() to synchronize_rcu_expedited() somewhere? Because then > > > > > > > > the problem is more general than just boot. 
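For anyone who wants to sanity-check the raw numbers above, a few lines of standalone Python (not part of the patch, purely a reader's aid) reproduce the quoted averages, along with the medians (8.7 and 8.2) that come up later in the thread:

```python
from statistics import mean, median, stdev

# Boot times in seconds, copied from the measurements quoted above.
without_patch = [8.7, 8.4, 8.6, 8.2, 9.0, 8.7, 8.8, 9.3, 8.8, 8.3]
with_patch = [8.5, 8.2, 7.6, 8.2, 8.7, 8.2, 7.8, 8.2, 9.3, 8.4]

for label, data in (("w/o patch", without_patch), ("w/ patch ", with_patch)):
    print(f"{label}: mean={mean(data):.2f}s "
          f"median={median(data):.2f}s stdev={stdev(data):.2f}s")
```

This prints means of 8.68s and 8.31s, matching the "~8.7s" and "~8.3s" figures above.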
> > > > > > > > > > > > > > > > This needs to be analyzed first and if it happens that the issue really > > > > > > > > needs to be fixed by telling the kernel that userspace has completed > > > > > > > > booting, e.g. because the problem is not in a few callsites that need conversion > > > > > > > > to expedited but instead in the accumulation of lots of calls that should stay > > > > > > > > as is: > > > > > > > > > > > > > > > > 3) This arbitrary timeout looks dangerous to me as latency sensitive code > > > > > > > > may run right after the boot. Either you choose a value that is too low > > > > > > > > and you miss the optimization or the value is too high and you may break > > > > > > > > things. > > > > > > > > > > > > > > > > 4) This should be fixed the way you did: > > > > > > > > a) a kernel parameter like you did > > > > > > > > b) The init process (systemd?) tells the kernel when it judges that userspace > > > > > > > > has completed booting. > > > > > > > > c) Make these interfaces more generic, maybe that information will be useful > > > > > > > > outside RCU. For example the kernel parameter should be > > > > > > > > "user_booted_reported" and the sysfs (should be sysctl?): > > > > > > > > kernel.user_booted = 1 > > > > > > > > d) But yuck, this means we must know if the init process supports that... > > > > > > > > > > > > > > > > For these reasons, let's make sure we know exactly what is going on first. > > > > > > > > > > > > > > > > Thanks. > > > > > > > Just adding some notes and thoughts. There is a rcupdate.rcu_expedited=1 > > > > > > > parameter that can be used during boot. 
For example on our devices > > > > > > > to speed up boot we boot the kernel with rcu_expedited: > > > > > > > > > > > > > > XQ-DQ54:/ # cat /proc/cmdline > > > > > > > XQ-DQ54:/ # > > > > > > > > > > > > > > then userspace can decide if it is needed or not: > > > > > > > > > > > > > > <snip> > > > > > > > rcu_expedited rcu_normal > > > > > > > XQ-DQ54:/ # ls -al /sys/kernel/rcu_* > > > > > > > -rw-r--r-- 1 root root 4096 2023-02-16 09:27 /sys/kernel/rcu_expedited > > > > > > > -rw-r--r-- 1 root root 4096 2023-02-16 09:27 /sys/kernel/rcu_normal > > > > > > > XQ-DQ54:/ # > > > > > > > <snip> > > > > > > > > > > > > > > for lazy we can add a "rcu_cb_lazy" parameter and boot the kernel with > > > > > > > true or false. So we can follow and be aligned with the rcu_expedited and > > > > > > > rcu_normal parameters. > > > > > > Speaking of aligning, there is also the automated > > > > > > rcu_normal_after_boot boot option, correct? I prefer the automated > > > > > > option of doing this. So the approach here is not really unprecedented > > > > > > and is much more robust than relying on userspace too much (I am ok > > > > > > with adding your suggestion *on top* of the automated toggle, but I > > > > > > probably would not have ChromeOS use it if the automated way exists). > > > > > > Or did I miss something? > > > > > > > > > > See this commit: > > > > > > > > > > 3705b88db0d7cc ("rcu: Add a module parameter to force use of expedited RCU primitives") > > > > > > > > > > Antti provided this commit precisely in order to allow Android devices > > > > > to expedite the boot process and to shut off the expediting at a time of > > > > > Android userspace's choosing. So Android has been making this work for > > > > > about ten years, which strikes me as an adequate proof of concept. ;-) > > > > Thanks for the pointer. That's true. 
Looking at Android sources, I find that > > > > Android Mediatek devices at least are setting rcu_expedited to 1 at a late > > > > stage of their userspace boot (which is weird, it should be set to 1 as early > > > > as possible), and interestingly I cannot find them resetting it back to 0! > > > > Maybe they set rcu_normal to 1? But I cannot find that either. Vlad? :P > > > > > > Interesting. Though this is consistent with Antti's commit log, where > > > he talks about expediting grace periods but not unexpediting them. > > > > > Do you think we need to unexpedite it? :)))) > > Android runs on smallish systems, so quite possibly not! > We keep it enabled and never unexpedite it. The reason is performance. I have done some app-launch time analysis with it enabled and disabled. An expedited case is much better when it comes to app launch time. It requires ~25% less time to run an app compared with the unexpedited variant. So we have a big gain here. -- Uladzislau Rezki
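For reference, the /sys/kernel/rcu_expedited and /sys/kernel/rcu_normal knobs shown in the `ls -al` output above are plain boolean files, so the userspace side of this policy is tiny. Here is a sketch (Python purely for illustration; the default path is the real sysfs node, but it is parameterized so the logic can be exercised without root or a live kernel):

```python
def write_rcu_knob(value: int, path: str = "/sys/kernel/rcu_expedited") -> None:
    """Write 0 or 1 to an RCU boolean sysfs knob (needs root on a real system)."""
    with open(path, "w") as f:
        f.write(f"{int(bool(value))}\n")

def read_rcu_knob(path: str = "/sys/kernel/rcu_expedited") -> int:
    """Read back the current 0/1 value of an RCU boolean sysfs knob."""
    with open(path) as f:
        return int(f.read().strip())
```

An init script could leave rcu_expedited at 1 for the whole of boot and, on systems that care about post-boot latency, flip it back (or write 1 to rcu_normal) once userspace has settled; as Uladzislau notes above, Android simply leaves expediting on.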
> From: Paul E. McKenney <paulmck@kernel.org> > [...] > > > > a's standard deviation is ~0.4. > > b's standard deviation is ~0.5. > > > > a's average 9.0 is at the upbound of the standard deviation of b's [8.0, 9]. > > So, the measurements should be statistically significant to some degree. > > That single standard deviation means that you have 68% confidence that the > difference is real. This is not far above the 50% leval of random noise. > 95% is the lowest level that is normally considered to be statistically > significant. 95% means there is no overlap between two standard deviations of a and two standard deviations of b. This relies on either much less noise during testing or a big enough difference between a and b. > > The calculated standard deviations are via: > > https://www.gigacalculator.com/calculators/standard-deviation-calculat > > or.php > > Fair enough. Formulas are readily available as well, and most spreadsheets > support standard deviation. > > [...] > > > > Why don't you try applying this approach to the new data? You will > > > need the general binomial formula. > > > > Thank you Paul for the suggestion. > > I just tried it, but not sure whether my analysis was correct ... > > > > Analysis 1: > > a's median is 8.9. > > I get 8.95, which is the average of the 24th and 25th members of a in > numerical order. Yes, it should be 8.95. Thanks for correcting me. > > 35/48 b's data points are less than 0.1 less than a's median. > > For a's binomial distribution P(X >= 35) = 0.1%, where p=0.5. > > So, we have strong confidence that b is 100ms faster than a. > > I of course get quite a bit stronger confidence, but your 99.9% is good > enough. And I get even stronger confidence going in the other direction. > However, the fact that a's median varies from 8.7 in the old experiment to > 8.95 in this experiment does give some pause. These are after all supposedly > drawn from the same distribution. 
Or did you use a different machine or > different OS version or some such in the two sets of measurements? > Different time of day and thus different ambient temperature, thus different > CPU clock frequency? All the testing setups were identical except for the testing time. Old a median : 8.7 New a median : 8.95 Old b median : 8.2 New b median : 8.45 I'm a bit surprised that both new medians are exactly 0.25 greater than the old medians. Coincidence? > Assuming identical test setups, let's try the old value of 8.7 from old a to new > b. There are 14 elements in new b greater than 8.6, for a probability of > 0.17%, or about 98.3% significance. This is still OK. > > In contrast, the median of the old b is 8.2, which gives extreme confidence. > So let's be conservative and use the large-set median. > > In real life, additional procedures would be needed to estimate the > confidence in the median, which turns out to be nontrivial. When I apply Luckily, I could just simply pick up the medians in numerical order in this case. ;-) > this sort of technique, I usually have all data from each sample being on one > side of the median of the other, which simplifies things. ;-) I like it when all data points are on one side of the median of the other ;-) But this also relies on either much less noise during testing or a big enough difference between a and b, right? > The easiest way to estimate bounds on the median is to "bootstrap", but that > works best if you have 1000 samples and can randomly draw 1000 > sub-samples each of size 10 from the larger sample and compute the median of > each. You can sort these medians and obtain a cumulative distribution. Good to know "bootstrap". > But you have to have an extremely good reason to collect data from 1000 > boots, and I don't believe we have that good of a reason. > 1000 boots, Oh my ... No. No. I don't have a good reason for that ;-) > > Analysis 2: > > a's median - 0.4 = 8.9 - 0.4 = 8.5. 
> > 24/48 b's data points are less than 0.4 less than a's median. > > The probability that a's data points are less than 8.5 is p = 7/48 > > = 0.1458 > This is only 85.4% significant, so... > > > For a's binomial distribution P(X >= 24) = 0.0%, where p=0.1458. > > So, looks like we have confidence that b is 400ms faster than a. > > ...we really cannot say anything about 400ms faster. Again, you need 95% > and preferably 99% to really make any sort of claim. You probably need > quite a few more samples to say much about 200ms, let alone 400ms. OK. Thanks for correcting me. > > Plus, you really should select the speedup and only then take the > measurements. Otherwise, you end up fitting noise. > > However, assuming identical tests setups, you really can calculate the median > from the full data set. > > > The calculated cumulative binomial distributions P(X) is via: > > > > https://www.gigacalculator.com/calculators/binomial-probability-calcul > > ator.php > > The maxima program's binomial() function agrees with it, so good. ;-) > > > I apologize if this analysis/discussion bored some of you. ;-) > > Let's just say that it is a lot simpler when you are measuring larger > differences in data with tighter distributions. Me, I usually just say "no" to > drawing any sort of conclusion from data sets that overlap this much. > Instead, I might check to see if there is some random events adding noise to > the boot duration, eliminate that, and hopefully get data that is easier to > analyze. Agree. > But I am good with the 98.3% confidence in a 100ms improvement. > > So if Joel wishes to make this point, he should feel free to take both of your > datasets and use the computation with the worse mean. Thank you so much Paul for your patience and detailed comments. -Qiuxu
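The binomial ("sign test") arithmetic in this exchange needs nothing beyond `math.comb`. A sketch of the computation, using the n=48, k=35, p=0.5 case from Analysis 1 (the 48-sample data sets themselves are not reproduced in the thread, so only the tail probability can be checked here):

```python
from math import comb

def binom_tail(n: int, k: int, p: float = 0.5) -> float:
    """P(X >= k) for X ~ Binomial(n, p)."""
    return sum(comb(n, i) * p**i * (1 - p)**(n - i) for i in range(k, n + 1))

# 35 of 48 points fell below (median - 0.1); under the p=0.5 null
# hypothesis this tail probability is on the order of 0.1%.
print(f"P(X >= 35 | n=48, p=0.5) = {binom_tail(48, 35):.6f}")
```

This is the same quantity the online binomial calculator mentioned below computes; a tail probability this small is what justifies the "strong confidence that b is 100ms faster than a" claim.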
On Thu, Mar 09, 2023 at 03:17:09PM +0000, Zhuo, Qiuxu wrote: > > From: Paul E. McKenney <paulmck@kernel.org> > > [...] > > > > > > a's standard deviation is ~0.4. > > > b's standard deviation is ~0.5. > > > > > > a's average 9.0 is at the upbound of the standard deviation of b's [8.0, 9]. > > > So, the measurements should be statistically significant to some degree. > > > > That single standard deviation means that you have 68% confidence that the > > difference is real. This is not far above the 50% leval of random noise. > > 95% is the lowest level that is normally considered to be statistically > > significant. > > 95% means there is no overlap between two standard deviations of a > and two standard deviations of b. > > This relies on either much less noise during testing or a big enough > difference between a and b. > > > > The calculated standard deviations are via: > > > https://www.gigacalculator.com/calculators/standard-deviation-calculat > > > or.php > > > > Fair enough. Formulas are readily available as well, and most spreadsheets > > support standard deviation. > > > > [...] > > > > > > Why don't you try applying this approach to the new data? You will > > > > need the general binomial formula. > > > > > > Thank you Paul for the suggestion. > > > I just tried it, but not sure whether my analysis was correct ... > > > > > > Analysis 1: > > > a's median is 8.9. > > > > I get 8.95, which is the average of the 24th and 25th members of a in > > numerical order. > > Yes, it should be 8.95. Thanks for correcting me. > > > > 35/48 b's data points are less than 0.1 less than a's median. > > > For a's binomial distribution P(X >= 35) = 0.1%, where p=0.5. > > > So, we have strong confidence that b is 100ms faster than a. > > > > I of course get quite a bit stronger confidence, but your 99.9% is good > > enough. And I get even stronger confidence going in the other direction. 
> > However, the fact that a's median varies from 8.7 in the old experiment to > > 8.95 in this experiment does give some pause. These are after all supposedly > > drawn from the same distribution. Or did you use a different machine or > > different OS version or some such in the two sets of measurements? > > Different time of day and thus different ambient temperature, thus different > > CPU clock frequency? > > All the testing setups were identical except for the testing time. > > Old a median : 8.7 > New a median : 8.95 > > Old b median : 8.2 > New b median : 8.45 > > I'm a bit surprised that both new medians are exactly greater 0.25 more than > the old medians. Coincidence? Possibly some semi-rare race condition makes boot take longer, and 48 boots has a higher probability of getting more of them? But without analyzing the boot sequence, your guess is as good as mine. > > Assuming identical test setups, let's try the old value of 8.7 from old a to new > > b. There are 14 elements in new b greater than 8.6, for a probability of > > 0.17%, or about 98.3% significance. This is still OK. > > > > In contrast, the median of the old b is 8.2, which gives extreme confidence. > > So let's be conservative and use the large-set median. > > > > In real life, additional procedures would be needed to estimate the > > confidence in the median, which turns oout to be nontrivial. When I apply > > Luckily, I could just simply pick up the medians in numerical order in this case. ;-) > > > this sort of technique, I usually have all data from each sample being on one > > side of the median of the other, which simplifies things. ;-) > > I like all data points are on one side of the median of the other ;-) > > But this also relies on either much less noise during testing or a big enough > difference between a and b, right? Yes, life is indeed *much* easier when there is less noise or larger differences. 
;-) > > The easiest way to estimate bounds on the median is to "bootstrap", but that > > works best if you have 1000 samples and can randomly draw 1000 sub- > > samples each of size 10 from the larger sample and compute the median of > > each. You can sort these medians and obtain a cumulative distribution. > > Good to know "bootstap". > > > But you have to have an extremely good reason to collect data from 1000 > > boots, and I don't believe we have that good of a reason. > > > > 1000 boots, Oh my ... > No. No. I don't have a good reason for that ;-) > > > > Analysis 2: > > > a's median - 0.4 = 8.9 - 0.4 = 8.5. > > > 24/48 b's data points are less than 0.4 less than a's median. > > > The probability that a's data points are less than 8.5 is p = 7/48 > > > = 0.1458 > > This is only 85.4% significant, so... > > > > > For a's binomial distribution P(X >= 24) = 0.0%, where p=0.1458. > > > So, looks like we have confidence that b is 400ms faster than a. > > > > ...we really cannot say anything about 400ms faster. Again, you need 95% > > and preferably 99% to really make any sort of claim. You probably need > > quite a few more samples to say much about 200ms, let alone 400ms. > > OK. Thanks for correcting me. > > > Plus, you really should select the speedup and only then take the > > measurements. Otherwise, you end up fitting noise. > > > > However, assuming identical tests setups, you really can calculate the median > > from the full data set. > > > > > The calculated cumulative binomial distributions P(X) is via: > > > > > > https://www.gigacalculator.com/calculators/binomial-probability-calcul > > > ator.php > > > > The maxima program's binomial() function agrees with it, so good. ;-) > > > > > I apologize if this analysis/discussion bored some of you. ;-) > > > > Let's just say that it is a lot simpler when you are measuring larger > > differences in data with tighter distributions. 
Me, I usually just say "no" to > > drawing any sort of conclusion from data sets that overlap this much. > > Instead, I might check to see if there is some random events adding noise to > > the boot duration, eliminate that, and hopefully get data that is easier to > > analyze. > > Agree. > > > But I am good with the 98.3% confidence in a 100ms improvement. > > > > So if Joel wishes to make this point, he should feel free to take both of your > > datasets and use the computation with the worse mean. > > Thank you so much Paul for your patience and detailed And thank you for bearing with me. Thanx, Paul
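Paul's "bootstrap" recipe above — resample with replacement, take each resample's median, and read confidence bounds off the sorted medians — fits in a few lines. This is a generic illustration of that procedure (the 48-sample sets are not reproduced in the thread, so the small 10-sample "w/ patch" data from earlier is used as a stand-in):

```python
import random
from statistics import median

def bootstrap_median(data, n_resamples=1000, alpha=0.05, seed=0):
    """Percentile-bootstrap interval for the median, as described above."""
    rng = random.Random(seed)
    meds = sorted(median(rng.choices(data, k=len(data)))
                  for _ in range(n_resamples))
    lo = meds[int(n_resamples * alpha / 2)]
    hi = meds[int(n_resamples * (1 - alpha / 2)) - 1]
    return lo, hi

boots = [8.5, 8.2, 7.6, 8.2, 8.7, 8.2, 7.8, 8.2, 9.3, 8.4]
print(bootstrap_median(boots))
```

As Paul notes, this really wants far more samples than 10 to be trustworthy; with tiny samples the interval mostly reflects the handful of distinct values present.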
On Thu, Mar 09, 2023 at 01:57:42PM +0100, Uladzislau Rezki wrote: [..] > > > > > > See this commit: > > > > > > > > > > > > 3705b88db0d7cc ("rcu: Add a module parameter to force use of > > > > > > expedited RCU primitives") > > > > > > > > > > > > Antti provided this commit precisely in order to allow Android > > > > > > devices to expedite the boot process and to shut off the > > > > > > expediting at a time of Android userspace's choosing. So Android > > > > > > has been making this work for about ten years, which strikes me > > > > > > as an adequate proof of concept. ;-) > > > > > > > > > > Thanks for the pointer. That's true. Looking at Android sources, I > > > > > find that Android Mediatek devices at least are setting > > > > > rcu_expedited to 1 at late stage of their userspace boot (which is > > > > > weird, it should be set to 1 as early as possible), and > > > > > interestingly I cannot find them resetting it back to 0!. Maybe > > > > > they set rcu_normal to 1? But I cannot find that either. Vlad? :P > > > > > > > > Interesting. Though this is consistent with Antti's commit log, > > > > where he talks about expediting grace periods but not unexpediting > > > > them. > > > > > > > Do you think we need to unexpedite it? :)))) > > > > Android runs on smallish systems, so quite possibly not! > > > We keep it enabled and never unexpedite it. The reason is a performance. I > have done some app-launch time analysis with enabling and disabling of it. > > An expedited case is much better when it comes to app launch time. It > requires ~25% less time to run an app comparing with unexpedited variant. > So we have a big gain here. Wow, that's huge. I wonder if you can dig deeper and find out why that is so as the callbacks may need to be synchronize_rcu_expedited() then, as it could be slowing down other usecases! 
I find it hard to believe that real-time workloads will run better without those callbacks being always-expedited if it actually gives back 25% in performance! thanks, - Joel
Hi, Let me chime in on this interesting thread. On Thu, 9 Mar 2023 13:53:39 -0800, Paul E. McKenney wrote: > On Thu, Mar 09, 2023 at 03:17:09PM +0000, Zhuo, Qiuxu wrote: >> > From: Paul E. McKenney <paulmck@kernel.org> >> > [...] >> > > >> > > a's standard deviation is ~0.4. >> > > b's standard deviation is ~0.5. >> > > >> > > a's average 9.0 is at the upbound of the standard deviation of b's [8.0, 9]. >> > > So, the measurements should be statistically significant to some degree. >> > >> > That single standard deviation means that you have 68% confidence that the >> > difference is real. This is not far above the 50% leval of random noise. >> > 95% is the lowest level that is normally considered to be statistically >> > significant. >> >> 95% means there is no overlap between two standard deviations of a >> and two standard deviations of b. >> >> This relies on either much less noise during testing or a big enough >> difference between a and b. Appended is a histogram comparing the 2 data sets. As you see, the one with the v2 patch is far from a normal distribution. I think there are at least two peaks. The peak at the right, around 9.7, seems not affected by the patch. In such a case, the average and standard deviation of all the data don't tell much. It is hard to say anything for sure with such a small set of samples. And the shape of the plot is likely to be highly dependent on machine setups. Hope this helps. Thanks, Akira >> [...]
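Akira's observation about the shape of the distribution is easy to check without plotting tools: bin the samples and print a text histogram. A sketch (the v2 data set behind his histogram is not attached in this thread, so the 10-sample data from earlier is used as a stand-in):

```python
from collections import Counter

def bin_counts(samples, bin_width=0.2):
    """Bin samples into bin_width-second buckets; keys are tenths of a second
    (integers, to dodge float binning artifacts)."""
    step = int(bin_width * 10)
    return Counter((int(round(s * 10)) // step) * step for s in samples)

def print_histogram(samples, bin_width=0.2):
    bins = bin_counts(samples, bin_width)
    step = int(bin_width * 10)
    for left in sorted(bins):
        print(f"{left / 10:.1f}-{(left + step) / 10:.1f}s | {'#' * bins[left]}")

print_histogram([8.5, 8.2, 7.6, 8.2, 8.7, 8.2, 7.8, 8.2, 9.3, 8.4])
```

A bimodal data set shows up immediately as two separated clusters of `#` runs, which is exactly the situation where a single mean and standard deviation stop being informative.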
On Fri, Mar 10, 2023 at 09:11:54AM +0900, Akira Yokosawa wrote: > Hi, > > Let me chime in this interesting thread. > > On Thu, 9 Mar 2023 13:53:39 -0800, Paul E. McKenney wrote: > > On Thu, Mar 09, 2023 at 03:17:09PM +0000, Zhuo, Qiuxu wrote: > >> > From: Paul E. McKenney <paulmck@kernel.org> > >> > [...] > >> > > > >> > > a's standard deviation is ~0.4. > >> > > b's standard deviation is ~0.5. > >> > > > >> > > a's average 9.0 is at the upbound of the standard deviation of b's [8.0, 9]. > >> > > So, the measurements should be statistically significant to some degree. > >> > > >> > That single standard deviation means that you have 68% confidence that the > >> > difference is real. This is not far above the 50% leval of random noise. > >> > 95% is the lowest level that is normally considered to be statistically > >> > significant. > >> > >> 95% means there is no overlap between two standard deviations of a > >> and two standard deviations of b. > >> > >> This relies on either much less noise during testing or a big enough > >> difference between a and b. > > Appended is a histogram comparing 2 data sets. > > As you see, the one with v2 patch is far from normal distribution. > I think there is at least two peaks. > The one at the right around 9.7 seems not affected by the patch. > In such a case, average and standard deviation of all the data don't > tell much. > > It is hard to say anything for sure with such small set of samples. > And the shape of the plot is likely to be highly dependent on machine > setups. > > Hope this helps. Thank you, Akira! Definitely an abnormal distribution! ;-) Thanx, Paul
> From: Akira Yokosawa <akiyks@gmail.com> > Sent: Friday, March 10, 2023 8:12 AM > To: paulmck@kernel.org; Zhuo, Qiuxu <qiuxu.zhuo@intel.com> > Cc: frederic@kernel.org; jiangshanlai@gmail.com; joel@joelfernandes.org; > linux-doc@vger.kernel.org; linux-kernel@vger.kernel.org; > rcu@vger.kernel.org; urezki@gmail.com; Akira Yokosawa > <akiyks@gmail.com> > Subject: Re: [PATCH v3] rcu: Add a minimum time for marking boot as > completed > > Hi, > > Let me chime in this interesting thread. > > On Thu, 9 Mar 2023 13:53:39 -0800, Paul E. McKenney wrote: > > On Thu, Mar 09, 2023 at 03:17:09PM +0000, Zhuo, Qiuxu wrote: > >> > From: Paul E. McKenney <paulmck@kernel.org> [...] > >> > > > >> > > a's standard deviation is ~0.4. > >> > > b's standard deviation is ~0.5. > >> > > > >> > > a's average 9.0 is at the upbound of the standard deviation of b's [8.0, > 9]. > >> > > So, the measurements should be statistically significant to some > degree. > >> > > >> > That single standard deviation means that you have 68% confidence > >> > that the difference is real. This is not far above the 50% leval of random > noise. > >> > 95% is the lowest level that is normally considered to be > >> > statistically significant. > >> > >> 95% means there is no overlap between two standard deviations of a > >> and two standard deviations of b. > >> > >> This relies on either much less noise during testing or a big enough > >> difference between a and b. > > Appended is a histogram comparing 2 data sets. > > As you see, the one with v2 patch is far from normal distribution. > I think there is at least two peaks. > The one at the right around 9.7 seems not affected by the patch. > In such a case, average and standard deviation of all the data don't tell much. > > It is hard to say anything for sure with such small set of samples. > And the shape of the plot is likely to be highly dependent on machine setups. > > Hope this helps. 
Thank you Yokosawa for sharing the histogram to provide an intuitive view of these data points and your analysis. ;-) -Qiuxu
On Thu, Mar 09, 2023 at 10:10:56PM +0000, Joel Fernandes wrote: > On Thu, Mar 09, 2023 at 01:57:42PM +0100, Uladzislau Rezki wrote: > [..] > > > > > > > See this commit: > > > > > > > > > > > > > > 3705b88db0d7cc ("rcu: Add a module parameter to force use of > > > > > > > expedited RCU primitives") > > > > > > > > > > > > > > Antti provided this commit precisely in order to allow Android > > > > > > > devices to expedite the boot process and to shut off the > > > > > > > expediting at a time of Android userspace's choosing. So Android > > > > > > > has been making this work for about ten years, which strikes me > > > > > > > as an adequate proof of concept. ;-) > > > > > > > > > > > > Thanks for the pointer. That's true. Looking at Android sources, I > > > > > > find that Android Mediatek devices at least are setting > > > > > > rcu_expedited to 1 at late stage of their userspace boot (which is > > > > > > weird, it should be set to 1 as early as possible), and > > > > > > interestingly I cannot find them resetting it back to 0!. Maybe > > > > > > they set rcu_normal to 1? But I cannot find that either. Vlad? :P > > > > > > > > > > Interesting. Though this is consistent with Antti's commit log, > > > > > where he talks about expediting grace periods but not unexpediting > > > > > them. > > > > > > > > > Do you think we need to unexpedite it? :)))) > > > > > > Android runs on smallish systems, so quite possibly not! > > > > > We keep it enabled and never unexpedite it. The reason is a performance. I > > have done some app-launch time analysis with enabling and disabling of it. > > > > An expedited case is much better when it comes to app launch time. It > > requires ~25% less time to run an app comparing with unexpedited variant. > > So we have a big gain here. > > Wow, that's huge. I wonder if you can dig deeper and find out why that is so > as the callbacks may need to be synchronize_rcu_expedited() then, as it could > be slowing down other usecases! 
I find it hard to believe that real-time > workloads will run better without those callbacks being always-expedited if > it actually gives back 25% in performance! > I can dig further, but on a high level I think there are some spots > which show better performance if expedited is set. I mean synchronize_rcu() > becomes a "less blocking" context from a time point of view. > > The problem of a regular synchronize_rcu() is that it can trigger big latency > delays for a caller. For example, for the nocb case we do not know where in a list > our callback is located and when it is invoked to unblock a caller. > > I have already mentioned this somewhere. Probably it makes sense to directly wake up > callers from the GP kthread instead and not via the nocb-kthread that invokes our callbacks > one by one. -- Uladzislau Rezki
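Uladzislau's point about where the waiter sits in the callback list can be put into a toy model: a synchronize_rcu() caller waits for the grace period plus the invocation of every callback queued ahead of its own wakeup callback, whereas a direct wakeup from the GP kthread would drop that second term. All numbers below are made up purely for illustration:

```python
def waiter_latency_via_cblist(gp_ms, cb_costs_ms, waiter_pos):
    """Grace period plus time to invoke every callback queued ahead of the
    waiter's wakeup callback (today's nocb-style path)."""
    return gp_ms + sum(cb_costs_ms[:waiter_pos])

def waiter_latency_direct(gp_ms):
    """GP kthread wakes the synchronize_rcu() caller directly."""
    return gp_ms

gp = 20.0              # hypothetical grace-period duration, ms
costs = [0.05] * 1000  # 1000 queued callbacks at 50us each (made up)
print(waiter_latency_via_cblist(gp, costs, 1000))  # ~70 ms
print(waiter_latency_direct(gp))                   # 20 ms
```

The model also hints at Paul's caveat below: if the GP kthread instead performs many wakeups itself, those wakeups become a serial cost on the grace-period path, so the win is not free.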
On Fri, Mar 10, 2023 at 09:55:02AM +0100, Uladzislau Rezki wrote: > On Thu, Mar 09, 2023 at 10:10:56PM +0000, Joel Fernandes wrote: > > On Thu, Mar 09, 2023 at 01:57:42PM +0100, Uladzislau Rezki wrote: > > [..] > > > > > > > > See this commit: > > > > > > > > > > > > > > > > 3705b88db0d7cc ("rcu: Add a module parameter to force use of > > > > > > > > expedited RCU primitives") > > > > > > > > > > > > > > > > Antti provided this commit precisely in order to allow Android > > > > > > > > devices to expedite the boot process and to shut off the > > > > > > > > expediting at a time of Android userspace's choosing. So Android > > > > > > > > has been making this work for about ten years, which strikes me > > > > > > > > as an adequate proof of concept. ;-) > > > > > > > > > > > > > > Thanks for the pointer. That's true. Looking at Android sources, I > > > > > > > find that Android Mediatek devices at least are setting > > > > > > > rcu_expedited to 1 at late stage of their userspace boot (which is > > > > > > > weird, it should be set to 1 as early as possible), and > > > > > > > interestingly I cannot find them resetting it back to 0!. Maybe > > > > > > > they set rcu_normal to 1? But I cannot find that either. Vlad? :P > > > > > > > > > > > > Interesting. Though this is consistent with Antti's commit log, > > > > > > where he talks about expediting grace periods but not unexpediting > > > > > > them. > > > > > > > > > > > Do you think we need to unexpedite it? :)))) > > > > > > > > Android runs on smallish systems, so quite possibly not! > > > > > > > We keep it enabled and never unexpedite it. The reason is a performance. I > > > have done some app-launch time analysis with enabling and disabling of it. > > > > > > An expedited case is much better when it comes to app launch time. It > > > requires ~25% less time to run an app comparing with unexpedited variant. > > > So we have a big gain here. > > > > Wow, that's huge. 
I wonder if you can dig deeper and find out why that is so > > as the callbacks may need to be synchronize_rcu_expedited() then, as it could > > be slowing down other usecases! I find it hard to believe, real-time > > workloads will run better without those callbacks being always-expedited if > > it actually gives back 25% in performance! > > > I can dig further, but on a high level i think there are some spots > which show better performance if expedited is set. I mean synchronize_rcu() > becomes as "less blocking a context" from a time point of view. > > The problem of a regular synchronize_rcu() is - it can trigger a big latency > delays for a caller. For example for nocb case we do not know where in a list > our callback is located and when it is invoked to unblock a caller. True, expedited RCU grace periods do not have this callback-invocation delay that normal RCU does. > I have already mentioned somewhere. Probably it makes sense to directly wake-up > callers from the GP kthread instead and not via nocb-kthread that invokes our callbacks > one by one. Makes sense, but it is necessary to be careful. Wakeups are not fast, so making the RCU grace-period kthread do them all sequentially is not a strategy to win. For example, note that the next expedited grace period can start before the previous expedited grace period has finished its wakeups. Thanx, Paul
On Sat, Mar 11, 2023 at 1:24 AM Paul E. McKenney <paulmck@kernel.org> wrote: > > On Fri, Mar 10, 2023 at 09:55:02AM +0100, Uladzislau Rezki wrote: > > On Thu, Mar 09, 2023 at 10:10:56PM +0000, Joel Fernandes wrote: > > > On Thu, Mar 09, 2023 at 01:57:42PM +0100, Uladzislau Rezki wrote: > > > [..] > > > > > > > > > See this commit: > > > > > > > > > > > > > > > > > > 3705b88db0d7cc ("rcu: Add a module parameter to force use of > > > > > > > > > expedited RCU primitives") > > > > > > > > > > > > > > > > > > Antti provided this commit precisely in order to allow Android > > > > > > > > > devices to expedite the boot process and to shut off the > > > > > > > > > expediting at a time of Android userspace's choosing. So Android > > > > > > > > > has been making this work for about ten years, which strikes me > > > > > > > > > as an adequate proof of concept. ;-) > > > > > > > > > > > > > > > > Thanks for the pointer. That's true. Looking at Android sources, I > > > > > > > > find that Android Mediatek devices at least are setting > > > > > > > > rcu_expedited to 1 at late stage of their userspace boot (which is > > > > > > > > weird, it should be set to 1 as early as possible), and > > > > > > > > interestingly I cannot find them resetting it back to 0!. Maybe > > > > > > > > they set rcu_normal to 1? But I cannot find that either. Vlad? :P > > > > > > > > > > > > > > Interesting. Though this is consistent with Antti's commit log, > > > > > > > where he talks about expediting grace periods but not unexpediting > > > > > > > them. > > > > > > > > > > > > > Do you think we need to unexpedite it? :)))) > > > > > > > > > > Android runs on smallish systems, so quite possibly not! > > > > > > > > > We keep it enabled and never unexpedite it. The reason is a performance. I > > > > have done some app-launch time analysis with enabling and disabling of it. > > > > > > > > An expedited case is much better when it comes to app launch time. 
> > > > It requires ~25% less time to run an app compared with the unexpedited
> > > > variant.  So we have a big gain here.
> > >
> > > Wow, that's huge.  I wonder if you can dig deeper and find out why that
> > > is so, as the callbacks may need to be synchronize_rcu_expedited() then,
> > > as it could be slowing down other usecases!  I find it hard to believe
> > > real-time workloads will run better without those callbacks being
> > > always-expedited if it actually gives back 25% in performance!
> >
> > I can dig further, but on a high level I think there are some spots
> > which show better performance if expedited is set.  I mean synchronize_rcu()
> > becomes a "less blocking context" from a time point of view.
> >
> > The problem with a regular synchronize_rcu() is that it can trigger big
> > latency delays for a caller.  For example, in the nocb case we do not know
> > where in a list our callback is located and when it is invoked to unblock
> > a caller.
>
> True, expedited RCU grace periods do not have this callback-invocation
> delay that normal RCU does.
>
> > As I have already mentioned somewhere, it probably makes sense to directly
> > wake up callers from the GP kthread instead, and not via the nocb kthread
> > that invokes our callbacks one by one.
>
> Makes sense, but it is necessary to be careful.  Wakeups are not fast,
> so making the RCU grace-period kthread do them all sequentially is not
> a strategy to win.  For example, note that the next expedited grace
> period can start before the previous expedited grace period has finished
> its wakeups.

The kthreads could be undergoing scheduler contention too, especially since
the workload is launching an app, if I understand Vlad's usecase.  Hence my
desire for an rcutop one-stop tool which shows all these things (RCU kthread
scheduler delays, callback latencies, etc.). ;-)  The more issues I run into,
the more urgent that tool, which I'm working on, becomes...

thanks,

 - Joel
On Sat, Mar 04, 2023 at 04:51:45AM +0000, Joel Fernandes wrote:
> Hi Paul,
>
> On Fri, Mar 03, 2023 at 05:02:51PM -0800, Paul E. McKenney wrote:
> [..]
> > > Qiuxu also noted impressive boot-time improvements with an earlier
> > > version of the patch.  An excerpt from the data he shared:

Now that we have the measurement methodology put to bed...

[ . . . ]

> > Mightn't this be simpler if the user was only permitted to write zero,
> > thus just saying "stop immediately"?  If people really need the ability
> > to extend or shorten the time, a patch can be produced at that point.
> > And then a non-zero write to the file would become legal.
>
> I prefer to keep it this way, as with this method I not only get a
> variable rcu_boot_end_delay via the boot parameter (as in my first patch),
> I also don't need to add a separate sysfs entry, and can just reuse the
> 'rcu_boot_end_delay' parameter, which I also had in my first patch.
> Adding yet another sysfs parameter would actually complicate things even
> more and add more lines of code.
>
> I tested different scenarios and they work fine; though I unfortunately
> missed that mutex locking, I did verify by manual testing that the
> different test cases work as expected.

Except that you don't need that extra sysfs value.  You could instead use
any of a number of state variables that tell you that early boot is done.
If the state says early boot (as in parsing the kernel command line),
make the code act as it does now.  Otherwise, make it accept only zero.

If there really is some system that wants to set one time limit via
the kernel boot parameter and set another at some time during boot,
there are very simple userspace facilities to make this happen.

And there is also a smaller state space and less testing to be done,
benefits which accrue on an ongoing basis.

							Thanx, Paul

> Here are some printks from simple testing in Qemu:
>
> 1. 
End the boot early, CONFIG is set to 120 seconds: > ================================================== > [ 1.614968] rcu_boot_end_delay = 120000 > [ 1.617630] schedule delayed work joel > > Boot took 1.57 seconds > root@(none):/# cat /sys/module/rcupdate/parameters/rcu_boot_end_delay > 120000 > root@(none):/# > root@(none):/# > root@(none):/# echo 0 > /sys/module/rcupdate/parameters/rcu_boot_end_delay > [ 10.108394] param called joel > [ 10.110520] sys calling boot ended > [ 10.112730] rcu_boot_end_delay = 0 > [ 10.115017] boot ended joel > ----------------------------------------------- > > 2. End the boot passing in rcupdate.rcu_boot_end_delay as 10s. > This should override the CONFIG of 120 seconds: > ================================================== > [ 1.700090] rcu_boot_end_delay = 10000 > [ 1.702628] schedule delayed work joel > > Boot took 1.64 seconds > > root@(none):/# [ 10.414008] rcu_boot_end_delay = 10000 > [ 10.416670] boot ended joel > ----------------------------------------------- > > 3. 
Do the same thing as #2, but extend the boot via sysfs to be longer than > 10 seconds: > ================================================== > [ 0.060025] param called joel > [ 0.060026] param called too early joel > [ 1.663905] rcu_boot_end_delay = 10000 > [ 1.667051] schedule delayed work joel > > Boot took 1.61 seconds > > root@(none):/# > root@(none):/# echo 20000 > /sys/module/rcupdate/parameters/rcu_boot_end_delay > [ 6.932517] param called joel > [ 6.934637] sys calling boot ended > [ 6.936845] rcu_boot_end_delay = 20000 > [ 6.939291] schedule delayed work joel > root@(none):/# [ 10.389366] rcu_boot_end_delay = 20000 > [ 10.392047] schedule delayed work joel > [ 20.117416] rcu_boot_end_delay = 20000 > [ 20.120073] boot ended joel > ----------------------------------------------- > > The debug patch is here: https://git.kernel.org/pub/scm/linux/kernel/git/jfern/linux.git/commit/?h=rcu/lazy/postboot > > Appended is the updated v4 patch, tested as shown above; more testing is in progress. > > thanks, > > - Joel > > ---8<----------------------- > > From: "Joel Fernandes (Google)" <joel@joelfernandes.org> > Subject: [PATCH v4] rcu: Add a minimum time for marking boot as completed > > On many systems, a great deal of boot (in userspace) happens after the > kernel thinks the boot has completed. It is difficult to determine if > the system has really booted from the kernel side. Some features like > lazy-RCU can risk slowing down boot time if, say, a callback has been > added that the boot synchronously depends on. Further, expedited callbacks > can get unexpedited way earlier than they should be, thus slowing down > boot (as shown in the data below). > > For these reasons, this commit adds a config option > 'CONFIG_RCU_BOOT_END_DELAY' and a boot parameter rcupdate.rcu_boot_end_delay. 
> Userspace can also mark the system as booted from RCU's perspective by writing the > time in milliseconds to: /sys/module/rcupdate/parameters/rcu_boot_end_delay > Or even just writing a value of 0 to this sysfs node. > However, under no circumstance will the boot be allowed to end earlier > than just before init is launched. > > The default value of CONFIG_RCU_BOOT_END_DELAY is chosen as 15s. This > suits ChromeOS and also the PREEMPT_RT system below very well, which need > no config or parameter changes, just a simple application of this patch. A > system designer can also choose a specific value here to keep RCU from marking > boot completion. As noted earlier, the system will not be marked as booted > from RCU's perspective until at least rcu_boot_end_delay milliseconds have passed > or an update is made via writing a small value (or 0) in milliseconds to: > /sys/module/rcupdate/parameters/rcu_boot_end_delay. > > One side-effect of this patch is that there is a risk that a real-time workload > launched just after the kernel boots will suffer interruptions due to expedited > RCU, which previously ended just before init was launched. However, to mitigate > such an issue (however unlikely), the user should either tune > CONFIG_RCU_BOOT_END_DELAY to a value smaller than 15 seconds or write a value > of 0 to /sys/module/rcupdate/parameters/rcu_boot_end_delay once userspace > boots, and before launching the real-time workload. > > Qiuxu also noted impressive boot-time improvements with an earlier version > of the patch. An excerpt from the data he shared: > > 1) Testing environment: > OS : CentOS Stream 8 (non-RT OS) > Kernel : v6.2 > Machine : Intel Cascade Lake server (2 sockets, each with 44 logical threads) > Qemu args : -cpu host -enable-kvm, -smp 88,threads=2,sockets=2, … > > 2) OS boot time definition: > The time from the start of the kernel boot until the shell command-line > prompt is shown on the console. [ Different people may have > different OS boot time definitions. 
] > > 3) Measurement method (very rough method): > A timer in the kernel periodically prints the boot time every 100ms. > As soon as the shell command line prompt is shown from the console, > we record the boot time printed by the timer, then the printed boot > time is the OS boot time. > > 4) Measured OS boot time (in seconds) > a) Measured 10 times w/o this patch: > 8.7s, 8.4s, 8.6s, 8.2s, 9.0s, 8.7s, 8.8s, 9.3s, 8.8s, 8.3s > The average OS boot time was: ~8.7s > > b) Measure 10 times w/ this patch: > 8.5s, 8.2s, 7.6s, 8.2s, 8.7s, 8.2s, 7.8s, 8.2s, 9.3s, 8.4s > The average OS boot time was: ~8.3s. > > option-prefix PATCH v4 > option-start > Tested-by: Qiuxu Zhuo <qiuxu.zhuo@intel.com> > Signed-off-by: Joel Fernandes (Google) <joel@joelfernandes.org> > > diff-note-start > v1->v2: > Update some comments and description. > v2->v3: > Add sysfs param, and update with Test data. > v3->v4: > Fix locking bug found by Paul, make code more robust > by refactoring locking code. > Doc updates. > --- > .../admin-guide/kernel-parameters.txt | 15 ++++ > cc_list | 8 ++ > kernel/rcu/Kconfig | 21 ++++++ > kernel/rcu/update.c | 74 ++++++++++++++++++- > 4 files changed, 116 insertions(+), 2 deletions(-) > create mode 100644 cc_list > > diff --git a/Documentation/admin-guide/kernel-parameters.txt b/Documentation/admin-guide/kernel-parameters.txt > index 2429b5e3184b..878c2780f5db 100644 > --- a/Documentation/admin-guide/kernel-parameters.txt > +++ b/Documentation/admin-guide/kernel-parameters.txt > @@ -5085,6 +5085,21 @@ > rcutorture.verbose= [KNL] > Enable additional printk() statements. > > + rcupdate.rcu_boot_end_delay= [KNL] > + Minimum time in milliseconds from the start of boot > + that must elapse before the boot sequence can be marked > + complete from RCU's perspective, after which RCU's > + behavior becomes more relaxed. The default value is also > + configurable via CONFIG_RCU_BOOT_END_DELAY. 
> + Userspace can also mark the boot as completed > + sooner by writing the time in milliseconds, say once > + userspace considers the system as booted, to: > + /sys/module/rcupdate/parameters/rcu_boot_end_delay > + Or even just writing a value of 0 to this sysfs node. > + The sysfs node can also be used to extend the delay > + to be larger than the default, assuming the marking > + of boot complete has not yet occurred. > + > rcupdate.rcu_cpu_stall_ftrace_dump= [KNL] > Dump ftrace buffer after reporting RCU CPU > stall warning. > diff --git a/cc_list b/cc_list > new file mode 100644 > index 000000000000..7daed4877f5a > --- /dev/null > +++ b/cc_list > @@ -0,0 +1,8 @@ > +Frederic Weisbecker <frederic@kernel.org> > +Joel Fernandes <joel@joelfernandes.org> > +Lai Jiangshan <jiangshanlai@gmail.com> > +linux-doc@vger.kernel.org > +linux-kernel@vger.kernel.org > +"Paul E. McKenney" <paulmck@kernel.org> > +rcu@vger.kernel.org > +urezki@gmail.com > diff --git a/kernel/rcu/Kconfig b/kernel/rcu/Kconfig > index 9071182b1284..97f68120d1c0 100644 > --- a/kernel/rcu/Kconfig > +++ b/kernel/rcu/Kconfig > @@ -217,6 +217,27 @@ config RCU_BOOST_DELAY > > Accept the default if unsure. > > +config RCU_BOOT_END_DELAY > + int "Minimum time before RCU may consider in-kernel boot as completed" > + range 0 120000 > + default 15000 > + help > + Default value of the minimum time in milliseconds from the start of boot > + that must elapse before the boot sequence can be marked complete from RCU's > + perspective, after which RCU's behavior becomes more relaxed. > + Userspace can also mark the boot as completed sooner than this default > + by writing the time in milliseconds, say once userspace considers > + the system as booted, to: /sys/module/rcupdate/parameters/rcu_boot_end_delay. > + Or even just writing a value of 0 to this sysfs node. The sysfs node can > + also be used to extend the delay to be larger than the default, assuming > + the marking of boot completion has not yet occurred. 
> + > + The actual delay for RCU's view of the system to be marked as booted can be > + higher than this value if the kernel takes a long time to initialize but it > + will never be smaller than this value. > + > + Accept the default if unsure. > + > config RCU_EXP_KTHREAD > bool "Perform RCU expedited work in a real-time kthread" > depends on RCU_BOOST && RCU_EXPERT > diff --git a/kernel/rcu/update.c b/kernel/rcu/update.c > index 19bf6fa3ee6a..18ed3c15e6b5 100644 > --- a/kernel/rcu/update.c > +++ b/kernel/rcu/update.c > @@ -224,13 +224,50 @@ void rcu_unexpedite_gp(void) > } > EXPORT_SYMBOL_GPL(rcu_unexpedite_gp); > > +/* > + * Minimum time in milliseconds from the start boot until RCU can consider > + * in-kernel boot as completed. This can also be tuned at runtime to end the > + * boot earlier, by userspace init code writing the time in milliseconds (even > + * 0) to: /sys/module/rcupdate/parameters/rcu_boot_end_delay. The sysfs node > + * can also be used to extend the delay to be larger than the default, assuming > + * the marking of boot complete has not yet occurred. > + */ > +static int rcu_boot_end_delay = CONFIG_RCU_BOOT_END_DELAY; > + > static bool rcu_boot_ended __read_mostly; > +static bool rcu_boot_end_called __read_mostly; > +static DEFINE_MUTEX(rcu_boot_end_lock); > > /* > - * Inform RCU of the end of the in-kernel boot sequence. > + * Inform RCU of the end of the in-kernel boot sequence. The boot sequence will > + * not be marked ended until at least rcu_boot_end_delay milliseconds have passed. > */ > -void rcu_end_inkernel_boot(void) > +void rcu_end_inkernel_boot(void); > +static void rcu_boot_end_work_fn(struct work_struct *work) > +{ > + rcu_end_inkernel_boot(); > +} > +static DECLARE_DELAYED_WORK(rcu_boot_end_work, rcu_boot_end_work_fn); > + > +/* Must be called with rcu_boot_end_lock held. 
*/ > +static void rcu_end_inkernel_boot_locked(void) > { > + rcu_boot_end_called = true; > + > + if (rcu_boot_ended) > + return; > + > + if (rcu_boot_end_delay) { > + u64 boot_ms = div_u64(ktime_get_boot_fast_ns(), 1000000UL); > + > + if (boot_ms < rcu_boot_end_delay) { > + schedule_delayed_work(&rcu_boot_end_work, > + rcu_boot_end_delay - boot_ms); > + return; > + } > + } > + > + cancel_delayed_work(&rcu_boot_end_work); > rcu_unexpedite_gp(); > rcu_async_relax(); > if (rcu_normal_after_boot) > @@ -238,6 +275,39 @@ void rcu_end_inkernel_boot(void) > rcu_boot_ended = true; > } > > +void rcu_end_inkernel_boot(void) > +{ > + mutex_lock(&rcu_boot_end_lock); > + rcu_end_inkernel_boot_locked(); > + mutex_unlock(&rcu_boot_end_lock); > +} > + > +static int param_set_rcu_boot_end(const char *val, const struct kernel_param *kp) > +{ > + uint end_ms; > + int ret = kstrtouint(val, 0, &end_ms); > + > + if (ret) > + return ret; > + /* > + * rcu_end_inkernel_boot() should be called at least once during init > + * before we can allow param changes to end the boot. > + */ > + mutex_lock(&rcu_boot_end_lock); > + rcu_boot_end_delay = end_ms; > + if (!rcu_boot_ended && rcu_boot_end_called) { > + rcu_end_inkernel_boot_locked(); > + } > + mutex_unlock(&rcu_boot_end_lock); > + return ret; > +} > + > +static const struct kernel_param_ops rcu_boot_end_ops = { > + .set = param_set_rcu_boot_end, > + .get = param_get_uint, > +}; > +module_param_cb(rcu_boot_end_delay, &rcu_boot_end_ops, &rcu_boot_end_delay, 0644); > + > /* > * Let rcutorture know when it is OK to turn it up to eleven. > */ > -- > 2.40.0.rc0.216.gc4246ad0f0-goog >
On Sat, Mar 11, 2023 at 12:44:53PM -0800, Paul E. McKenney wrote: > On Sat, Mar 04, 2023 at 04:51:45AM +0000, Joel Fernandes wrote: > > Hi Paul, > > > > On Fri, Mar 03, 2023 at 05:02:51PM -0800, Paul E. McKenney wrote: > > [..] > > > > Qiuxu also noted impressive boot-time improvements with earlier version > > > > of patch. An excerpt from the data he shared: > > Now that we have the measurement methodology put to bed... > > [ . . . ] > > > > Mightn't this be simpler if the user was only permitted to write zero, > > > thus just saying "stop immediately"? If people really need the ability > > > to extend or shorten the time, a patch can be produced at that point. > > > And then a non-zero write to the file would become legal. > > > > I prefer to keep it this way as with this method, I can not only get to > > have variable rcu_boot_end_delay via boot parameter (as in my first patch), I > > also don't need to add a separate sysfs entry, and can just reuse > > 'rcu_boot_end_delay' parameter, which I also had in my first patch. And > > adding yet another sysfs parameter will actually complicate it even more and > > add more lines of code. > > > > I tested difference scenarios and it works fine, though I missed that > > mutex locking unfortunately, I did verify different test cases work as > > expected by manual testing. > > Except that you don't need that extra sysfs value. You could instead use > any of a number of state variables that tell you that early boot is done. > If the state says early boot (as in parsing the kernel command line), > make the code act as it does now. Otherwise, make it accept only zero. > > If there really is some system that wants to set one time limit via > the kernel boot parameter and set another at some time during boot, > there are very simple userspace facilities to make this happen. > > And there is also a smaller state space and less testing to be done, > benefits which accrue on an ongoing basis. 
Ok, thanks for the suggestion and I will consider it when/if posting the next revision of this idea. I got strong pushback from Frederic, Vlad and Steven Rostedt on doing the timeout-based thing, so currently I am analyzing the boot process more to see if it could be optimized instead. I tend to agree with them now also because this feature is new and there could be bugs that this patch might hide..

thanks,

 - Joel

>							Thanx, Paul

[ . . . ]
On Sat, Mar 11, 2023 at 10:23:54PM +0000, Joel Fernandes wrote:
> On Sat, Mar 11, 2023 at 12:44:53PM -0800, Paul E. McKenney wrote:
> > On Sat, Mar 04, 2023 at 04:51:45AM +0000, Joel Fernandes wrote:

[ . . . ]
> >
> > And there is also a smaller state space and less testing to be done,
> > benefits which accrue on an ongoing basis.
>
> Ok, thanks for the suggestion and I will consider it when/if posting the next
> revision of this idea. I got strong pushback from Frederic, Vlad and Steven
> Rostedt on doing the timeout-based thing, so currently I am analyzing the
> boot process more to see if it could be optimized instead. I tend to agree
> with them now also because this feature is new and there could be bugs that
> this patch might hide..

Agreed, fixing underlying causes is even better.

							Thanx, Paul
On Fri, Mar 10, 2023 at 10:24:34PM -0800, Paul E. McKenney wrote: > On Fri, Mar 10, 2023 at 09:55:02AM +0100, Uladzislau Rezki wrote: > > On Thu, Mar 09, 2023 at 10:10:56PM +0000, Joel Fernandes wrote: > > > On Thu, Mar 09, 2023 at 01:57:42PM +0100, Uladzislau Rezki wrote: > > > [..] > > > > > > > > > See this commit: > > > > > > > > > > > > > > > > > > 3705b88db0d7cc ("rcu: Add a module parameter to force use of > > > > > > > > > expedited RCU primitives") > > > > > > > > > > > > > > > > > > Antti provided this commit precisely in order to allow Android > > > > > > > > > devices to expedite the boot process and to shut off the > > > > > > > > > expediting at a time of Android userspace's choosing. So Android > > > > > > > > > has been making this work for about ten years, which strikes me > > > > > > > > > as an adequate proof of concept. ;-) > > > > > > > > > > > > > > > > Thanks for the pointer. That's true. Looking at Android sources, I > > > > > > > > find that Android Mediatek devices at least are setting > > > > > > > > rcu_expedited to 1 at late stage of their userspace boot (which is > > > > > > > > weird, it should be set to 1 as early as possible), and > > > > > > > > interestingly I cannot find them resetting it back to 0!. Maybe > > > > > > > > they set rcu_normal to 1? But I cannot find that either. Vlad? :P > > > > > > > > > > > > > > Interesting. Though this is consistent with Antti's commit log, > > > > > > > where he talks about expediting grace periods but not unexpediting > > > > > > > them. > > > > > > > > > > > > > Do you think we need to unexpedite it? :)))) > > > > > > > > > > Android runs on smallish systems, so quite possibly not! > > > > > > > > > We keep it enabled and never unexpedite it. The reason is a performance. I > > > > have done some app-launch time analysis with enabling and disabling of it. > > > > > > > > An expedited case is much better when it comes to app launch time. 
> > > > It requires ~25% less time to run an app comparing with unexpedited variant.
> > > > So we have a big gain here.
> > >
> > > Wow, that's huge. I wonder if you can dig deeper and find out why that is so
> > > as the callbacks may need to be synchronize_rcu_expedited() then, as it could
> > > be slowing down other usecases! I find it hard to believe, real-time
> > > workloads will run better without those callbacks being always-expedited if
> > > it actually gives back 25% in performance!
> > >
> > I can dig further, but on a high level i think there are some spots
> > which show better performance if expedited is set. I mean synchronize_rcu()
> > becomes as "less blocking a context" from a time point of view.
> >
> > The problem of a regular synchronize_rcu() is - it can trigger a big latency
> > delays for a caller. For example for nocb case we do not know where in a list
> > our callback is located and when it is invoked to unblock a caller.
>
> True, expedited RCU grace periods do not have this callback-invocation
> delay that normal RCU does.
>
> > I have already mentioned somewhere. Probably it makes sense to directly wake-up
> > callers from the GP kthread instead and not via nocb-kthread that invokes our callbacks
> > one by one.
>
> Makes sense, but it is necessary to be careful. Wakeups are not fast,
> so making the RCU grace-period kthread do them all sequentially is not
> a strategy to win. For example, note that the next expedited grace
> period can start before the previous expedited grace period has finished
> its wakeups.
>
I have done a small and quick prototype:

<snip>
diff --git a/include/linux/rcupdate_wait.h b/include/linux/rcupdate_wait.h
index 699b938358bf..e1a4cca9a208 100644
--- a/include/linux/rcupdate_wait.h
+++ b/include/linux/rcupdate_wait.h
@@ -9,6 +9,8 @@
 #include <linux/rcupdate.h>
 #include <linux/completion.h>
 
+extern struct llist_head gp_wait_llist;
+
 /*
  * Structure allowing asynchronous waiting on RCU.
  */
diff --git a/kernel/rcu/tree.c b/kernel/rcu/tree.c
index ee27a03d7576..50b81ca54104 100644
--- a/kernel/rcu/tree.c
+++ b/kernel/rcu/tree.c
@@ -113,6 +113,9 @@ int rcu_num_lvls __read_mostly = RCU_NUM_LVLS;
 int num_rcu_lvl[] = NUM_RCU_LVL_INIT;
 int rcu_num_nodes __read_mostly = NUM_RCU_NODES; /* Total # rcu_nodes in use. */
 
+/* Waiters for a GP kthread. */
+LLIST_HEAD(gp_wait_llist);
+
 /*
  * The rcu_scheduler_active variable is initialized to the value
  * RCU_SCHEDULER_INACTIVE and transitions RCU_SCHEDULER_INIT just before the
@@ -1776,6 +1779,14 @@ static noinline void rcu_gp_cleanup(void)
 		on_each_cpu(rcu_strict_gp_boundary, NULL, 0);
 }
 
+static void rcu_notify_gp_end(struct llist_node *llist)
+{
+	struct llist_node *rcu, *next;
+
+	llist_for_each_safe(rcu, next, llist)
+		complete(&((struct rcu_synchronize *) rcu)->completion);
+}
+
 /*
  * Body of kthread that handles grace periods.
  */
@@ -1811,6 +1822,9 @@ static int __noreturn rcu_gp_kthread(void *unused)
 		WRITE_ONCE(rcu_state.gp_state, RCU_GP_CLEANUP);
 		rcu_gp_cleanup();
 		WRITE_ONCE(rcu_state.gp_state, RCU_GP_CLEANED);
+
+		/* Wake up all users. */
+		rcu_notify_gp_end(llist_del_all(&gp_wait_llist));
 	}
 }
 
diff --git a/kernel/rcu/update.c b/kernel/rcu/update.c
index 19bf6fa3ee6a..1de7c328a3e5 100644
--- a/kernel/rcu/update.c
+++ b/kernel/rcu/update.c
@@ -426,7 +426,10 @@ void __wait_rcu_gp(bool checktiny, int n, call_rcu_func_t *crcu_array,
 		if (j == i) {
 			init_rcu_head_on_stack(&rs_array[i].head);
 			init_completion(&rs_array[i].completion);
-			(crcu_array[i])(&rs_array[i].head, wakeme_after_rcu);
+
+			/* Kick a grace period if needed. */
+			(void) start_poll_synchronize_rcu();
+			llist_add((struct llist_node *) &rs_array[i].head, &gp_wait_llist);
 		}
 	}
<snip>

and did some experiments in terms of performance and comparison.
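The core idea of the prototype is to replace per-callback wakeups with one batched pass: each waiter pushes itself onto a lock-free llist, and the grace-period kthread detaches the whole list in one step and completes every waiter. A rough userspace model of that pattern (pthread condition variables stand in for kernel completions, a C11 Treiber-style stack stands in for llist_add()/llist_del_all(); all names are invented for illustration):

```c
#include <pthread.h>
#include <stdatomic.h>
#include <stddef.h>

/* One entry per blocked caller, like struct rcu_synchronize on the llist. */
struct waiter {
	struct waiter *next;
	pthread_mutex_t lock;
	pthread_cond_t cv;
	int done;
};

/* Shared list head, like gp_wait_llist in the prototype. */
static _Atomic(struct waiter *) gp_wait_list;

void waiter_init(struct waiter *w)
{
	w->next = NULL;
	pthread_mutex_init(&w->lock, NULL);
	pthread_cond_init(&w->cv, NULL);
	w->done = 0;
}

/* llist_add() analogue: lock-free push onto the shared list. */
void wait_list_add(struct waiter *w)
{
	struct waiter *head = atomic_load(&gp_wait_list);

	do {
		w->next = head;
	} while (!atomic_compare_exchange_weak(&gp_wait_list, &head, w));
}

/* GP-kthread side: llist_del_all() detaches everything at once, then each
 * detached waiter is completed in a single pass. */
void notify_gp_end(void)
{
	struct waiter *w = atomic_exchange(&gp_wait_list, NULL);

	while (w) {
		struct waiter *next = w->next;

		pthread_mutex_lock(&w->lock);
		w->done = 1;
		pthread_cond_signal(&w->cv);
		pthread_mutex_unlock(&w->lock);
		w = next;
	}
}

/* synchronize_rcu() analogue: enqueue, then block until notified. */
void wait_for_gp(struct waiter *w)
{
	wait_list_add(w);
	pthread_mutex_lock(&w->lock);
	while (!w->done)
		pthread_cond_wait(&w->cv, &w->lock);
	pthread_mutex_unlock(&w->lock);
}
```

Note this sketch only shows the list mechanics; it deliberately ignores Joel's objection below about a waiter being added while a grace period is already in flight, which the real code would have to handle with a grace-period sequence check.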
A test case is:

thread_X:
	synchronize_rcu();
	kfree(ptr);

below are results with running 10 parallel workers running 1000 times of
mentioned test scenario:

# default(NOCB)
[   29.322944] Summary: kvfree_rcu_1_arg_vmalloc_test loops: 1000 avg: 17286604 usec
[   29.325759] All test took worker0=63964052068 cycles
[   29.327255] Summary: kvfree_rcu_1_arg_vmalloc_test loops: 1000 avg: 23414575 usec
[   29.329974] All test took worker1=86638822563 cycles
[   29.331460] Summary: kvfree_rcu_1_arg_vmalloc_test loops: 1000 avg: 23357988 usec
[   29.334205] All test took worker2=86429439193 cycles
[   29.350808] Summary: kvfree_rcu_1_arg_vmalloc_test loops: 1000 avg: 17174001 usec
[   29.353553] All test took worker3=63547397954 cycles
[   29.355039] Summary: kvfree_rcu_1_arg_vmalloc_test loops: 1000 avg: 17141904 usec
[   29.357770] All test took worker4=63428630877 cycles
[   29.374831] Summary: kvfree_rcu_1_arg_vmalloc_test loops: 1000 avg: 23397952 usec
[   29.377577] All test took worker5=86577316353 cycles
[   29.398809] Summary: kvfree_rcu_1_arg_vmalloc_test loops: 1000 avg: 17142038 usec
[   29.401549] All test took worker6=63429124938 cycles
[   29.414828] Summary: kvfree_rcu_1_arg_vmalloc_test loops: 1000 avg: 17158248 usec
[   29.417574] All test took worker7=63489107118 cycles
[   29.438811] Summary: kvfree_rcu_1_arg_vmalloc_test loops: 1000 avg: 18102109 usec
[   29.441550] All test took worker8=66981588881 cycles
[   29.462826] Summary: kvfree_rcu_1_arg_vmalloc_test loops: 1000 avg: 23446042 usec
[   29.465561] All test took worker9=86755258455 cycles

# patch(NOCB)
[   14.720986] Summary: kvfree_rcu_1_arg_vmalloc_test loops: 1000 avg: 8837883 usec
[   14.723753] All test took worker0=32702015768 cycles
[   14.740386] Summary: kvfree_rcu_1_arg_vmalloc_test loops: 1000 avg: 8837750 usec
[   14.743076] All test took worker1=32701525814 cycles
[   14.760350] Summary: kvfree_rcu_1_arg_vmalloc_test loops: 1000 avg: 8837734 usec
[   14.763036] All test took worker2=32701466281 cycles
[   14.780369] Summary: kvfree_rcu_1_arg_vmalloc_test loops: 1000 avg: 8837707 usec
[   14.783057] All test took worker3=32701364901 cycles
[   14.800352] Summary: kvfree_rcu_1_arg_vmalloc_test loops: 1000 avg: 8837730 usec
[   14.803041] All test took worker4=32701449927 cycles
[   14.820355] Summary: kvfree_rcu_1_arg_vmalloc_test loops: 1000 avg: 8837724 usec
[   14.823048] All test took worker5=32701428134 cycles
[   14.840359] Summary: kvfree_rcu_1_arg_vmalloc_test loops: 1000 avg: 8837705 usec
[   14.843052] All test took worker6=32701356465 cycles
[   14.860322] Summary: kvfree_rcu_1_arg_vmalloc_test loops: 1000 avg: 8837742 usec
[   14.863005] All test took worker7=32701494475 cycles
[   14.880363] Summary: kvfree_rcu_1_arg_vmalloc_test loops: 1000 avg: 8837750 usec
[   14.883081] All test took worker8=32701525074 cycles
[   14.900362] Summary: kvfree_rcu_1_arg_vmalloc_test loops: 1000 avg: 8837918 usec
[   14.903065] All test took worker9=32702145379 cycles

--
Uladzislau Rezki
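To put the two tables in perspective: the default NOCB workers average roughly 17.1 to 23.4 seconds per 1000 loops, while every patched worker sits near 8.84 seconds, i.e. roughly a 2.2x mean reduction in per-caller synchronize_rcu() wait time. A quick sanity check over the logged averages (values copied verbatim from the dmesg output above, in microseconds):

```c
/* Per-worker averages (usec) copied from the "# default(NOCB)" log. */
static const double def_avg[10] = {
	17286604, 23414575, 23357988, 17174001, 17141904,
	23397952, 17142038, 17158248, 18102109, 23446042,
};

/* Per-worker averages (usec) copied from the "# patch(NOCB)" log. */
static const double pat_avg[10] = {
	8837883, 8837750, 8837734, 8837707, 8837730,
	8837724, 8837705, 8837742, 8837750, 8837918,
};

/* Mean per-worker speedup of the prototype over the default run. */
double mean_speedup(void)
{
	double sum = 0;
	int i;

	for (i = 0; i < 10; i++)
		sum += def_avg[i] / pat_avg[i];
	return sum / 10;
}
```

The near-identical patched averages also hint that the batched wakeup removes the queue-position lottery the default nocb path suffers from.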
On Mon, Mar 13, 2023 at 10:51:39AM +0100, Uladzislau Rezki wrote:
> On Fri, Mar 10, 2023 at 10:24:34PM -0800, Paul E. McKenney wrote:
> > On Fri, Mar 10, 2023 at 09:55:02AM +0100, Uladzislau Rezki wrote:
> > > On Thu, Mar 09, 2023 at 10:10:56PM +0000, Joel Fernandes wrote:
> > > > On Thu, Mar 09, 2023 at 01:57:42PM +0100, Uladzislau Rezki wrote:
> [ . . . ]

A quick app launch test.
This is a camera app on our device:

urezki@pc636:~/data/yoshino_bin/scripts$ ./test-cam.sh
629
572
652
622
642
650
613
654
607
urezki@pc636:~/data/yoshino_bin/scripts$ adb shell
XQ-DQ54:/ $ su
XQ-DQ54:/ # echo 1 > /sy
sys/         system/      system_dlkm/ system_ext/
XQ-DQ54:/ # echo 1 > /sys/kernel/rc
rcu_expedited       rcu_improve_normal  rcu_normal
XQ-DQ54:/ # echo 1 > /sys/kernel/rcu_improve_normal
XQ-DQ54:/ # exit
XQ-DQ54:/ $ exit
urezki@pc636:~/data/yoshino_bin/scripts$ ./test-cam.sh
533
549
563
537
540
563
531
549
548
urezki@pc636:~/data/yoshino_bin/scripts$

The numbers are the time taken to launch the app, in milliseconds.

--
Uladzislau Rezki
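The transcript above is just shell redirection into RCU's /sys/kernel switches. The same toggle can be done from a program; a small sketch follows. The path is a parameter so the helper covers the mainline nodes (/sys/kernel/rcu_expedited, /sys/kernel/rcu_normal) as well as the rcu_improve_normal node, which, as Uladzislau notes further down, exists only in his local test tree:

```c
#include <stdio.h>

/* Write a 0/1 flag to a sysfs control file such as
 * /sys/kernel/rcu_expedited.  Returns 0 on success, -1 on failure
 * (e.g. missing node or insufficient privileges -- these files are
 * root-writable). */
int write_sysfs_flag(const char *path, int val)
{
	FILE *f = fopen(path, "w");

	if (!f)
		return -1;
	if (fprintf(f, "%d\n", val) < 0) {
		fclose(f);
		return -1;
	}
	return fclose(f) == 0 ? 0 : -1;
}
```

Typical use would be write_sysfs_flag("/sys/kernel/rcu_expedited", 1) early in userspace boot, mirroring what the Android images discussed earlier in the thread do from init scripts.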
> From: Uladzislau Rezki <urezki@gmail.com>
> [...]
> XQ-DQ54:/ # echo 1 > /sys/kernel/rc
> rcu_expedited       rcu_improve_normal  rcu_normal
> XQ-DQ54:/ # echo 1 > /sys/kernel/rcu_improve_normal

Hi Rezki,

I applied your prototype patch, but I did NOT find the sys-node:
"/sys/kernel/rcu_improve_normal" on my system.

What is this node used for? What am I missing? Thanks!

[ There were only "rcu_expedited" & "rcu_normal" sys nodes
  on my system. ]

-Qiuxu

> XQ-DQ54:/ # exit
> XQ-DQ54:/ $ exit
> urezki@pc636:~/data/yoshino_bin/scripts$ ./test-cam.sh
> 533
> 549
> 563
> 537
> 540
> 563
> 531
> 549
> 548
> urezki@pc636:~/data/yoshino_bin/scripts$
>
> the taken time to run an app in milliseconds.
>
> --
> Uladzislau Rezki
> On Mar 13, 2023, at 2:51 AM, Uladzislau Rezki <urezki@gmail.com> wrote:
>
> On Fri, Mar 10, 2023 at 10:24:34PM -0800, Paul E. McKenney wrote:
> [ . . . ]
>
> +static void rcu_notify_gp_end(struct llist_node *llist)
> +{
> +	struct llist_node *rcu, *next;
> +
> +	llist_for_each_safe(rcu, next, llist)
> +		complete(&((struct rcu_synchronize *) rcu)->completion);

This looks broken to me, so the synchronize will complete even
if it was called in the middle of an ongoing GP?

Thanks,

- Joel

> [ . . . ]
On Mon, Mar 13, 2023 at 01:48:18PM +0000, Zhuo, Qiuxu wrote:
> > From: Uladzislau Rezki <urezki@gmail.com>
> > [...]
> > XQ-DQ54:/ # echo 1 > /sys/kernel/rc
> > rcu_expedited       rcu_improve_normal  rcu_normal
> > XQ-DQ54:/ # echo 1 > /sys/kernel/rcu_improve_normal
>
> Hi Rezki,
>
> I applied your prototype patch, but I did NOT find the sys-node:
> "/sys/kernel/rcu_improve_normal" on my system.
>
> What is this node used for? What am I missing? Thanks!
>
> [ There were only "rcu_expedited" & "rcu_normal" sys nodes
>   on my system. ]
>
The prototype I posted does not have such a helper; I added it just for my
local tests.

--
Uladzislau Rezki
On Mon, Mar 13, 2023 at 06:58:30AM -0700, Joel Fernandes wrote: > > > > On Mar 13, 2023, at 2:51 AM, Uladzislau Rezki <urezki@gmail.com> wrote: > > > > On Fri, Mar 10, 2023 at 10:24:34PM -0800, Paul E. McKenney wrote: > >>> On Fri, Mar 10, 2023 at 09:55:02AM +0100, Uladzislau Rezki wrote: > >>> On Thu, Mar 09, 2023 at 10:10:56PM +0000, Joel Fernandes wrote: > >>>> On Thu, Mar 09, 2023 at 01:57:42PM +0100, Uladzislau Rezki wrote: > >>>> [..] > >>>>>>>>>> See this commit: > >>>>>>>>>> > >>>>>>>>>> 3705b88db0d7cc ("rcu: Add a module parameter to force use of > >>>>>>>>>> expedited RCU primitives") > >>>>>>>>>> > >>>>>>>>>> Antti provided this commit precisely in order to allow Android > >>>>>>>>>> devices to expedite the boot process and to shut off the > >>>>>>>>>> expediting at a time of Android userspace's choosing. So Android > >>>>>>>>>> has been making this work for about ten years, which strikes me > >>>>>>>>>> as an adequate proof of concept. ;-) > >>>>>>>>> > >>>>>>>>> Thanks for the pointer. That's true. Looking at Android sources, I > >>>>>>>>> find that Android Mediatek devices at least are setting > >>>>>>>>> rcu_expedited to 1 at late stage of their userspace boot (which is > >>>>>>>>> weird, it should be set to 1 as early as possible), and > >>>>>>>>> interestingly I cannot find them resetting it back to 0!. Maybe > >>>>>>>>> they set rcu_normal to 1? But I cannot find that either. Vlad? :P > >>>>>>>> > >>>>>>>> Interesting. Though this is consistent with Antti's commit log, > >>>>>>>> where he talks about expediting grace periods but not unexpediting > >>>>>>>> them. > >>>>>>>> > >>>>>>> Do you think we need to unexpedite it? :)))) > >>>>>> > >>>>>> Android runs on smallish systems, so quite possibly not! > >>>>>> > >>>>> We keep it enabled and never unexpedite it. The reason is a performance. I > >>>>> have done some app-launch time analysis with enabling and disabling of it. 
> >>>>> > >>>>> An expedited case is much better when it comes to app launch time. It > >>>>> requires ~25% less time to run an app comparing with unexpedited variant. > >>>>> So we have a big gain here. > >>>> > >>>> Wow, that's huge. I wonder if you can dig deeper and find out why that is so > >>>> as the callbacks may need to be synchronize_rcu_expedited() then, as it could > >>>> be slowing down other usecases! I find it hard to believe, real-time > >>>> workloads will run better without those callbacks being always-expedited if > >>>> it actually gives back 25% in performance! > >>>> > >>> I can dig further, but on a high level i think there are some spots > >>> which show better performance if expedited is set. I mean synchronize_rcu() > >>> becomes as "less blocking a context" from a time point of view. > >>> > >>> The problem of a regular synchronize_rcu() is - it can trigger a big latency > >>> delays for a caller. For example for nocb case we do not know where in a list > >>> our callback is located and when it is invoked to unblock a caller. > >> > >> True, expedited RCU grace periods do not have this callback-invocation > >> delay that normal RCU does. > >> > >>> I have already mentioned somewhere. Probably it makes sense to directly wake-up > >>> callers from the GP kthread instead and not via nocb-kthread that invokes our callbacks > >>> one by one. > >> > >> Makes sense, but it is necessary to be careful. Wakeups are not fast, > >> so making the RCU grace-period kthread do them all sequentially is not > >> a strategy to win. For example, note that the next expedited grace > >> period can start before the previous expedited grace period has finished > >> its wakeups. 
> >> > > I hove done a small and quick prototype: > > > > <snip> > > diff --git a/include/linux/rcupdate_wait.h b/include/linux/rcupdate_wait.h > > index 699b938358bf..e1a4cca9a208 100644 > > --- a/include/linux/rcupdate_wait.h > > +++ b/include/linux/rcupdate_wait.h > > @@ -9,6 +9,8 @@ > > #include <linux/rcupdate.h> > > #include <linux/completion.h> > > > > +extern struct llist_head gp_wait_llist; > > + > > /* > > * Structure allowing asynchronous waiting on RCU. > > */ > > diff --git a/kernel/rcu/tree.c b/kernel/rcu/tree.c > > index ee27a03d7576..50b81ca54104 100644 > > --- a/kernel/rcu/tree.c > > +++ b/kernel/rcu/tree.c > > @@ -113,6 +113,9 @@ int rcu_num_lvls __read_mostly = RCU_NUM_LVLS; > > int num_rcu_lvl[] = NUM_RCU_LVL_INIT; > > int rcu_num_nodes __read_mostly = NUM_RCU_NODES; /* Total # rcu_nodes in use. */ > > > > +/* Waiters for a GP kthread. */ > > +LLIST_HEAD(gp_wait_llist); > > + > > /* > > * The rcu_scheduler_active variable is initialized to the value > > * RCU_SCHEDULER_INACTIVE and transitions RCU_SCHEDULER_INIT just before the > > @@ -1776,6 +1779,14 @@ static noinline void rcu_gp_cleanup(void) > > on_each_cpu(rcu_strict_gp_boundary, NULL, 0); > > } > > > > +static void rcu_notify_gp_end(struct llist_node *llist) > > +{ > > + struct llist_node *rcu, *next; > > + > > + llist_for_each_safe(rcu, next, llist) > > + complete(&((struct rcu_synchronize *) rcu)->completion); > > This looks broken to me, so the synchronize will complete even > if it was called in the middle of an ongoing GP? > Do you mean that before replacing the list (and after rcu_gp_cleanup()) a new GP sequence can be initiated? -- Uladzislau Rezki
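The prototype above is easiest to see in miniature. Below is an illustrative user-space model, not the kernel patch itself: waiters push themselves onto a shared list (llist_add() in the patch), and at grace-period end the whole list is detached (llist_del_all()) and every waiter is completed (complete()). All helper and type names here are invented for the sketch, and the lock-free llist/completion machinery is replaced with plain single-threaded C.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>

/* Stand-in for struct rcu_synchronize chained through gp_wait_llist. */
struct waiter {
	struct waiter *next;
	bool completed;
};

static struct waiter *gp_wait_list;

/* Caller parks itself on the list (llist_add() in the patch). */
static void enqueue_waiter(struct waiter *w)
{
	w->completed = false;
	w->next = gp_wait_list;
	gp_wait_list = w;
}

/* GP-end notifier: detach the whole list, then complete each waiter. */
static void rcu_notify_gp_end_model(void)
{
	struct waiter *w = gp_wait_list;

	gp_wait_list = NULL;		/* models llist_del_all() */
	while (w) {
		struct waiter *next = w->next;

		w->completed = true;	/* stands in for complete() */
		w = next;
	}
}
```

As the rest of the thread discusses, completing every parked waiter at the end of whichever GP happens to finish next is exactly the part that needs more care.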
On Mon, Mar 13, 2023 at 11:32 AM Uladzislau Rezki <urezki@gmail.com> wrote: > > On Mon, Mar 13, 2023 at 06:58:30AM -0700, Joel Fernandes wrote: > > > > > > > On Mar 13, 2023, at 2:51 AM, Uladzislau Rezki <urezki@gmail.com> wrote: > > > > > > On Fri, Mar 10, 2023 at 10:24:34PM -0800, Paul E. McKenney wrote: > > >>> On Fri, Mar 10, 2023 at 09:55:02AM +0100, Uladzislau Rezki wrote: > > >>> On Thu, Mar 09, 2023 at 10:10:56PM +0000, Joel Fernandes wrote: > > >>>> On Thu, Mar 09, 2023 at 01:57:42PM +0100, Uladzislau Rezki wrote: > > >>>> [..] > > >>>>>>>>>> See this commit: > > >>>>>>>>>> > > >>>>>>>>>> 3705b88db0d7cc ("rcu: Add a module parameter to force use of > > >>>>>>>>>> expedited RCU primitives") > > >>>>>>>>>> > > >>>>>>>>>> Antti provided this commit precisely in order to allow Android > > >>>>>>>>>> devices to expedite the boot process and to shut off the > > >>>>>>>>>> expediting at a time of Android userspace's choosing. So Android > > >>>>>>>>>> has been making this work for about ten years, which strikes me > > >>>>>>>>>> as an adequate proof of concept. ;-) > > >>>>>>>>> > > >>>>>>>>> Thanks for the pointer. That's true. Looking at Android sources, I > > >>>>>>>>> find that Android Mediatek devices at least are setting > > >>>>>>>>> rcu_expedited to 1 at late stage of their userspace boot (which is > > >>>>>>>>> weird, it should be set to 1 as early as possible), and > > >>>>>>>>> interestingly I cannot find them resetting it back to 0!. Maybe > > >>>>>>>>> they set rcu_normal to 1? But I cannot find that either. Vlad? :P > > >>>>>>>> > > >>>>>>>> Interesting. Though this is consistent with Antti's commit log, > > >>>>>>>> where he talks about expediting grace periods but not unexpediting > > >>>>>>>> them. > > >>>>>>>> > > >>>>>>> Do you think we need to unexpedite it? :)))) > > >>>>>> > > >>>>>> Android runs on smallish systems, so quite possibly not! > > >>>>>> > > >>>>> We keep it enabled and never unexpedite it. The reason is a performance. 
I > > >>>>> have done some app-launch time analysis with enabling and disabling of it. > > >>>>> > > >>>>> An expedited case is much better when it comes to app launch time. It > > >>>>> requires ~25% less time to run an app comparing with unexpedited variant. > > >>>>> So we have a big gain here. > > >>>> > > >>>> Wow, that's huge. I wonder if you can dig deeper and find out why that is so > > >>>> as the callbacks may need to be synchronize_rcu_expedited() then, as it could > > >>>> be slowing down other usecases! I find it hard to believe, real-time > > >>>> workloads will run better without those callbacks being always-expedited if > > >>>> it actually gives back 25% in performance! > > >>>> > > >>> I can dig further, but on a high level i think there are some spots > > >>> which show better performance if expedited is set. I mean synchronize_rcu() > > >>> becomes as "less blocking a context" from a time point of view. > > >>> > > >>> The problem of a regular synchronize_rcu() is - it can trigger a big latency > > >>> delays for a caller. For example for nocb case we do not know where in a list > > >>> our callback is located and when it is invoked to unblock a caller. > > >> > > >> True, expedited RCU grace periods do not have this callback-invocation > > >> delay that normal RCU does. > > >> > > >>> I have already mentioned somewhere. Probably it makes sense to directly wake-up > > >>> callers from the GP kthread instead and not via nocb-kthread that invokes our callbacks > > >>> one by one. > > >> > > >> Makes sense, but it is necessary to be careful. Wakeups are not fast, > > >> so making the RCU grace-period kthread do them all sequentially is not > > >> a strategy to win. For example, note that the next expedited grace > > >> period can start before the previous expedited grace period has finished > > >> its wakeups. 
> > >> > > > I hove done a small and quick prototype: > > > > > > <snip> > > > diff --git a/include/linux/rcupdate_wait.h b/include/linux/rcupdate_wait.h > > > index 699b938358bf..e1a4cca9a208 100644 > > > --- a/include/linux/rcupdate_wait.h > > > +++ b/include/linux/rcupdate_wait.h > > > @@ -9,6 +9,8 @@ > > > #include <linux/rcupdate.h> > > > #include <linux/completion.h> > > > > > > +extern struct llist_head gp_wait_llist; > > > + > > > /* > > > * Structure allowing asynchronous waiting on RCU. > > > */ > > > diff --git a/kernel/rcu/tree.c b/kernel/rcu/tree.c > > > index ee27a03d7576..50b81ca54104 100644 > > > --- a/kernel/rcu/tree.c > > > +++ b/kernel/rcu/tree.c > > > @@ -113,6 +113,9 @@ int rcu_num_lvls __read_mostly = RCU_NUM_LVLS; > > > int num_rcu_lvl[] = NUM_RCU_LVL_INIT; > > > int rcu_num_nodes __read_mostly = NUM_RCU_NODES; /* Total # rcu_nodes in use. */ > > > > > > +/* Waiters for a GP kthread. */ > > > +LLIST_HEAD(gp_wait_llist); > > > + > > > /* > > > * The rcu_scheduler_active variable is initialized to the value > > > * RCU_SCHEDULER_INACTIVE and transitions RCU_SCHEDULER_INIT just before the > > > @@ -1776,6 +1779,14 @@ static noinline void rcu_gp_cleanup(void) > > > on_each_cpu(rcu_strict_gp_boundary, NULL, 0); > > > } > > > > > > +static void rcu_notify_gp_end(struct llist_node *llist) > > > +{ > > > + struct llist_node *rcu, *next; > > > + > > > + llist_for_each_safe(rcu, next, llist) > > > + complete(&((struct rcu_synchronize *) rcu)->completion); > > > > This looks broken to me, so the synchronize will complete even > > if it was called in the middle of an ongoing GP? > > > Do you mean before replacing the list(and after rcu_gp_cleanup()) a new > GP sequence can be initiated? I guess I mean rcu_notify_gp_end() is called at the end of the current grace period, which might be the grace period which started _before_ the synchronize_rcu() was called. So the callback needs to be invoked after the end of the next grace period, not the current one. 
Did I miss some part of your patch that is handling this? thanks, - Joel
On Mon, Mar 13, 2023 at 11:49:58AM -0400, Joel Fernandes wrote: > On Mon, Mar 13, 2023 at 11:32 AM Uladzislau Rezki <urezki@gmail.com> wrote: > > > > On Mon, Mar 13, 2023 at 06:58:30AM -0700, Joel Fernandes wrote: > > > > > > > > > > On Mar 13, 2023, at 2:51 AM, Uladzislau Rezki <urezki@gmail.com> wrote: > > > > > > > > On Fri, Mar 10, 2023 at 10:24:34PM -0800, Paul E. McKenney wrote: > > > >>> On Fri, Mar 10, 2023 at 09:55:02AM +0100, Uladzislau Rezki wrote: > > > >>> On Thu, Mar 09, 2023 at 10:10:56PM +0000, Joel Fernandes wrote: > > > >>>> On Thu, Mar 09, 2023 at 01:57:42PM +0100, Uladzislau Rezki wrote: > > > >>>> [..] > > > >>>>>>>>>> See this commit: > > > >>>>>>>>>> > > > >>>>>>>>>> 3705b88db0d7cc ("rcu: Add a module parameter to force use of > > > >>>>>>>>>> expedited RCU primitives") > > > >>>>>>>>>> > > > >>>>>>>>>> Antti provided this commit precisely in order to allow Android > > > >>>>>>>>>> devices to expedite the boot process and to shut off the > > > >>>>>>>>>> expediting at a time of Android userspace's choosing. So Android > > > >>>>>>>>>> has been making this work for about ten years, which strikes me > > > >>>>>>>>>> as an adequate proof of concept. ;-) > > > >>>>>>>>> > > > >>>>>>>>> Thanks for the pointer. That's true. Looking at Android sources, I > > > >>>>>>>>> find that Android Mediatek devices at least are setting > > > >>>>>>>>> rcu_expedited to 1 at late stage of their userspace boot (which is > > > >>>>>>>>> weird, it should be set to 1 as early as possible), and > > > >>>>>>>>> interestingly I cannot find them resetting it back to 0!. Maybe > > > >>>>>>>>> they set rcu_normal to 1? But I cannot find that either. Vlad? :P > > > >>>>>>>> > > > >>>>>>>> Interesting. Though this is consistent with Antti's commit log, > > > >>>>>>>> where he talks about expediting grace periods but not unexpediting > > > >>>>>>>> them. > > > >>>>>>>> > > > >>>>>>> Do you think we need to unexpedite it? 
:)))) > > > >>>>>> > > > >>>>>> Android runs on smallish systems, so quite possibly not! > > > >>>>>> > > > >>>>> We keep it enabled and never unexpedite it. The reason is a performance. I > > > >>>>> have done some app-launch time analysis with enabling and disabling of it. > > > >>>>> > > > >>>>> An expedited case is much better when it comes to app launch time. It > > > >>>>> requires ~25% less time to run an app comparing with unexpedited variant. > > > >>>>> So we have a big gain here. > > > >>>> > > > >>>> Wow, that's huge. I wonder if you can dig deeper and find out why that is so > > > >>>> as the callbacks may need to be synchronize_rcu_expedited() then, as it could > > > >>>> be slowing down other usecases! I find it hard to believe, real-time > > > >>>> workloads will run better without those callbacks being always-expedited if > > > >>>> it actually gives back 25% in performance! > > > >>>> > > > >>> I can dig further, but on a high level i think there are some spots > > > >>> which show better performance if expedited is set. I mean synchronize_rcu() > > > >>> becomes as "less blocking a context" from a time point of view. > > > >>> > > > >>> The problem of a regular synchronize_rcu() is - it can trigger a big latency > > > >>> delays for a caller. For example for nocb case we do not know where in a list > > > >>> our callback is located and when it is invoked to unblock a caller. > > > >> > > > >> True, expedited RCU grace periods do not have this callback-invocation > > > >> delay that normal RCU does. > > > >> > > > >>> I have already mentioned somewhere. Probably it makes sense to directly wake-up > > > >>> callers from the GP kthread instead and not via nocb-kthread that invokes our callbacks > > > >>> one by one. > > > >> > > > >> Makes sense, but it is necessary to be careful. Wakeups are not fast, > > > >> so making the RCU grace-period kthread do them all sequentially is not > > > >> a strategy to win. 
For example, note that the next expedited grace > > > >> period can start before the previous expedited grace period has finished > > > >> its wakeups. > > > >> > > > > I hove done a small and quick prototype: > > > > > > > > <snip> > > > > diff --git a/include/linux/rcupdate_wait.h b/include/linux/rcupdate_wait.h > > > > index 699b938358bf..e1a4cca9a208 100644 > > > > --- a/include/linux/rcupdate_wait.h > > > > +++ b/include/linux/rcupdate_wait.h > > > > @@ -9,6 +9,8 @@ > > > > #include <linux/rcupdate.h> > > > > #include <linux/completion.h> > > > > > > > > +extern struct llist_head gp_wait_llist; > > > > + > > > > /* > > > > * Structure allowing asynchronous waiting on RCU. > > > > */ > > > > diff --git a/kernel/rcu/tree.c b/kernel/rcu/tree.c > > > > index ee27a03d7576..50b81ca54104 100644 > > > > --- a/kernel/rcu/tree.c > > > > +++ b/kernel/rcu/tree.c > > > > @@ -113,6 +113,9 @@ int rcu_num_lvls __read_mostly = RCU_NUM_LVLS; > > > > int num_rcu_lvl[] = NUM_RCU_LVL_INIT; > > > > int rcu_num_nodes __read_mostly = NUM_RCU_NODES; /* Total # rcu_nodes in use. */ > > > > > > > > +/* Waiters for a GP kthread. */ > > > > +LLIST_HEAD(gp_wait_llist); > > > > + > > > > /* > > > > * The rcu_scheduler_active variable is initialized to the value > > > > * RCU_SCHEDULER_INACTIVE and transitions RCU_SCHEDULER_INIT just before the > > > > @@ -1776,6 +1779,14 @@ static noinline void rcu_gp_cleanup(void) > > > > on_each_cpu(rcu_strict_gp_boundary, NULL, 0); > > > > } > > > > > > > > +static void rcu_notify_gp_end(struct llist_node *llist) > > > > +{ > > > > + struct llist_node *rcu, *next; > > > > + > > > > + llist_for_each_safe(rcu, next, llist) > > > > + complete(&((struct rcu_synchronize *) rcu)->completion); > > > > > > This looks broken to me, so the synchronize will complete even > > > if it was called in the middle of an ongoing GP? > > > > > Do you mean before replacing the list(and after rcu_gp_cleanup()) a new > > GP sequence can be initiated? 
> > I guess I mean rcu_notify_gp_end() is called at the end of the current > grace period, which might be the grace period which started _before_ > the synchronize_rcu() was called. So the callback needs to be invoked > after the end of the next grace period, not the current one. > > Did I miss some part of your patch that is handling this? > No, you did not! That was my fault in placing llist_del_all() into an inappropriate place. We have to guarantee a full grace period. But this is a prototype and a kind of kick-off :) -- Uladzislau Rezki
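The ordering requirement being discussed can be sketched with two bare counters (an editor's model with invented names, not the kernel's gp_seq encoding): a waiter that enqueues while grace period N is already in flight must not be woken when N ends, because N may have begun before the caller's updates; it needs a grace period that starts only after the enqueue.

```c
#include <assert.h>

static unsigned long gp_started;	/* grace periods begun so far */
static unsigned long gp_completed;	/* grace periods ended so far */

/* Snapshot taken at enqueue time: the next GP to start. */
static unsigned long waiter_target(void)
{
	return gp_started + 1;
}

/* May this waiter be completed yet? Only once a GP that began
 * after the enqueue has fully ended. */
static int waiter_may_wake(unsigned long target)
{
	return gp_completed >= target;
}
```

Draining the wait list at the end of the in-flight GP corresponds to waking a waiter before its target, which is the bug identified above.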
On Mon, Mar 13, 2023 at 07:12:07PM +0100, Uladzislau Rezki wrote: > On Mon, Mar 13, 2023 at 11:49:58AM -0400, Joel Fernandes wrote: > > On Mon, Mar 13, 2023 at 11:32 AM Uladzislau Rezki <urezki@gmail.com> wrote: > > > > > > On Mon, Mar 13, 2023 at 06:58:30AM -0700, Joel Fernandes wrote: > > > > > > > > > > > > > On Mar 13, 2023, at 2:51 AM, Uladzislau Rezki <urezki@gmail.com> wrote: > > > > > > > > > > On Fri, Mar 10, 2023 at 10:24:34PM -0800, Paul E. McKenney wrote: > > > > >>> On Fri, Mar 10, 2023 at 09:55:02AM +0100, Uladzislau Rezki wrote: > > > > >>> On Thu, Mar 09, 2023 at 10:10:56PM +0000, Joel Fernandes wrote: > > > > >>>> On Thu, Mar 09, 2023 at 01:57:42PM +0100, Uladzislau Rezki wrote: > > > > >>>> [..] > > > > >>>>>>>>>> See this commit: > > > > >>>>>>>>>> > > > > >>>>>>>>>> 3705b88db0d7cc ("rcu: Add a module parameter to force use of > > > > >>>>>>>>>> expedited RCU primitives") > > > > >>>>>>>>>> > > > > >>>>>>>>>> Antti provided this commit precisely in order to allow Android > > > > >>>>>>>>>> devices to expedite the boot process and to shut off the > > > > >>>>>>>>>> expediting at a time of Android userspace's choosing. So Android > > > > >>>>>>>>>> has been making this work for about ten years, which strikes me > > > > >>>>>>>>>> as an adequate proof of concept. ;-) > > > > >>>>>>>>> > > > > >>>>>>>>> Thanks for the pointer. That's true. Looking at Android sources, I > > > > >>>>>>>>> find that Android Mediatek devices at least are setting > > > > >>>>>>>>> rcu_expedited to 1 at late stage of their userspace boot (which is > > > > >>>>>>>>> weird, it should be set to 1 as early as possible), and > > > > >>>>>>>>> interestingly I cannot find them resetting it back to 0!. Maybe > > > > >>>>>>>>> they set rcu_normal to 1? But I cannot find that either. Vlad? :P > > > > >>>>>>>> > > > > >>>>>>>> Interesting. 
Though this is consistent with Antti's commit log, > > > > >>>>>>>> where he talks about expediting grace periods but not unexpediting > > > > >>>>>>>> them. > > > > >>>>>>>> > > > > >>>>>>> Do you think we need to unexpedite it? :)))) > > > > >>>>>> > > > > >>>>>> Android runs on smallish systems, so quite possibly not! > > > > >>>>>> > > > > >>>>> We keep it enabled and never unexpedite it. The reason is a performance. I > > > > >>>>> have done some app-launch time analysis with enabling and disabling of it. > > > > >>>>> > > > > >>>>> An expedited case is much better when it comes to app launch time. It > > > > >>>>> requires ~25% less time to run an app comparing with unexpedited variant. > > > > >>>>> So we have a big gain here. > > > > >>>> > > > > >>>> Wow, that's huge. I wonder if you can dig deeper and find out why that is so > > > > >>>> as the callbacks may need to be synchronize_rcu_expedited() then, as it could > > > > >>>> be slowing down other usecases! I find it hard to believe, real-time > > > > >>>> workloads will run better without those callbacks being always-expedited if > > > > >>>> it actually gives back 25% in performance! > > > > >>>> > > > > >>> I can dig further, but on a high level i think there are some spots > > > > >>> which show better performance if expedited is set. I mean synchronize_rcu() > > > > >>> becomes as "less blocking a context" from a time point of view. > > > > >>> > > > > >>> The problem of a regular synchronize_rcu() is - it can trigger a big latency > > > > >>> delays for a caller. For example for nocb case we do not know where in a list > > > > >>> our callback is located and when it is invoked to unblock a caller. > > > > >> > > > > >> True, expedited RCU grace periods do not have this callback-invocation > > > > >> delay that normal RCU does. > > > > >> > > > > >>> I have already mentioned somewhere. 
Probably it makes sense to directly wake-up > > > > >>> callers from the GP kthread instead and not via nocb-kthread that invokes our callbacks > > > > >>> one by one. > > > > >> > > > > >> Makes sense, but it is necessary to be careful. Wakeups are not fast, > > > > >> so making the RCU grace-period kthread do them all sequentially is not > > > > >> a strategy to win. For example, note that the next expedited grace > > > > >> period can start before the previous expedited grace period has finished > > > > >> its wakeups. > > > > >> > > > > > I hove done a small and quick prototype: > > > > > > > > > > <snip> > > > > > diff --git a/include/linux/rcupdate_wait.h b/include/linux/rcupdate_wait.h > > > > > index 699b938358bf..e1a4cca9a208 100644 > > > > > --- a/include/linux/rcupdate_wait.h > > > > > +++ b/include/linux/rcupdate_wait.h > > > > > @@ -9,6 +9,8 @@ > > > > > #include <linux/rcupdate.h> > > > > > #include <linux/completion.h> > > > > > > > > > > +extern struct llist_head gp_wait_llist; > > > > > + > > > > > /* > > > > > * Structure allowing asynchronous waiting on RCU. > > > > > */ > > > > > diff --git a/kernel/rcu/tree.c b/kernel/rcu/tree.c > > > > > index ee27a03d7576..50b81ca54104 100644 > > > > > --- a/kernel/rcu/tree.c > > > > > +++ b/kernel/rcu/tree.c > > > > > @@ -113,6 +113,9 @@ int rcu_num_lvls __read_mostly = RCU_NUM_LVLS; > > > > > int num_rcu_lvl[] = NUM_RCU_LVL_INIT; > > > > > int rcu_num_nodes __read_mostly = NUM_RCU_NODES; /* Total # rcu_nodes in use. */ > > > > > > > > > > +/* Waiters for a GP kthread. */ > > > > > +LLIST_HEAD(gp_wait_llist); This being a single global will of course fail due to memory contention on large systems. So a patch that is ready for mainline must either have per-rcu_node-structure lists or similar. 
> > > > > /* > > > > > * The rcu_scheduler_active variable is initialized to the value > > > > > * RCU_SCHEDULER_INACTIVE and transitions RCU_SCHEDULER_INIT just before the > > > > > @@ -1776,6 +1779,14 @@ static noinline void rcu_gp_cleanup(void) > > > > > on_each_cpu(rcu_strict_gp_boundary, NULL, 0); > > > > > } > > > > > > > > > > +static void rcu_notify_gp_end(struct llist_node *llist) And calling this directly from rcu_gp_kthread() is a no-go for large systems because the large number of wakeups will CPU-bound that kthread. Also, it would be better to invoke this from rcu_gp_cleanup(). One option would be to do the wakeups from a workqueue handler. You might also want to have an array of lists indexed by the bottom few bits of the RCU grace-period sequence number. This would reduce the number of spurious wakeups. > > > > > +{ > > > > > + struct llist_node *rcu, *next; > > > > > + > > > > > + llist_for_each_safe(rcu, next, llist) > > > > > + complete(&((struct rcu_synchronize *) rcu)->completion); If you don't eliminate spurious wakeups, it is necessary to do something like checking poll_state_synchronize_rcu() to reject those wakeups. Thanx, Paul > > > > > > > > This looks broken to me, so the synchronize will complete even > > > > if it was called in the middle of an ongoing GP? > > > > > > > Do you mean before replacing the list(and after rcu_gp_cleanup()) a new > > > GP sequence can be initiated? > > > > I guess I mean rcu_notify_gp_end() is called at the end of the current > > grace period, which might be the grace period which started _before_ > > the synchronize_rcu() was called. So the callback needs to be invoked > > after the end of the next grace period, not the current one. > > > > Did I miss some part of your patch that is handling this? > > > No, you did not! That was my fault in placing llist_del_all() into > inappropriate place. We have to guarantee a full grace period. But > this is a prototype and kind of kick off :) > > -- > Uladzislau Rezki
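The "array of lists indexed by the bottom few bits of the grace-period sequence number" suggestion can be sketched as follows. This is an editor's illustration: the bucket count and all names are invented, and the per-bucket llists are modeled as plain counters so the bucketing logic itself is visible.

```c
#include <assert.h>

#define GP_WAIT_BUCKETS	4			/* power of two, invented */
#define GP_BUCKET(seq)	((seq) & (GP_WAIT_BUCKETS - 1))

static int nr_waiters[GP_WAIT_BUCKETS];		/* stand-in for per-bucket llists */

/* Enqueue against the GP sequence number this waiter needs to end. */
static void gp_enqueue_waiter(unsigned long target_seq)
{
	nr_waiters[GP_BUCKET(target_seq)]++;
}

/*
 * At the end of grace period 'seq', wake only that bucket; waiters
 * bound to later sequence numbers stay asleep, which is what cuts
 * down the spurious wakeups.
 */
static int gp_drain_bucket(unsigned long seq)
{
	int n = nr_waiters[GP_BUCKET(seq)];

	nr_waiters[GP_BUCKET(seq)] = 0;
	return n;
}
```

With a single global list, every GP end would have to visit (and potentially spuriously wake) every waiter; bucketing confines each GP end to the waiters actually bound to it.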
On Mon, Mar 13, 2023 at 11:56:34AM -0700, Paul E. McKenney wrote: > On Mon, Mar 13, 2023 at 07:12:07PM +0100, Uladzislau Rezki wrote: > > On Mon, Mar 13, 2023 at 11:49:58AM -0400, Joel Fernandes wrote: > > > On Mon, Mar 13, 2023 at 11:32 AM Uladzislau Rezki <urezki@gmail.com> wrote: > > > > > > > > On Mon, Mar 13, 2023 at 06:58:30AM -0700, Joel Fernandes wrote: > > > > > > > > > > > > > > > > On Mar 13, 2023, at 2:51 AM, Uladzislau Rezki <urezki@gmail.com> wrote: > > > > > > > > > > > > On Fri, Mar 10, 2023 at 10:24:34PM -0800, Paul E. McKenney wrote: > > > > > >>> On Fri, Mar 10, 2023 at 09:55:02AM +0100, Uladzislau Rezki wrote: > > > > > >>> On Thu, Mar 09, 2023 at 10:10:56PM +0000, Joel Fernandes wrote: > > > > > >>>> On Thu, Mar 09, 2023 at 01:57:42PM +0100, Uladzislau Rezki wrote: > > > > > >>>> [..] > > > > > >>>>>>>>>> See this commit: > > > > > >>>>>>>>>> > > > > > >>>>>>>>>> 3705b88db0d7cc ("rcu: Add a module parameter to force use of > > > > > >>>>>>>>>> expedited RCU primitives") > > > > > >>>>>>>>>> > > > > > >>>>>>>>>> Antti provided this commit precisely in order to allow Android > > > > > >>>>>>>>>> devices to expedite the boot process and to shut off the > > > > > >>>>>>>>>> expediting at a time of Android userspace's choosing. So Android > > > > > >>>>>>>>>> has been making this work for about ten years, which strikes me > > > > > >>>>>>>>>> as an adequate proof of concept. ;-) > > > > > >>>>>>>>> > > > > > >>>>>>>>> Thanks for the pointer. That's true. Looking at Android sources, I > > > > > >>>>>>>>> find that Android Mediatek devices at least are setting > > > > > >>>>>>>>> rcu_expedited to 1 at late stage of their userspace boot (which is > > > > > >>>>>>>>> weird, it should be set to 1 as early as possible), and > > > > > >>>>>>>>> interestingly I cannot find them resetting it back to 0!. Maybe > > > > > >>>>>>>>> they set rcu_normal to 1? But I cannot find that either. Vlad? :P > > > > > >>>>>>>> > > > > > >>>>>>>> Interesting. 
Though this is consistent with Antti's commit log, > > > > > >>>>>>>> where he talks about expediting grace periods but not unexpediting > > > > > >>>>>>>> them. > > > > > >>>>>>>> > > > > > >>>>>>> Do you think we need to unexpedite it? :)))) > > > > > >>>>>> > > > > > >>>>>> Android runs on smallish systems, so quite possibly not! > > > > > >>>>>> > > > > > >>>>> We keep it enabled and never unexpedite it. The reason is a performance. I > > > > > >>>>> have done some app-launch time analysis with enabling and disabling of it. > > > > > >>>>> > > > > > >>>>> An expedited case is much better when it comes to app launch time. It > > > > > >>>>> requires ~25% less time to run an app comparing with unexpedited variant. > > > > > >>>>> So we have a big gain here. > > > > > >>>> > > > > > >>>> Wow, that's huge. I wonder if you can dig deeper and find out why that is so > > > > > >>>> as the callbacks may need to be synchronize_rcu_expedited() then, as it could > > > > > >>>> be slowing down other usecases! I find it hard to believe, real-time > > > > > >>>> workloads will run better without those callbacks being always-expedited if > > > > > >>>> it actually gives back 25% in performance! > > > > > >>>> > > > > > >>> I can dig further, but on a high level i think there are some spots > > > > > >>> which show better performance if expedited is set. I mean synchronize_rcu() > > > > > >>> becomes as "less blocking a context" from a time point of view. > > > > > >>> > > > > > >>> The problem of a regular synchronize_rcu() is - it can trigger a big latency > > > > > >>> delays for a caller. For example for nocb case we do not know where in a list > > > > > >>> our callback is located and when it is invoked to unblock a caller. > > > > > >> > > > > > >> True, expedited RCU grace periods do not have this callback-invocation > > > > > >> delay that normal RCU does. > > > > > >> > > > > > >>> I have already mentioned somewhere. 
Probably it makes sense to directly wake-up > > > > > >>> callers from the GP kthread instead and not via nocb-kthread that invokes our callbacks > > > > > >>> one by one. > > > > > >> > > > > > >> Makes sense, but it is necessary to be careful. Wakeups are not fast, > > > > > >> so making the RCU grace-period kthread do them all sequentially is not > > > > > >> a strategy to win. For example, note that the next expedited grace > > > > > >> period can start before the previous expedited grace period has finished > > > > > >> its wakeups. > > > > > >> > > > > > > I hove done a small and quick prototype: > > > > > > > > > > > > <snip> > > > > > > diff --git a/include/linux/rcupdate_wait.h b/include/linux/rcupdate_wait.h > > > > > > index 699b938358bf..e1a4cca9a208 100644 > > > > > > --- a/include/linux/rcupdate_wait.h > > > > > > +++ b/include/linux/rcupdate_wait.h > > > > > > @@ -9,6 +9,8 @@ > > > > > > #include <linux/rcupdate.h> > > > > > > #include <linux/completion.h> > > > > > > > > > > > > +extern struct llist_head gp_wait_llist; > > > > > > + > > > > > > /* > > > > > > * Structure allowing asynchronous waiting on RCU. > > > > > > */ > > > > > > diff --git a/kernel/rcu/tree.c b/kernel/rcu/tree.c > > > > > > index ee27a03d7576..50b81ca54104 100644 > > > > > > --- a/kernel/rcu/tree.c > > > > > > +++ b/kernel/rcu/tree.c > > > > > > @@ -113,6 +113,9 @@ int rcu_num_lvls __read_mostly = RCU_NUM_LVLS; > > > > > > int num_rcu_lvl[] = NUM_RCU_LVL_INIT; > > > > > > int rcu_num_nodes __read_mostly = NUM_RCU_NODES; /* Total # rcu_nodes in use. */ > > > > > > > > > > > > +/* Waiters for a GP kthread. */ > > > > > > +LLIST_HEAD(gp_wait_llist); > > This being a single global will of course fail due to memory contention > on large systems. So a patch that is ready for mainline must either > have per-rcu_node-structure lists or similar. > I agree. This is a prototype and the aim is a proof of concept :) On bigger systems gp can starve if it wake-ups a lot of users. 
At least I see that a camera app improves in terms of launch time, by around 12%. > > > > > > /* > > > > > > * The rcu_scheduler_active variable is initialized to the value > > > > > > * RCU_SCHEDULER_INACTIVE and transitions RCU_SCHEDULER_INIT just before the > > > > > > @@ -1776,6 +1779,14 @@ static noinline void rcu_gp_cleanup(void) > > > > > > on_each_cpu(rcu_strict_gp_boundary, NULL, 0); > > > > > > } > > > > > > > > > > > > +static void rcu_notify_gp_end(struct llist_node *llist) > > And calling this directly from rcu_gp_kthread() is a no-go for large > > systems because the large number of wakeups will CPU-bound that kthread. > > Also, it would be better to invoke this from rcu_gp_cleanup(). > > > > One option would be to do the wakeups from a workqueue handler. > > > > You might also want to have an array of lists indexed by the bottom few > > bits of the RCU grace-period sequence number. This would reduce the > > number of spurious wakeups. > > > > > > > +{ > > > > > > + struct llist_node *rcu, *next; > > > > > > + > > > > > > + llist_for_each_safe(rcu, next, llist) > > > > > > + complete(&((struct rcu_synchronize *) rcu)->completion); > > If you don't eliminate spurious wakeups, it is necessary to do something > > like checking poll_state_synchronize_rcu() reject those wakeups. > > OK. I will come up with some data and figures soon. -- Uladzislau Rezki
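The poll_state_synchronize_rcu()-style filtering of spurious wakeups can be modeled with a grace-period cookie. This is an editor's sketch loosely in the spirit of the get_state_synchronize_rcu()/poll_state_synchronize_rcu() pairing; a bare counter stands in for the kernel's gp_seq encoding, so the details are deliberately simplified and all names are invented.

```c
#include <assert.h>
#include <stdbool.h>

static unsigned long gp_seq;	/* stand-in for the global GP counter */

/* Cookie naming a grace period that has not yet completed. */
static unsigned long gp_get_state(void)
{
	return gp_seq + 1;
}

/* Has the grace period named by the cookie completed? */
static bool gp_poll_state(unsigned long cookie)
{
	return gp_seq >= cookie;
}

/*
 * On a (possibly spurious) wakeup, complete the waiter only when its
 * cookie shows a full grace period has elapsed; otherwise re-wait.
 */
static bool gp_handle_wakeup(unsigned long cookie)
{
	return gp_poll_state(cookie);
}
```

In other words, even if a wakeup arrives too early, the cookie check keeps the waiter from completing before its grace period has actually ended.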
On Tue, Mar 14, 2023 at 12:16:51PM +0100, Uladzislau Rezki wrote: > On Mon, Mar 13, 2023 at 11:56:34AM -0700, Paul E. McKenney wrote: > > On Mon, Mar 13, 2023 at 07:12:07PM +0100, Uladzislau Rezki wrote: > > > On Mon, Mar 13, 2023 at 11:49:58AM -0400, Joel Fernandes wrote: > > > > On Mon, Mar 13, 2023 at 11:32 AM Uladzislau Rezki <urezki@gmail.com> wrote: > > > > > On Mon, Mar 13, 2023 at 06:58:30AM -0700, Joel Fernandes wrote: > > > > > > > On Mar 13, 2023, at 2:51 AM, Uladzislau Rezki <urezki@gmail.com> wrote: > > > > > > > > > > > > > > On Fri, Mar 10, 2023 at 10:24:34PM -0800, Paul E. McKenney wrote: > > > > > > >>> On Fri, Mar 10, 2023 at 09:55:02AM +0100, Uladzislau Rezki wrote: > > > > > > >>> On Thu, Mar 09, 2023 at 10:10:56PM +0000, Joel Fernandes wrote: > > > > > > >>>> On Thu, Mar 09, 2023 at 01:57:42PM +0100, Uladzislau Rezki wrote: > > > > > > >>>> [..] > > > > > > >>>>>>>>>> See this commit: > > > > > > >>>>>>>>>> > > > > > > >>>>>>>>>> 3705b88db0d7cc ("rcu: Add a module parameter to force use of > > > > > > >>>>>>>>>> expedited RCU primitives") > > > > > > >>>>>>>>>> > > > > > > >>>>>>>>>> Antti provided this commit precisely in order to allow Android > > > > > > >>>>>>>>>> devices to expedite the boot process and to shut off the > > > > > > >>>>>>>>>> expediting at a time of Android userspace's choosing. So Android > > > > > > >>>>>>>>>> has been making this work for about ten years, which strikes me > > > > > > >>>>>>>>>> as an adequate proof of concept. ;-) > > > > > > >>>>>>>>> > > > > > > >>>>>>>>> Thanks for the pointer. That's true. Looking at Android sources, I > > > > > > >>>>>>>>> find that Android Mediatek devices at least are setting > > > > > > >>>>>>>>> rcu_expedited to 1 at late stage of their userspace boot (which is > > > > > > >>>>>>>>> weird, it should be set to 1 as early as possible), and > > > > > > >>>>>>>>> interestingly I cannot find them resetting it back to 0!. Maybe > > > > > > >>>>>>>>> they set rcu_normal to 1? 
But I cannot find that either. Vlad? :P > > > > > > >>>>>>>> > > > > > > >>>>>>>> Interesting. Though this is consistent with Antti's commit log, > > > > > > >>>>>>>> where he talks about expediting grace periods but not unexpediting > > > > > > >>>>>>>> them. > > > > > > >>>>>>>> > > > > > > >>>>>>> Do you think we need to unexpedite it? :)))) > > > > > > >>>>>> > > > > > > >>>>>> Android runs on smallish systems, so quite possibly not! > > > > > > >>>>>> > > > > > > >>>>> We keep it enabled and never unexpedite it. The reason is a performance. I > > > > > > >>>>> have done some app-launch time analysis with enabling and disabling of it. > > > > > > >>>>> > > > > > > >>>>> An expedited case is much better when it comes to app launch time. It > > > > > > >>>>> requires ~25% less time to run an app comparing with unexpedited variant. > > > > > > >>>>> So we have a big gain here. > > > > > > >>>> > > > > > > >>>> Wow, that's huge. I wonder if you can dig deeper and find out why that is so > > > > > > >>>> as the callbacks may need to be synchronize_rcu_expedited() then, as it could > > > > > > >>>> be slowing down other usecases! I find it hard to believe, real-time > > > > > > >>>> workloads will run better without those callbacks being always-expedited if > > > > > > >>>> it actually gives back 25% in performance! > > > > > > >>>> > > > > > > >>> I can dig further, but on a high level i think there are some spots > > > > > > >>> which show better performance if expedited is set. I mean synchronize_rcu() > > > > > > >>> becomes as "less blocking a context" from a time point of view. > > > > > > >>> > > > > > > >>> The problem of a regular synchronize_rcu() is - it can trigger a big latency > > > > > > >>> delays for a caller. For example for nocb case we do not know where in a list > > > > > > >>> our callback is located and when it is invoked to unblock a caller. 
> > > > > > >> > > > > > > >> True, expedited RCU grace periods do not have this callback-invocation > > > > > > >> delay that normal RCU does. > > > > > > >> > > > > > > >>> I have already mentioned somewhere. Probably it makes sense to directly wake-up > > > > > > >>> callers from the GP kthread instead and not via nocb-kthread that invokes our callbacks > > > > > > >>> one by one. > > > > > > >> > > > > > > >> Makes sense, but it is necessary to be careful. Wakeups are not fast, > > > > > > >> so making the RCU grace-period kthread do them all sequentially is not > > > > > > >> a strategy to win. For example, note that the next expedited grace > > > > > > >> period can start before the previous expedited grace period has finished > > > > > > >> its wakeups. > > > > > > >> > > > > > > > I hove done a small and quick prototype: > > > > > > > > > > > > > > <snip> > > > > > > > diff --git a/include/linux/rcupdate_wait.h b/include/linux/rcupdate_wait.h > > > > > > > index 699b938358bf..e1a4cca9a208 100644 > > > > > > > --- a/include/linux/rcupdate_wait.h > > > > > > > +++ b/include/linux/rcupdate_wait.h > > > > > > > @@ -9,6 +9,8 @@ > > > > > > > #include <linux/rcupdate.h> > > > > > > > #include <linux/completion.h> > > > > > > > > > > > > > > +extern struct llist_head gp_wait_llist; > > > > > > > + > > > > > > > /* > > > > > > > * Structure allowing asynchronous waiting on RCU. > > > > > > > */ > > > > > > > diff --git a/kernel/rcu/tree.c b/kernel/rcu/tree.c > > > > > > > index ee27a03d7576..50b81ca54104 100644 > > > > > > > --- a/kernel/rcu/tree.c > > > > > > > +++ b/kernel/rcu/tree.c > > > > > > > @@ -113,6 +113,9 @@ int rcu_num_lvls __read_mostly = RCU_NUM_LVLS; > > > > > > > int num_rcu_lvl[] = NUM_RCU_LVL_INIT; > > > > > > > int rcu_num_nodes __read_mostly = NUM_RCU_NODES; /* Total # rcu_nodes in use. */ > > > > > > > > > > > > > > +/* Waiters for a GP kthread. 
*/ > > > > > > > +LLIST_HEAD(gp_wait_llist); > > > > This being a single global will of course fail due to memory contention > > on large systems. So a patch that is ready for mainline must have > > per-rcu_node-structure lists or similar. > > > I agree. This is a prototype and the aim is a proof of concept :) > On bigger systems the GP kthread can starve if it wakes up a lot of users. > > At least I see that a camera-app improves in terms of launch time. > It is around 12%. Understood and agreed, lack of scalability is OK for a prototype for testing purposes. > > > > > > > /* > > > > > > > * The rcu_scheduler_active variable is initialized to the value > > > > > > > * RCU_SCHEDULER_INACTIVE and transitions RCU_SCHEDULER_INIT just before the > > > > > > > @@ -1776,6 +1779,14 @@ static noinline void rcu_gp_cleanup(void) > > > > > > > on_each_cpu(rcu_strict_gp_boundary, NULL, 0); > > > > > > > } > > > > > > > > > > > > > > +static void rcu_notify_gp_end(struct llist_node *llist) > > > > And calling this directly from rcu_gp_kthread() is a no-go for large > > systems because the large number of wakeups will CPU-bound that kthread. > > Also, it would be better to invoke this from rcu_gp_cleanup(). > > > > One option would be to do the wakeups from a workqueue handler. > > > > You might also want to have an array of lists indexed by the bottom few > > bits of the RCU grace-period sequence number. This would reduce the > > number of spurious wakeups. > > > > > > > > > +{ > > > > > > > + struct llist_node *rcu, *next; > > > > > > > + > > > > > > > + llist_for_each_safe(rcu, next, llist) > > > > > > > + complete(&((struct rcu_synchronize *) rcu)->completion); > > > > If you don't eliminate spurious wakeups, it is necessary to do something > > like checking poll_state_synchronize_rcu() to reject those wakeups. > > > OK. > > I will come up with some data and figures soon. Sounds good! Thanx, Paul
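[Editor's note] Paul's suggestion above — an array of waiter lists indexed by the bottom few bits of the grace-period sequence number, so a finishing grace period wakes only the waiters it actually satisfies — can be sketched in plain userspace C. This is only a model of the idea, not kernel code; all names here (wait_bucket, gp_bucket, enqueue_waiter, wake_bucket) are invented for illustration:

```c
#include <assert.h>
#include <stddef.h>

/*
 * Userspace model of bucketed grace-period waiters. A single global
 * list forces every GP end to scan (and potentially wake) everyone;
 * indexing by the low bits of the target GP number means a finishing
 * GP only touches its own bucket, and the target_gp check rejects the
 * remaining spurious candidates that share a bucket.
 */
#define WAIT_BUCKETS 4 /* must be a power of two */

struct waiter {
	unsigned long target_gp; /* GP number this waiter needs */
	struct waiter *next;
	int woken;
};

static struct waiter *wait_bucket[WAIT_BUCKETS];

static unsigned int gp_bucket(unsigned long gp_seq)
{
	return gp_seq & (WAIT_BUCKETS - 1);
}

/* Enqueue a waiter that needs gp_seq to complete before waking. */
static void enqueue_waiter(struct waiter *w, unsigned long target_gp)
{
	unsigned int b = gp_bucket(target_gp);

	w->target_gp = target_gp;
	w->woken = 0;
	w->next = wait_bucket[b];
	wait_bucket[b] = w;
}

/*
 * Called at GP cleanup: visit only the finished GP's bucket, wake only
 * the waiters whose target has passed. Returns the number woken.
 */
static int wake_bucket(unsigned long done_gp)
{
	unsigned int b = gp_bucket(done_gp);
	struct waiter *w;
	int n = 0;

	for (w = wait_bucket[b]; w; w = w->next) {
		if (w->target_gp <= done_gp) { /* reject spurious wakeups */
			w->woken = 1;
			n++;
		}
	}
	return n;
}
```

A real version would also splice satisfied waiters off the list, keep the lists per-rcu_node rather than global, and do the actual wakeups from a workqueue handler rather than the GP kthread, as Paul suggests.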
On Mon, Mar 13, 2023 at 11:32 AM Uladzislau Rezki <urezki@gmail.com> wrote: > > On Mon, Mar 13, 2023 at 06:58:30AM -0700, Joel Fernandes wrote: > > > > > > > On Mar 13, 2023, at 2:51 AM, Uladzislau Rezki <urezki@gmail.com> wrote: > > > > > > On Fri, Mar 10, 2023 at 10:24:34PM -0800, Paul E. McKenney wrote: > > >>> On Fri, Mar 10, 2023 at 09:55:02AM +0100, Uladzislau Rezki wrote: > > >>> On Thu, Mar 09, 2023 at 10:10:56PM +0000, Joel Fernandes wrote: > > >>>> On Thu, Mar 09, 2023 at 01:57:42PM +0100, Uladzislau Rezki wrote: > > >>>> [..] > > >>>>>>>>>> See this commit: > > >>>>>>>>>> > > >>>>>>>>>> 3705b88db0d7cc ("rcu: Add a module parameter to force use of > > >>>>>>>>>> expedited RCU primitives") > > >>>>>>>>>> > > >>>>>>>>>> Antti provided this commit precisely in order to allow Android > > >>>>>>>>>> devices to expedite the boot process and to shut off the > > >>>>>>>>>> expediting at a time of Android userspace's choosing. So Android > > >>>>>>>>>> has been making this work for about ten years, which strikes me > > >>>>>>>>>> as an adequate proof of concept. ;-) > > >>>>>>>>> > > >>>>>>>>> Thanks for the pointer. That's true. Looking at Android sources, I > > >>>>>>>>> find that Android Mediatek devices at least are setting > > >>>>>>>>> rcu_expedited to 1 at late stage of their userspace boot (which is > > >>>>>>>>> weird, it should be set to 1 as early as possible), and > > >>>>>>>>> interestingly I cannot find them resetting it back to 0!. Maybe > > >>>>>>>>> they set rcu_normal to 1? But I cannot find that either. Vlad? :P > > >>>>>>>> > > >>>>>>>> Interesting. Though this is consistent with Antti's commit log, > > >>>>>>>> where he talks about expediting grace periods but not unexpediting > > >>>>>>>> them. > > >>>>>>>> > > >>>>>>> Do you think we need to unexpedite it? :)))) > > >>>>>> > > >>>>>> Android runs on smallish systems, so quite possibly not! > > >>>>>> > > >>>>> We keep it enabled and never unexpedite it. The reason is a performance. 
I > > >>>>> have done some app-launch time analysis with enabling and disabling of it. > > >>>>> > > >>>>> An expedited case is much better when it comes to app launch time. It > > >>>>> requires ~25% less time to run an app comparing with unexpedited variant. > > >>>>> So we have a big gain here. > > >>>> > > >>>> Wow, that's huge. I wonder if you can dig deeper and find out why that is so > > >>>> as the callbacks may need to be synchronize_rcu_expedited() then, as it could > > >>>> be slowing down other usecases! I find it hard to believe, real-time > > >>>> workloads will run better without those callbacks being always-expedited if > > >>>> it actually gives back 25% in performance! > > >>>> > > >>> I can dig further, but on a high level i think there are some spots > > >>> which show better performance if expedited is set. I mean synchronize_rcu() > > >>> becomes as "less blocking a context" from a time point of view. > > >>> > > >>> The problem of a regular synchronize_rcu() is - it can trigger a big latency > > >>> delays for a caller. For example for nocb case we do not know where in a list > > >>> our callback is located and when it is invoked to unblock a caller. > > >> > > >> True, expedited RCU grace periods do not have this callback-invocation > > >> delay that normal RCU does. > > >> > > >>> I have already mentioned somewhere. Probably it makes sense to directly wake-up > > >>> callers from the GP kthread instead and not via nocb-kthread that invokes our callbacks > > >>> one by one. > > >> > > >> Makes sense, but it is necessary to be careful. Wakeups are not fast, > > >> so making the RCU grace-period kthread do them all sequentially is not > > >> a strategy to win. For example, note that the next expedited grace > > >> period can start before the previous expedited grace period has finished > > >> its wakeups. 
> > >> > > > I hove done a small and quick prototype: > > > > > > <snip> > > > diff --git a/include/linux/rcupdate_wait.h b/include/linux/rcupdate_wait.h > > > index 699b938358bf..e1a4cca9a208 100644 > > > --- a/include/linux/rcupdate_wait.h > > > +++ b/include/linux/rcupdate_wait.h > > > @@ -9,6 +9,8 @@ > > > #include <linux/rcupdate.h> > > > #include <linux/completion.h> > > > > > > +extern struct llist_head gp_wait_llist; > > > + > > > /* > > > * Structure allowing asynchronous waiting on RCU. > > > */ > > > diff --git a/kernel/rcu/tree.c b/kernel/rcu/tree.c > > > index ee27a03d7576..50b81ca54104 100644 > > > --- a/kernel/rcu/tree.c > > > +++ b/kernel/rcu/tree.c > > > @@ -113,6 +113,9 @@ int rcu_num_lvls __read_mostly = RCU_NUM_LVLS; > > > int num_rcu_lvl[] = NUM_RCU_LVL_INIT; > > > int rcu_num_nodes __read_mostly = NUM_RCU_NODES; /* Total # rcu_nodes in use. */ > > > > > > +/* Waiters for a GP kthread. */ > > > +LLIST_HEAD(gp_wait_llist); > > > + > > > /* > > > * The rcu_scheduler_active variable is initialized to the value > > > * RCU_SCHEDULER_INACTIVE and transitions RCU_SCHEDULER_INIT just before the > > > @@ -1776,6 +1779,14 @@ static noinline void rcu_gp_cleanup(void) > > > on_each_cpu(rcu_strict_gp_boundary, NULL, 0); > > > } > > > > > > +static void rcu_notify_gp_end(struct llist_node *llist) > > > +{ > > > + struct llist_node *rcu, *next; > > > + > > > + llist_for_each_safe(rcu, next, llist) > > > + complete(&((struct rcu_synchronize *) rcu)->completion); > > > > This looks broken to me, so the synchronize will complete even > > if it was called in the middle of an ongoing GP? > > > Do you mean before replacing the list(and after rcu_gp_cleanup()) a new > GP sequence can be initiated? It looks interesting, I am happy to try it on ChromeOS once you provide a patch, in case it improves something, even if that is suspend or boot time. 
I think my main concern was that if you did not wait for a full grace period (which, as you indicated, you would fix), you are not really measuring the long delays that a full grace period can cause, so IMHO it is important to measure only once correctness is preserved by the modification. To that end, perhaps having rcutorture pass with your modification could be a vote of confidence before proceeding to performance tests. - Joel
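[Editor's note] The "wait for a full grace period" invariant Joel is pointing at can be modeled with a tiny sequence counter. The kernel expresses the same thing with grace-period sequence snapshots (get_state_synchronize_rcu()/poll_state_synchronize_rcu()); this is a deliberately simplified userspace sketch of the invariant, not the kernel's actual rcu_seq encoding:

```c
#include <assert.h>

/*
 * Minimal model: gp_seq is even when idle, odd while a grace period is
 * in progress. A waiter that enqueues while a GP is already running
 * must NOT be completed when that GP ends -- the running GP may not
 * cover readers that began just before the waiter's call -- so it needs
 * the *next* full grace period as well.
 */
static unsigned long gp_seq; /* even: idle, odd: GP in progress */

static void gp_start(void) { gp_seq++; } /* now odd */
static void gp_end(void)   { gp_seq++; } /* now even */

/* Snapshot taken by a synchronize_rcu()-style waiter at enqueue time. */
static unsigned long get_gp_snapshot(void)
{
	/* Idle: one full GP suffices. Mid-GP: need the one after it too. */
	return (gp_seq & 1) ? gp_seq + 3 : gp_seq + 2;
}

/* May a waiter holding this snapshot be woken yet? */
static int gp_elapsed(unsigned long snap)
{
	return gp_seq >= snap;
}
```

Completing a waiter whose snapshot has not elapsed — as the prototype's list replacement could — would return from synchronize_rcu() too early, which is why rcutorture should pass before any performance numbers are trusted.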
On Wed, Mar 08, 2023 at 11:14:40AM +0100, Uladzislau Rezki wrote: > On Tue, Mar 07, 2023 at 08:48:52AM -0500, Joel Fernandes wrote: > > On Tue, Mar 7, 2023 at 8:40 AM Uladzislau Rezki <urezki@gmail.com> wrote: > > > > > > On Tue, Mar 07, 2023 at 02:01:54PM +0100, Frederic Weisbecker wrote: > > > > On Fri, Mar 03, 2023 at 09:38:51PM +0000, Joel Fernandes (Google) wrote: > > > > > On many systems, a great deal of boot (in userspace) happens after the > > > > > kernel thinks the boot has completed. It is difficult to determine if > > > > > the system has really booted from the kernel side. Some features like > > > > > lazy-RCU can risk slowing down boot time if, say, a callback has been > > > > > added that the boot synchronously depends on. Further expedited callbacks > > > > > can get unexpedited way earlier than it should be, thus slowing down > > > > > boot (as shown in the data below). > > > > > > > > > > For these reasons, this commit adds a config option > > > > > 'CONFIG_RCU_BOOT_END_DELAY' and a boot parameter rcupdate.boot_end_delay. > > > > > Userspace can also make RCU's view of the system as booted, by writing the > > > > > time in milliseconds to: /sys/module/rcupdate/parameters/rcu_boot_end_delay > > > > > Or even just writing a value of 0 to this sysfs node. > > > > > However, under no circumstance will the boot be allowed to end earlier > > > > > than just before init is launched. > > > > > > > > > > The default value of CONFIG_RCU_BOOT_END_DELAY is chosen as 15s. This > > > > > suites ChromeOS and also a PREEMPT_RT system below very well, which need > > > > > no config or parameter changes, and just a simple application of this patch. A > > > > > system designer can also choose a specific value here to keep RCU from marking > > > > > boot completion. 
As noted earlier, RCU's perspective of the system as booted > > > > > will not be marker until at least rcu_boot_end_delay milliseconds have passed > > > > > or an update is made via writing a small value (or 0) in milliseconds to: > > > > > /sys/module/rcupdate/parameters/rcu_boot_end_delay. > > > > > > > > > > One side-effect of this patch is, there is a risk that a real-time workload > > > > > launched just after the kernel boots will suffer interruptions due to expedited > > > > > RCU, which previous ended just before init was launched. However, to mitigate > > > > > such an issue (however unlikely), the user should either tune > > > > > CONFIG_RCU_BOOT_END_DELAY to a smaller value than 15 seconds or write a value > > > > > of 0 to /sys/module/rcupdate/parameters/rcu_boot_end_delay, once userspace > > > > > boots, and before launching the real-time workload. > > > > > > > > > > Qiuxu also noted impressive boot-time improvements with earlier version > > > > > of patch. An excerpt from the data he shared: > > > > > > > > > > 1) Testing environment: > > > > > OS : CentOS Stream 8 (non-RT OS) > > > > > Kernel : v6.2 > > > > > Machine : Intel Cascade Lake server (2 sockets, each with 44 logical threads) > > > > > Qemu args : -cpu host -enable-kvm, -smp 88,threads=2,sockets=2, … > > > > > > > > > > 2) OS boot time definition: > > > > > The time from the start of the kernel boot to the shell command line > > > > > prompt is shown from the console. [ Different people may have > > > > > different OS boot time definitions. ] > > > > > > > > > > 3) Measurement method (very rough method): > > > > > A timer in the kernel periodically prints the boot time every 100ms. > > > > > As soon as the shell command line prompt is shown from the console, > > > > > we record the boot time printed by the timer, then the printed boot > > > > > time is the OS boot time. 
> > > > > > > > > > 4) Measured OS boot time (in seconds) > > > > > a) Measured 10 times w/o this patch: > > > > > 8.7s, 8.4s, 8.6s, 8.2s, 9.0s, 8.7s, 8.8s, 9.3s, 8.8s, 8.3s > > > > > The average OS boot time was: ~8.7s > > > > > > > > > > b) Measure 10 times w/ this patch: > > > > > 8.5s, 8.2s, 7.6s, 8.2s, 8.7s, 8.2s, 7.8s, 8.2s, 9.3s, 8.4s > > > > > The average OS boot time was: ~8.3s. > > > > > > > > > > Tested-by: Qiuxu Zhuo <qiuxu.zhuo@intel.com> > > > > > Signed-off-by: Joel Fernandes (Google) <joel@joelfernandes.org> > > > > > > > > I still don't really like that: > > > > > > > > 1) It feels like we are curing a symptom for which we don't know the cause. > > > > Which RCU write side caller is the source of this slow boot? Some tracepoints > > > > reporting the wait duration within synchronize_rcu() calls between the end of > > > > the kernel boot and the end of userspace boot may be helpful. > > > > > > > > 2) The kernel boot was already covered before this patch so this is about > > > > userspace code calling into the kernel. Is that piece of code also called > > > > after the boot? In that case are we missing a conversion from > > > > synchronize_rcu() to synchronize_rcu_expedited() somewhere? Because then > > > > the problem is more general than just boot. > > > > > > > > This needs to be analyzed first and if it happens that the issue really > > > > needs to be fixed with telling the kernel that userspace has completed > > > > booting, eg: because the problem is not in a few callsites that need conversion > > > > to expedited but instead in the accumulation of lots of calls that should stay > > > > as is: > > > > > > > > 3) This arbitrary timeout looks dangerous to me as latency sensitive code > > > > may run right after the boot. Either you choose a value that is too low > > > > and you miss the optimization or the value is too high and you may break > > > > things. 
> > > > > > > > 4) This should be fixed the way you did: > > > > a) a kernel parameter like you did > > > > b) The init process (systemd?) tells the kernel when it judges that userspace > > > > has completed booting. > > > > c) Make these interfaces more generic, maybe that information will be useful > > > > outside RCU. For example the kernel parameter should be > > > > "user_booted_reported" and the sysfs (should be sysctl?): > > > > kernel.user_booted = 1 > > > > d) But yuck, this means we must know if the init process supports that... > > > > > > > > For these reasons, let's make sure we know exactly what is going on first. > > > > > > > > Thanks. > > > Just add some notes and thoughts. There is a rcupdate.rcu_expedited=1 > > > parameter that can be used during the boot. For example on our devices > > > to speedup a boot we boot the kernel with rcu_expedited: > > > > > > XQ-DQ54:/ # cat /proc/cmdline > > > stack_depot_disable=on kasan.stacktrace=off kvm-arm.mode=protected cgroup_disable=pressure console=ttyMSM0,115200n8 loglevel=6 kpti=0 log_buf_len=256K kernel.panic_on_rcu_stall=1 service_locator.enable=1 msm_rtb.filter=0x237 rcupdate.rcu_expedited=1 rcu_nocbs=0-7 ftrace_dump_on_oops swiotlb=noforce loop.max_part=7 fw_devlink.strict=1 allow_mismatched_32bit_el0 cpufreq.default_governor=performance printk.console_no_auto_verbose=1 kasan=off sysctl.kernel.sched_pelt_multiplier=4 can.stats_timer=0 pcie_ports=compat irqaffinity=0-2 disable_dma32=on no-steal-acc cgroup.memory=nokmem,nosocket video=vfb:640x400,bpp=32,memsize=3072000 page_owner=on stack_depot_disable=off printk.console_no_auto_verbose=0 nosoftlockup bootconfig buildvariant=userdebug msm_drm.dsi_display0=somc,1_panel: rootwait ro init=/init qcom_geni_serial.con_enabled=0 oembootloader.startup=0x00000001 oembootloader.warmboot=0x00000000 oembootloader.securityflags=0x00000001 > > > XQ-DQ54:/ # > > > > > > then a user space can decides if it is needed or not: > > > > > > <snip> > > > rcu_expedited 
rcu_normal > > > XQ-DQ54:/ # ls -al /sys/kernel/rcu_* > > > -rw-r--r-- 1 root root 4096 2023-02-16 09:27 /sys/kernel/rcu_expedited > > > -rw-r--r-- 1 root root 4096 2023-02-16 09:27 /sys/kernel/rcu_normal > > > XQ-DQ54:/ # > > > <snip> > > > > > > for lazy we can add a "rcu_cb_lazy" parameter and boot the kernel with > > > true or false. So we can follow and be aligned with the rcu_expedited and > > > rcu_normal parameters. > > > > Speaking of aligning, there is also the automated > > rcu_normal_after_boot boot option, correct? I prefer the automated > > option of doing this. So the approach here is not really unprecedented > > and is much more robust than relying on userspace too much (I am ok > > with adding your suggestion *on top* of the automated toggle, but I > > probably would not have ChromeOS use it if the automated way exists). > > Or did I miss something? > > > According to the name of the rcu_end_inkernel_boot() function and the place > where it is invoked, we can conclude that it marks the end of kernel boot, > and it happens before running the "init" process. > > With your patch we change the behavior. The initialization occurs not right > after the kernel is up and running but rather after a 15-second timeout, which > at least does not correspond to the function name. Apart from that, the expected > behavior might be different, for example for some test suites or smoke tests, etc. > > Another thought about "automated boot complete" is that we do not know from > kernel space when boot really completes for user space, because from kernel > space we are done, and that we can detect. In this case user space is the > right candidate to say when it is ready. > > For example, for Android, boot complete happens when the home screen appears. > For Chrome OS I think there is something similar. There must be a boot-complete > event in its init scripts or something similar. > > These are just my thoughts. I do not really mind, but I also do not see a high > need for having it.
Thanks for your thoughts; perhaps if I am the only one who wants it, then it is a bad idea. Here's hoping I get some more time this week to dig deeper into this... this week has been crazy on the personal front. thanks, - Joel
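[Editor's note] The userspace side being discussed here — init (or the Android/ChromeOS boot-complete hook) telling the kernel that boot is done — amounts to writing a value into a sysfs knob such as /sys/kernel/rcu_expedited or the proposed /sys/module/rcupdate/parameters/rcu_boot_end_delay. A minimal sketch of such a helper (the function name is invented; the paths are the ones from the thread, and whether to write "0" or "1" depends on which knob is used):

```c
#include <stdio.h>
#include <string.h>

/*
 * Write a short string to a sysfs-style knob. Returns 0 on success,
 * -1 on failure. Hypothetical helper, not part of any init system;
 * the caller would pass e.g.
 *   write_knob("/sys/module/rcupdate/parameters/rcu_boot_end_delay", "0")
 * once userspace boot is judged complete.
 */
static int write_knob(const char *path, const char *val)
{
	FILE *f = fopen(path, "w");
	size_t len = strlen(val);

	if (!f)
		return -1;
	if (fwrite(val, 1, len, f) != len) {
		fclose(f);
		return -1;
	}
	return fclose(f) ? -1 : 0;
}
```

This is exactly the "userspace decides" model Vlad describes: the kernel exposes the knob, and the boot-complete event in the init scripts performs the write.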
On Fri, Mar 10, 2023 at 09:55:02AM +0100, Uladzislau Rezki wrote: > On Thu, Mar 09, 2023 at 10:10:56PM +0000, Joel Fernandes wrote: > > On Thu, Mar 09, 2023 at 01:57:42PM +0100, Uladzislau Rezki wrote: > > [..] > > > > > > > > See this commit: > > > > > > > > > > > > > > > > 3705b88db0d7cc ("rcu: Add a module parameter to force use of > > > > > > > > expedited RCU primitives") > > > > > > > > > > > > > > > > Antti provided this commit precisely in order to allow Android > > > > > > > > devices to expedite the boot process and to shut off the > > > > > > > > expediting at a time of Android userspace's choosing. So Android > > > > > > > > has been making this work for about ten years, which strikes me > > > > > > > > as an adequate proof of concept. ;-) > > > > > > > > > > > > > > Thanks for the pointer. That's true. Looking at Android sources, I > > > > > > > find that Android Mediatek devices at least are setting > > > > > > > rcu_expedited to 1 at late stage of their userspace boot (which is > > > > > > > weird, it should be set to 1 as early as possible), and > > > > > > > interestingly I cannot find them resetting it back to 0!. Maybe > > > > > > > they set rcu_normal to 1? But I cannot find that either. Vlad? :P > > > > > > > > > > > > Interesting. Though this is consistent with Antti's commit log, > > > > > > where he talks about expediting grace periods but not unexpediting > > > > > > them. > > > > > > > > > > > Do you think we need to unexpedite it? :)))) > > > > > > > > Android runs on smallish systems, so quite possibly not! > > > > > > > We keep it enabled and never unexpedite it. The reason is a performance. I > > > have done some app-launch time analysis with enabling and disabling of it. > > > > > > An expedited case is much better when it comes to app launch time. It > > > requires ~25% less time to run an app comparing with unexpedited variant. > > > So we have a big gain here. > > > > Wow, that's huge. 
I wonder if you can dig deeper and find out why that is so > > as the callbacks may need to be synchronize_rcu_expedited() then, as it could > > be slowing down other usecases! I find it hard to believe, real-time > > workloads will run better without those callbacks being always-expedited if > > it actually gives back 25% in performance! > > > I can dig further, but on a high level i think there are some spots > which show better performance if expedited is set. I mean synchronize_rcu() > becomes as "less blocking a context" from a time point of view. > > The problem of a regular synchronize_rcu() is - it can trigger a big latency > delays for a caller. For example for nocb case we do not know where in a list > our callback is located and when it is invoked to unblock a caller. > > I have already mentioned somewhere. Probably it makes sense to directly wake-up > callers from the GP kthread instead and not via nocb-kthread that invokes our callbacks > one by one. Looking forward to your optimization. To overcome the wakeup-overhead issue Paul mentioned, I wonder whether it is possible to find out how many tasks there are to wake without much overhead, and for the common case of likely a single task sleeping in synchronize_rcu(), wake that one directly. But there could be dragons... thanks, - Joel
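[Editor's note] Joel's "wake the common single waiter directly" idea could start from a cheap waiter count, along the following lines. This is an illustrative userspace model only — a kernel version would need per-rcu_node counters and careful memory ordering (hence the dragons) — and every name in it is invented:

```c
#include <assert.h>
#include <stdatomic.h>

/*
 * Model: maintain an atomic count of tasks sleeping in a
 * synchronize_rcu()-style wait. At GP end, the GP kthread consults
 * wake_policy(): for the common single-waiter case it can afford one
 * direct wakeup; for many waiters it defers to a batched mechanism
 * (e.g. a workqueue) so the GP kthread itself never becomes CPU-bound
 * doing wakeups.
 */
static atomic_int nr_waiters;

static void waiter_enqueue(void) { atomic_fetch_add(&nr_waiters, 1); }
static void waiter_dequeue(void) { atomic_fetch_sub(&nr_waiters, 1); }

/* Returns 1 for "wake directly", 0 for "defer to batched wakeup". */
static int wake_policy(void)
{
	return atomic_load(&nr_waiters) == 1;
}
```

The count itself is cheap to read, but note the race Joel hints at: the count can change between the load and the wakeup decision, so a real implementation would have to tolerate taking the slow (batched) path spuriously.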
On Wed, Mar 15, 2023 at 12:21:48PM +0000, Joel Fernandes wrote: > On Fri, Mar 10, 2023 at 09:55:02AM +0100, Uladzislau Rezki wrote: > > On Thu, Mar 09, 2023 at 10:10:56PM +0000, Joel Fernandes wrote: > > > On Thu, Mar 09, 2023 at 01:57:42PM +0100, Uladzislau Rezki wrote: > > > [..] > > > > > > > > > See this commit: > > > > > > > > > > > > > > > > > > 3705b88db0d7cc ("rcu: Add a module parameter to force use of > > > > > > > > > expedited RCU primitives") > > > > > > > > > > > > > > > > > > Antti provided this commit precisely in order to allow Android > > > > > > > > > devices to expedite the boot process and to shut off the > > > > > > > > > expediting at a time of Android userspace's choosing. So Android > > > > > > > > > has been making this work for about ten years, which strikes me > > > > > > > > > as an adequate proof of concept. ;-) > > > > > > > > > > > > > > > > Thanks for the pointer. That's true. Looking at Android sources, I > > > > > > > > find that Android Mediatek devices at least are setting > > > > > > > > rcu_expedited to 1 at late stage of their userspace boot (which is > > > > > > > > weird, it should be set to 1 as early as possible), and > > > > > > > > interestingly I cannot find them resetting it back to 0!. Maybe > > > > > > > > they set rcu_normal to 1? But I cannot find that either. Vlad? :P > > > > > > > > > > > > > > Interesting. Though this is consistent with Antti's commit log, > > > > > > > where he talks about expediting grace periods but not unexpediting > > > > > > > them. > > > > > > > > > > > > > Do you think we need to unexpedite it? :)))) > > > > > > > > > > Android runs on smallish systems, so quite possibly not! > > > > > > > > > We keep it enabled and never unexpedite it. The reason is a performance. I > > > > have done some app-launch time analysis with enabling and disabling of it. > > > > > > > > An expedited case is much better when it comes to app launch time. 
It > > > > requires ~25% less time to run an app comparing with unexpedited variant. > > > > So we have a big gain here. > > > > > > Wow, that's huge. I wonder if you can dig deeper and find out why that is so > > > as the callbacks may need to be synchronize_rcu_expedited() then, as it could > > > be slowing down other usecases! I find it hard to believe, real-time > > > workloads will run better without those callbacks being always-expedited if > > > it actually gives back 25% in performance! > > > > > I can dig further, but on a high level i think there are some spots > > which show better performance if expedited is set. I mean synchronize_rcu() > > becomes as "less blocking a context" from a time point of view. > > > > The problem of a regular synchronize_rcu() is - it can trigger a big latency > > delays for a caller. For example for nocb case we do not know where in a list > > our callback is located and when it is invoked to unblock a caller. > > > > I have already mentioned somewhere. Probably it makes sense to directly wake-up > > callers from the GP kthread instead and not via nocb-kthread that invokes our callbacks > > one by one. > > Looking forward to your optimization, I wonder if to overcome the issue Paul > mentioned about wake up overhead, whether it is possible to find out how many > tasks there are to wake without much overhead, and for the common case of > likely one task to wake up which is doing a synchronize_rcu(), wake that up. > But there could be dragons.. A per-rcu_node count of the number of tasks needing wakeups might work. But for best results, there would be an array of such numbers indexed by the low-order bits of the grace-period number (excluding the bottom status bits). The callback-offloading code uses such arrays, for example, though not for counts of sleeping tasks. (There cannot be that many rcuo kthreads per group, so there has been no need to count them.) Thanx, Paul
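[Editor's note] The indexing Paul describes — low-order bits of the grace-period number *excluding* the bottom status bits — can be sketched as follows. SEQ_CTR_SHIFT mirrors the kernel's RCU_SEQ_CTR_SHIFT of 2 (the low two bits of rcu_state.gp_seq carry grace-period state); the slot-array size is illustrative:

```c
#include <assert.h>

/*
 * The RCU grace-period sequence number keeps state in its low bits, so
 * arrays of per-GP counts or wait lists must be indexed by the counter
 * portion only: shift the state bits out first, then mask down to the
 * array size. Otherwise two state transitions of the same GP would land
 * in different slots.
 */
#define SEQ_CTR_SHIFT 2   /* low two bits are grace-period state */
#define NR_SLOTS      4   /* per-node slot-array size; power of two */

/* Counter portion of a grace-period sequence number. */
static unsigned long seq_ctr(unsigned long gp_seq)
{
	return gp_seq >> SEQ_CTR_SHIFT;
}

/* Slot in a per-rcu_node array of waiter counts for this GP. */
static unsigned int seq_slot(unsigned long gp_seq)
{
	return seq_ctr(gp_seq) & (NR_SLOTS - 1);
}
```

With such an array, a finishing grace period only inspects (and wakes) the slot belonging to its own counter value, which is how the callback-offloading code Paul mentions avoids scanning everything.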
On Tue, Mar 14, 2023 at 06:44:44PM -0400, Joel Fernandes wrote: > On Mon, Mar 13, 2023 at 11:32 AM Uladzislau Rezki <urezki@gmail.com> wrote: > > > > On Mon, Mar 13, 2023 at 06:58:30AM -0700, Joel Fernandes wrote: > > > > > > > > > > On Mar 13, 2023, at 2:51 AM, Uladzislau Rezki <urezki@gmail.com> wrote: > > > > > > > > On Fri, Mar 10, 2023 at 10:24:34PM -0800, Paul E. McKenney wrote: > > > >>> On Fri, Mar 10, 2023 at 09:55:02AM +0100, Uladzislau Rezki wrote: > > > >>> On Thu, Mar 09, 2023 at 10:10:56PM +0000, Joel Fernandes wrote: > > > >>>> On Thu, Mar 09, 2023 at 01:57:42PM +0100, Uladzislau Rezki wrote: > > > >>>> [..] > > > >>>>>>>>>> See this commit: > > > >>>>>>>>>> > > > >>>>>>>>>> 3705b88db0d7cc ("rcu: Add a module parameter to force use of > > > >>>>>>>>>> expedited RCU primitives") > > > >>>>>>>>>> > > > >>>>>>>>>> Antti provided this commit precisely in order to allow Android > > > >>>>>>>>>> devices to expedite the boot process and to shut off the > > > >>>>>>>>>> expediting at a time of Android userspace's choosing. So Android > > > >>>>>>>>>> has been making this work for about ten years, which strikes me > > > >>>>>>>>>> as an adequate proof of concept. ;-) > > > >>>>>>>>> > > > >>>>>>>>> Thanks for the pointer. That's true. Looking at Android sources, I > > > >>>>>>>>> find that Android Mediatek devices at least are setting > > > >>>>>>>>> rcu_expedited to 1 at late stage of their userspace boot (which is > > > >>>>>>>>> weird, it should be set to 1 as early as possible), and > > > >>>>>>>>> interestingly I cannot find them resetting it back to 0!. Maybe > > > >>>>>>>>> they set rcu_normal to 1? But I cannot find that either. Vlad? :P > > > >>>>>>>> > > > >>>>>>>> Interesting. Though this is consistent with Antti's commit log, > > > >>>>>>>> where he talks about expediting grace periods but not unexpediting > > > >>>>>>>> them. > > > >>>>>>>> > > > >>>>>>> Do you think we need to unexpedite it? 
:)))) > > > >>>>>> > > > >>>>>> Android runs on smallish systems, so quite possibly not! > > > >>>>>> > > > >>>>> We keep it enabled and never unexpedite it. The reason is a performance. I > > > >>>>> have done some app-launch time analysis with enabling and disabling of it. > > > >>>>> > > > >>>>> An expedited case is much better when it comes to app launch time. It > > > >>>>> requires ~25% less time to run an app comparing with unexpedited variant. > > > >>>>> So we have a big gain here. > > > >>>> > > > >>>> Wow, that's huge. I wonder if you can dig deeper and find out why that is so > > > >>>> as the callbacks may need to be synchronize_rcu_expedited() then, as it could > > > >>>> be slowing down other usecases! I find it hard to believe, real-time > > > >>>> workloads will run better without those callbacks being always-expedited if > > > >>>> it actually gives back 25% in performance! > > > >>>> > > > >>> I can dig further, but on a high level i think there are some spots > > > >>> which show better performance if expedited is set. I mean synchronize_rcu() > > > >>> becomes as "less blocking a context" from a time point of view. > > > >>> > > > >>> The problem of a regular synchronize_rcu() is - it can trigger a big latency > > > >>> delays for a caller. For example for nocb case we do not know where in a list > > > >>> our callback is located and when it is invoked to unblock a caller. > > > >> > > > >> True, expedited RCU grace periods do not have this callback-invocation > > > >> delay that normal RCU does. > > > >> > > > >>> I have already mentioned somewhere. Probably it makes sense to directly wake-up > > > >>> callers from the GP kthread instead and not via nocb-kthread that invokes our callbacks > > > >>> one by one. > > > >> > > > >> Makes sense, but it is necessary to be careful. Wakeups are not fast, > > > >> so making the RCU grace-period kthread do them all sequentially is not > > > >> a strategy to win. 
For example, note that the next expedited grace > > > >> period can start before the previous expedited grace period has finished > > > >> its wakeups. > > > >> > > > > I hove done a small and quick prototype: > > > > > > > > <snip> > > > > diff --git a/include/linux/rcupdate_wait.h b/include/linux/rcupdate_wait.h > > > > index 699b938358bf..e1a4cca9a208 100644 > > > > --- a/include/linux/rcupdate_wait.h > > > > +++ b/include/linux/rcupdate_wait.h > > > > @@ -9,6 +9,8 @@ > > > > #include <linux/rcupdate.h> > > > > #include <linux/completion.h> > > > > > > > > +extern struct llist_head gp_wait_llist; > > > > + > > > > /* > > > > * Structure allowing asynchronous waiting on RCU. > > > > */ > > > > diff --git a/kernel/rcu/tree.c b/kernel/rcu/tree.c > > > > index ee27a03d7576..50b81ca54104 100644 > > > > --- a/kernel/rcu/tree.c > > > > +++ b/kernel/rcu/tree.c > > > > @@ -113,6 +113,9 @@ int rcu_num_lvls __read_mostly = RCU_NUM_LVLS; > > > > int num_rcu_lvl[] = NUM_RCU_LVL_INIT; > > > > int rcu_num_nodes __read_mostly = NUM_RCU_NODES; /* Total # rcu_nodes in use. */ > > > > > > > > +/* Waiters for a GP kthread. */ > > > > +LLIST_HEAD(gp_wait_llist); > > > > + > > > > /* > > > > * The rcu_scheduler_active variable is initialized to the value > > > > * RCU_SCHEDULER_INACTIVE and transitions RCU_SCHEDULER_INIT just before the > > > > @@ -1776,6 +1779,14 @@ static noinline void rcu_gp_cleanup(void) > > > > on_each_cpu(rcu_strict_gp_boundary, NULL, 0); > > > > } > > > > > > > > +static void rcu_notify_gp_end(struct llist_node *llist) > > > > +{ > > > > + struct llist_node *rcu, *next; > > > > + > > > > + llist_for_each_safe(rcu, next, llist) > > > > + complete(&((struct rcu_synchronize *) rcu)->completion); > > > > > > This looks broken to me, so the synchronize will complete even > > > if it was called in the middle of an ongoing GP? > > > > > Do you mean before replacing the list(and after rcu_gp_cleanup()) a new > > GP sequence can be initiated? 
>
> It looks interesting, I am happy to try it on ChromeOS once you
> provide a patch, in case it improves something, even if that is
> suspend or boot time.
>
> I think the main concern I had was if you did not wait for a full
> grace period (which, as you indicated, you would fix), you are not
> really measuring the long delays that the full grace period can cause,
> so IMHO it is important to only measure once correctness is preserved
> by the modification. To that end, perhaps having rcutorture pass with
> your modification could be a vote of confidence before proceeding to
> performance tests.
>
No problem. Please note it is just a proof of concept. Here we go:

<snip>
diff --git a/include/linux/rcupdate_wait.h b/include/linux/rcupdate_wait.h
index 699b938358bf..e1a4cca9a208 100644
--- a/include/linux/rcupdate_wait.h
+++ b/include/linux/rcupdate_wait.h
@@ -9,6 +9,8 @@
 #include <linux/rcupdate.h>
 #include <linux/completion.h>
 
+extern struct llist_head gp_wait_llist;
+
 /*
  * Structure allowing asynchronous waiting on RCU.
  */
diff --git a/kernel/rcu/tree.c b/kernel/rcu/tree.c
index ee27a03d7576..a35b779471eb 100644
--- a/kernel/rcu/tree.c
+++ b/kernel/rcu/tree.c
@@ -113,6 +113,9 @@ int rcu_num_lvls __read_mostly = RCU_NUM_LVLS;
 int num_rcu_lvl[] = NUM_RCU_LVL_INIT;
 int rcu_num_nodes __read_mostly = NUM_RCU_NODES; /* Total # rcu_nodes in use. */
 
+/* Waiters for a GP kthread. */
+LLIST_HEAD(gp_wait_llist);
+
 /*
  * The rcu_scheduler_active variable is initialized to the value
  * RCU_SCHEDULER_INACTIVE and transitions RCU_SCHEDULER_INIT just before the
@@ -1383,7 +1386,7 @@ static void rcu_poll_gp_seq_end_unlocked(unsigned long *snap)
 /*
  * Initialize a new grace period. Return false if no grace period required.
  */
-static noinline_for_stack bool rcu_gp_init(void)
+static noinline_for_stack bool rcu_gp_init(struct llist_node **wait_list)
 {
 	unsigned long flags;
 	unsigned long oldmask;
@@ -1409,6 +1412,12 @@ static noinline_for_stack bool rcu_gp_init(void)
 		return false;
 	}
 
+	/*
+	 * Snapshot callers of synchronize_rcu() for which
+	 * we guarantee a full grace period to be passed.
+	 */
+	*wait_list = llist_del_all(&gp_wait_llist);
+
 	/* Advance to a new grace period and initialize state. */
 	record_gp_stall_check_time();
 	/* Record GP times before starting GP, hence rcu_seq_start(). */
@@ -1776,11 +1785,27 @@ static noinline void rcu_gp_cleanup(void)
 		on_each_cpu(rcu_strict_gp_boundary, NULL, 0);
 }
 
+static void rcu_notify_gp_end(struct llist_node *llist)
+{
+	struct llist_node *rcu, *next;
+	int n = 0;
+
+	llist_for_each_safe(rcu, next, llist) {
+		complete(&((struct rcu_synchronize *) rcu)->completion);
+		n++;
+	}
+
+	if (n)
+		trace_printk("Awoken %d users.\n", n);
+}
+
 /*
  * Body of kthread that handles grace periods.
  */
 static int __noreturn rcu_gp_kthread(void *unused)
 {
+	struct llist_node *wait_list;
+
 	rcu_bind_gp_kthread();
 	for (;;) {
@@ -1795,7 +1820,7 @@ static int __noreturn rcu_gp_kthread(void *unused)
 			rcu_gp_torture_wait();
 			WRITE_ONCE(rcu_state.gp_state, RCU_GP_DONE_GPS);
 			/* Locking provides needed memory barrier. */
-			if (rcu_gp_init())
+			if (rcu_gp_init(&wait_list))
 				break;
 			cond_resched_tasks_rcu_qs();
 			WRITE_ONCE(rcu_state.gp_activity, jiffies);
@@ -1811,6 +1836,9 @@ static int __noreturn rcu_gp_kthread(void *unused)
 		WRITE_ONCE(rcu_state.gp_state, RCU_GP_CLEANUP);
 		rcu_gp_cleanup();
 		WRITE_ONCE(rcu_state.gp_state, RCU_GP_CLEANED);
+
+		/* Wake up synchronize_rcu() users. */
+		rcu_notify_gp_end(wait_list);
 	}
 }
 
diff --git a/kernel/rcu/update.c b/kernel/rcu/update.c
index 19bf6fa3ee6a..483997edd58e 100644
--- a/kernel/rcu/update.c
+++ b/kernel/rcu/update.c
@@ -426,7 +426,10 @@ void __wait_rcu_gp(bool checktiny, int n, call_rcu_func_t *crcu_array,
 		if (j == i) {
 			init_rcu_head_on_stack(&rs_array[i].head);
 			init_completion(&rs_array[i].completion);
-			(crcu_array[i])(&rs_array[i].head, wakeme_after_rcu);
+			llist_add((struct llist_node *) &rs_array[i].head, &gp_wait_llist);
+
+			/* Kick a grace period if needed. */
+			(void) start_poll_synchronize_rcu();
 		}
 	}
<snip>

I do not think that it improves your boot time. My concern, and what I would
like to fix, is:

<snip>
<...>-29    [001] d..1. 21950.145313: rcu_batch_start: rcu_preempt CBs=3613 bl=28
...
<...>-29    [001] ..... 21950.152578: rcu_invoke_callback: rcu_preempt rhp=00000000b2d6dee8 func=__free_vm_area_struct.cfi_jt
<...>-29    [001] ..... 21950.152579: rcu_invoke_callback: rcu_preempt rhp=00000000a446f607 func=__free_vm_area_struct.cfi_jt
<...>-29    [001] ..... 21950.152580: rcu_invoke_callback: rcu_preempt rhp=00000000a5cab03b func=__free_vm_area_struct.cfi_jt
<...>-29    [001] ..... 21950.152581: rcu_invoke_callback: rcu_preempt rhp=0000000013b7e5ee func=__free_vm_area_struct.cfi_jt
<...>-29    [001] ..... 21950.152582: rcu_invoke_callback: rcu_preempt rhp=000000000a8ca6f9 func=__free_vm_area_struct.cfi_jt
<...>-29    [001] ..... 21950.152583: rcu_invoke_callback: rcu_preempt rhp=000000008f162ca8 func=wakeme_after_rcu.cfi_jt
<...>-29    [001] d..1. 21950.152625: rcu_batch_end: rcu_preempt CBs-invoked=3612 idle=....
<snip>

I grabbed that good example (from our phone device) where a user of
synchronize_rcu() is "un-blocked" last, since its callback was the last in the
list.

--
Uladzislau Rezki
diff --git a/Documentation/admin-guide/kernel-parameters.txt b/Documentation/admin-guide/kernel-parameters.txt
index 2429b5e3184b..611de90d9c13 100644
--- a/Documentation/admin-guide/kernel-parameters.txt
+++ b/Documentation/admin-guide/kernel-parameters.txt
@@ -5085,6 +5085,18 @@
 	rcutorture.verbose= [KNL]
 			Enable additional printk() statements.
 
+	rcupdate.rcu_boot_end_delay= [KNL]
+			Minimum time in milliseconds that must elapse
+			before the boot sequence can be marked complete
+			from RCU's perspective, after which RCU's behavior
+			becomes more relaxed. The default value is also
+			configurable via CONFIG_RCU_BOOT_END_DELAY.
+			Userspace can also mark the boot as completed
+			sooner by writing the time in milliseconds, say once
+			userspace considers the system as booted, to:
+			/sys/module/rcupdate/parameters/rcu_boot_end_delay
+			Or even just writing a value of 0 to this sysfs node.
+
 	rcupdate.rcu_cpu_stall_ftrace_dump= [KNL]
 			Dump ftrace buffer after reporting RCU CPU stall
 			warning.
diff --git a/kernel/rcu/Kconfig b/kernel/rcu/Kconfig
index 9071182b1284..4b5ffa36cbaf 100644
--- a/kernel/rcu/Kconfig
+++ b/kernel/rcu/Kconfig
@@ -217,6 +217,25 @@ config RCU_BOOST_DELAY
 
 	  Accept the default if unsure.
 
+config RCU_BOOT_END_DELAY
+	int "Minimum time before RCU may consider in-kernel boot as completed"
+	range 0 120000
+	default 15000
+	help
+	  Default value of the minimum time in milliseconds that must elapse
+	  before the boot sequence can be marked complete from RCU's perspective,
+	  after which RCU's behavior becomes more relaxed.
+	  Userspace can also mark the boot as completed sooner than this default
+	  by writing the time in milliseconds, say once userspace considers
+	  the system as booted, to: /sys/module/rcupdate/parameters/rcu_boot_end_delay.
+	  Or even just writing a value of 0 to this sysfs node.
+
+	  The actual delay for RCU's view of the system to be marked as booted can be
+	  higher than this value if the kernel takes a long time to initialize but it
+	  will never be smaller than this value.
+
+	  Accept the default if unsure.
+
 config RCU_EXP_KTHREAD
 	bool "Perform RCU expedited work in a real-time kthread"
 	depends on RCU_BOOST && RCU_EXPERT
diff --git a/kernel/rcu/update.c b/kernel/rcu/update.c
index 19bf6fa3ee6a..93138c92136e 100644
--- a/kernel/rcu/update.c
+++ b/kernel/rcu/update.c
@@ -224,18 +224,84 @@ void rcu_unexpedite_gp(void)
 }
 EXPORT_SYMBOL_GPL(rcu_unexpedite_gp);
 
+/*
+ * Minimum time in milliseconds until RCU can consider in-kernel boot as
+ * completed. This can also be tuned at runtime to end the boot earlier, by
+ * userspace init code writing the time in milliseconds (even 0) to:
+ * /sys/module/rcupdate/parameters/rcu_boot_end_delay
+ */
+static int rcu_boot_end_delay = CONFIG_RCU_BOOT_END_DELAY;
+
 static bool rcu_boot_ended __read_mostly;
+static bool rcu_boot_end_called __read_mostly;
+static DEFINE_MUTEX(rcu_boot_end_lock);
+
+static int param_set_rcu_boot_end(const char *val, const struct kernel_param *kp)
+{
+	uint end_ms;
+	int ret = kstrtouint(val, 0, &end_ms);
+
+	if (ret)
+		return ret;
+	WRITE_ONCE(*(uint *)kp->arg, end_ms);
+
+	/*
+	 * rcu_end_inkernel_boot() should be called at least once during init
+	 * before we can allow param changes to end the boot.
+	 */
+	mutex_lock(&rcu_boot_end_lock);
+	rcu_boot_end_delay = end_ms;
+	if (!rcu_boot_ended && rcu_boot_end_called) {
+		mutex_unlock(&rcu_boot_end_lock);
+		rcu_end_inkernel_boot();
+	}
+	mutex_unlock(&rcu_boot_end_lock);
+	return ret;
+}
+
+static const struct kernel_param_ops rcu_boot_end_ops = {
+	.set = param_set_rcu_boot_end,
+	.get = param_get_uint,
+};
+module_param_cb(rcu_boot_end_delay, &rcu_boot_end_ops, &rcu_boot_end_delay, 0644);
 
 /*
- * Inform RCU of the end of the in-kernel boot sequence.
+ * Inform RCU of the end of the in-kernel boot sequence. The boot sequence will
+ * not be marked ended until at least rcu_boot_end_delay milliseconds have passed.
  */
+void rcu_end_inkernel_boot(void);
+static void rcu_boot_end_work_fn(struct work_struct *work)
+{
+	rcu_end_inkernel_boot();
+}
+static DECLARE_DELAYED_WORK(rcu_boot_end_work, rcu_boot_end_work_fn);
+
 void rcu_end_inkernel_boot(void)
 {
+	mutex_lock(&rcu_boot_end_lock);
+	rcu_boot_end_called = true;
+
+	if (rcu_boot_ended)
+		return;
+
+	if (rcu_boot_end_delay) {
+		u64 boot_ms = div_u64(ktime_get_boot_fast_ns(), 1000000UL);
+
+		if (boot_ms < rcu_boot_end_delay) {
+			schedule_delayed_work(&rcu_boot_end_work,
+					      rcu_boot_end_delay - boot_ms);
+			mutex_unlock(&rcu_boot_end_lock);
+			return;
+		}
+	}
+
+	cancel_delayed_work(&rcu_boot_end_work);
 	rcu_unexpedite_gp();
 	rcu_async_relax();
 	if (rcu_normal_after_boot)
 		WRITE_ONCE(rcu_normal, 1);
 	rcu_boot_ended = true;
+	mutex_unlock(&rcu_boot_end_lock);
 }
 
 /*