Message ID | 20200430201125.532129-1-daniel.m.jordan@oracle.com (mailing list archive) |
---|---|
Series | padata: parallelize deferred page init |
On Thu, 30 Apr 2020 16:11:18 -0400 Daniel Jordan <daniel.m.jordan@oracle.com> wrote:

> Sometimes the kernel doesn't take full advantage of system memory
> bandwidth, leading to a single CPU spending excessive time in
> initialization paths where the data scales with memory size.
>
> Multithreading naturally addresses this problem, and this series is the
> first step.
>
> It extends padata, a framework that handles many parallel singlethreaded
> jobs, to handle multithreaded jobs as well by adding support for
> splitting up the work evenly, specifying a minimum amount of work that's
> appropriate for one helper thread to do, load balancing between helpers,
> and coordinating them. More documentation in patches 4 and 7.
>
> The first user is deferred struct page init, a large bottleneck in
> kernel boot--actually the largest for us and likely others too. This
> path doesn't require concurrency limits, resource control, or priority
> adjustments like future users will (vfio, hugetlb fallocate, munmap)
> because it happens during boot when the system is otherwise idle and
> waiting on page init to finish.
>
> This has been tested on a variety of x86 systems and speeds up kernel
> boot by 6% to 49% by making deferred init 63% to 91% faster.

How long is this up-to-91% in seconds? If it's 91% of a millisecond
then not impressed. If it's 91% of two weeks then better :)

Relatedly, how important is boot time on these large machines anyway?
They presumably have lengthy uptimes so boot time is relatively
unimportant?

IOW, can you please explain more fully why this patchset is valuable to
our users?
On Thu, Apr 30, 2020 at 5:31 PM Andrew Morton <akpm@linux-foundation.org> wrote:
>
> On Thu, 30 Apr 2020 16:11:18 -0400 Daniel Jordan <daniel.m.jordan@oracle.com> wrote:
>
> > Sometimes the kernel doesn't take full advantage of system memory
> > bandwidth, leading to a single CPU spending excessive time in
> > initialization paths where the data scales with memory size.
> >
> > Multithreading naturally addresses this problem, and this series is the
> > first step.
> >
> > It extends padata, a framework that handles many parallel singlethreaded
> > jobs, to handle multithreaded jobs as well by adding support for
> > splitting up the work evenly, specifying a minimum amount of work that's
> > appropriate for one helper thread to do, load balancing between helpers,
> > and coordinating them. More documentation in patches 4 and 7.
> >
> > The first user is deferred struct page init, a large bottleneck in
> > kernel boot--actually the largest for us and likely others too. This
> > path doesn't require concurrency limits, resource control, or priority
> > adjustments like future users will (vfio, hugetlb fallocate, munmap)
> > because it happens during boot when the system is otherwise idle and
> > waiting on page init to finish.
> >
> > This has been tested on a variety of x86 systems and speeds up kernel
> > boot by 6% to 49% by making deferred init 63% to 91% faster.
>
> How long is this up-to-91% in seconds? If it's 91% of a millisecond
> then not impressed. If it's 91% of two weeks then better :)
>
> Relatedly, how important is boot time on these large machines anyway?
> They presumably have lengthy uptimes so boot time is relatively
> unimportant?

Large machines do have lengthy uptimes, but they can also host a
large number of VMs, so downtime of the host adds to the downtime
of every VM in a cloud environment. Some VMs are very sensitive
to downtime: game servers, traders, etc.

>
> IOW, can you please explain more fully why this patchset is valuable to
> our users?
On Thu, Apr 30, 2020 at 02:31:31PM -0700, Andrew Morton wrote:
> On Thu, 30 Apr 2020 16:11:18 -0400 Daniel Jordan <daniel.m.jordan@oracle.com> wrote:
> > Sometimes the kernel doesn't take full advantage of system memory
> > bandwidth, leading to a single CPU spending excessive time in
> > initialization paths where the data scales with memory size.
> >
> > Multithreading naturally addresses this problem, and this series is the
> > first step.
> >
> > It extends padata, a framework that handles many parallel singlethreaded
> > jobs, to handle multithreaded jobs as well by adding support for
> > splitting up the work evenly, specifying a minimum amount of work that's
> > appropriate for one helper thread to do, load balancing between helpers,
> > and coordinating them. More documentation in patches 4 and 7.
> >
> > The first user is deferred struct page init, a large bottleneck in
> > kernel boot--actually the largest for us and likely others too. This
> > path doesn't require concurrency limits, resource control, or priority
> > adjustments like future users will (vfio, hugetlb fallocate, munmap)
> > because it happens during boot when the system is otherwise idle and
> > waiting on page init to finish.
> >
> > This has been tested on a variety of x86 systems and speeds up kernel
> > boot by 6% to 49% by making deferred init 63% to 91% faster.
>
> How long is this up-to-91% in seconds? If it's 91% of a millisecond
> then not impressed. If it's 91% of two weeks then better :)

Some test results on a system with 96 CPUs and 192GB of memory:

Without this patch series:
[ 0.487132] node 0 initialised, 23398907 pages in 292ms
[ 0.499132] node 1 initialised, 24189223 pages in 304ms
...
[ 0.629376] Run /sbin/init as init process

With this patch series:
[ 0.227868] node 0 initialised, 23398907 pages in 28ms
[ 0.230019] node 1 initialised, 24189223 pages in 28ms
...
[ 0.361069] Run /sbin/init as init process

That makes a huge difference; memory initialization is the largest
remaining component of boot time.

> Relatedly, how important is boot time on these large machines anyway?
> They presumably have lengthy uptimes so boot time is relatively
> unimportant?

Cloud systems and other virtual machines may have a huge amount of
memory but not necessarily run for a long time; on such systems, boot
time becomes critically important. Reducing boot time on cloud systems
and VMs makes the difference between "leave running to reduce latency"
and "just boot up when needed".

- Josh Triplett
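For readers who want the log excerpts above as percentages, here is a minimal sketch that computes the relative reduction from the quoted timings. The helper name `percent_reduction` and the `timings` table are illustrative only; the numbers are copied from the message above, and the "time to init" row uses the timestamps of the `Run /sbin/init` lines.

```python
def percent_reduction(before, after):
    """Return how much smaller `after` is than `before`, in percent."""
    if before <= 0:
        raise ValueError("before must be positive")
    return (before - after) / before * 100.0


# (before, after) pairs taken from the quoted boot logs.
timings = {
    "node 0 deferred init (ms)": (292, 28),
    "node 1 deferred init (ms)": (304, 28),
    "time to /sbin/init (s)": (0.629376, 0.361069),
}

for label, (before, after) in timings.items():
    print(f"{label}: {percent_reduction(before, after):.1f}% less time")
```

On these figures the per-node deferred init time drops by roughly 90%, and the time to reach `/sbin/init` drops by roughly 43%, which sits inside the ranges given in the cover letter.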
On Thu, Apr 30, 2020 at 04:11:18PM -0400, Daniel Jordan wrote:
> Sometimes the kernel doesn't take full advantage of system memory
> bandwidth, leading to a single CPU spending excessive time in
> initialization paths where the data scales with memory size.
>
> Multithreading naturally addresses this problem, and this series is the
> first step.
>
> It extends padata, a framework that handles many parallel singlethreaded
> jobs, to handle multithreaded jobs as well by adding support for
> splitting up the work evenly, specifying a minimum amount of work that's
> appropriate for one helper thread to do, load balancing between helpers,
> and coordinating them. More documentation in patches 4 and 7.
>
> The first user is deferred struct page init, a large bottleneck in
> kernel boot--actually the largest for us and likely others too. This
> path doesn't require concurrency limits, resource control, or priority
> adjustments like future users will (vfio, hugetlb fallocate, munmap)
> because it happens during boot when the system is otherwise idle and
> waiting on page init to finish.
>
> This has been tested on a variety of x86 systems and speeds up kernel
> boot by 6% to 49% by making deferred init 63% to 91% faster. Patch 6
> has detailed numbers. Test results from other systems appreciated.
>
> This series is based on v5.6 plus these three from mmotm:
>
> mm-call-touch_nmi_watchdog-on-max-order-boundaries-in-deferred-init.patch
> mm-initialize-deferred-pages-with-interrupts-enabled.patch
> mm-call-cond_resched-from-deferred_init_memmap.patch
>
> All of the above can be found in this branch:
>
> git://oss.oracle.com/git/linux-dmjordan.git padata-mt-definit-v1
> https://oss.oracle.com/git/gitweb.cgi?p=linux-dmjordan.git;a=shortlog;h=refs/heads/padata-mt-definit-v1

For the series (and the three prerequisite patches):

Tested-by: Josh Triplett <josh@joshtriplett.org>

Thank you for writing this, and thank you for working towards upstreaming it!
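The cover letter quoted above describes splitting a job evenly among helper threads while guaranteeing each helper a minimum amount of work. A rough userspace sketch of that idea follows. It is not the padata API and not code from the series: `split_job`, `run_multithreaded`, and their parameters are hypothetical names chosen for illustration, and the sketch only does a static split, without the load balancing between helpers that the cover letter also mentions.

```python
from concurrent.futures import ThreadPoolExecutor


def split_job(start, size, min_chunk, max_threads):
    """Split [start, start + size) into near-equal half-open ranges.

    Hypothetical helper: uses at most max_threads ranges, and never so many
    that a range would be smaller than min_chunk (unless the whole job is).
    Assumes min_chunk > 0 and max_threads > 0.
    """
    if size <= 0:
        return []
    # Cap the helper count by how many min_chunk-sized pieces the job holds.
    nthreads = max(1, min(max_threads, size // min_chunk))
    base, extra = divmod(size, nthreads)
    chunks = []
    pos = start
    for i in range(nthreads):
        # The first `extra` helpers take one additional unit of work.
        length = base + (1 if i < extra else 0)
        chunks.append((pos, pos + length))
        pos += length
    return chunks


def run_multithreaded(fn, start, size, min_chunk, max_threads):
    """Run fn(chunk_start, chunk_end) on each chunk and wait for all of them."""
    chunks = split_job(start, size, min_chunk, max_threads)
    if not chunks:
        return []
    with ThreadPoolExecutor(max_workers=len(chunks)) as pool:
        return list(pool.map(lambda c: fn(c[0], c[1]), chunks))


if __name__ == "__main__":
    # Each helper just reports how many units it was handed.
    print(run_multithreaded(lambda lo, hi: hi - lo, 0, 100, 10, 4))
```

The minimum chunk size is what keeps small jobs from being spread across more threads than is worthwhile: a job smaller than `min_chunk` runs on a single helper in this sketch.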
On Thu, Apr 30, 2020 at 05:40:59PM -0400, Pavel Tatashin wrote:
> On Thu, Apr 30, 2020 at 5:31 PM Andrew Morton <akpm@linux-foundation.org> wrote:
> > On Thu, 30 Apr 2020 16:11:18 -0400 Daniel Jordan <daniel.m.jordan@oracle.com> wrote:
> >
> > > Sometimes the kernel doesn't take full advantage of system memory
> > > bandwidth, leading to a single CPU spending excessive time in
> > > initialization paths where the data scales with memory size.
> > >
> > > Multithreading naturally addresses this problem, and this series is the
> > > first step.
> > >
> > > It extends padata, a framework that handles many parallel singlethreaded
> > > jobs, to handle multithreaded jobs as well by adding support for
> > > splitting up the work evenly, specifying a minimum amount of work that's
> > > appropriate for one helper thread to do, load balancing between helpers,
> > > and coordinating them. More documentation in patches 4 and 7.
> > >
> > > The first user is deferred struct page init, a large bottleneck in
> > > kernel boot--actually the largest for us and likely others too. This
> > > path doesn't require concurrency limits, resource control, or priority
> > > adjustments like future users will (vfio, hugetlb fallocate, munmap)
> > > because it happens during boot when the system is otherwise idle and
> > > waiting on page init to finish.
> > >
> > > This has been tested on a variety of x86 systems and speeds up kernel
> > > boot by 6% to 49% by making deferred init 63% to 91% faster.
> >
> > How long is this up-to-91% in seconds? If it's 91% of a millisecond
> > then not impressed. If it's 91% of two weeks then better :)

The largest system I could test had 384G per node and saved 1.5 out of 4
seconds.

> > Relatedly, how important is boot time on these large machines anyway?
> > They presumably have lengthy uptimes so boot time is relatively
> > unimportant?
>
> Large machines indeed have a lengthy uptime, but they also can host a
> large number of VMs meaning that downtime of the host increases the
> downtime of VMs in cloud environments. Some VMs might be very sensitive
> to downtime: game servers, traders, etc.
>
> > IOW, can you please explain more fully why this patchset is valuable to
> > our users?

I'll let the users speak for themselves, but my use case is similar to
Pavel's: limiting the downtime of VMs running on these large systems.
Spinning up instances as fast as possible is also desirable for our cloud
users.
On Thu, Apr 30, 2020 at 06:09:35PM -0700, Josh Triplett wrote:
> On Thu, Apr 30, 2020 at 04:11:18PM -0400, Daniel Jordan wrote:
> > Sometimes the kernel doesn't take full advantage of system memory
> > bandwidth, leading to a single CPU spending excessive time in
> > initialization paths where the data scales with memory size.
> >
> > Multithreading naturally addresses this problem, and this series is the
> > first step.
> >
> > It extends padata, a framework that handles many parallel singlethreaded
> > jobs, to handle multithreaded jobs as well by adding support for
> > splitting up the work evenly, specifying a minimum amount of work that's
> > appropriate for one helper thread to do, load balancing between helpers,
> > and coordinating them. More documentation in patches 4 and 7.
> >
> > The first user is deferred struct page init, a large bottleneck in
> > kernel boot--actually the largest for us and likely others too. This
> > path doesn't require concurrency limits, resource control, or priority
> > adjustments like future users will (vfio, hugetlb fallocate, munmap)
> > because it happens during boot when the system is otherwise idle and
> > waiting on page init to finish.
> >
> > This has been tested on a variety of x86 systems and speeds up kernel
> > boot by 6% to 49% by making deferred init 63% to 91% faster. Patch 6
> > has detailed numbers. Test results from other systems appreciated.
> >
> > This series is based on v5.6 plus these three from mmotm:
> >
> > mm-call-touch_nmi_watchdog-on-max-order-boundaries-in-deferred-init.patch
> > mm-initialize-deferred-pages-with-interrupts-enabled.patch
> > mm-call-cond_resched-from-deferred_init_memmap.patch
> >
> > All of the above can be found in this branch:
> >
> > git://oss.oracle.com/git/linux-dmjordan.git padata-mt-definit-v1
> > https://oss.oracle.com/git/gitweb.cgi?p=linux-dmjordan.git;a=shortlog;h=refs/heads/padata-mt-definit-v1
>
> For the series (and the three prerequisite patches):
>
> Tested-by: Josh Triplett <josh@joshtriplett.org>

Appreciate the runs, Josh, thanks.