Message ID | 20211007095129.22037-1-andriy.shevchenko@linux.intel.com (mailing list archive) |
---|---|
Series | kernel.h further split |
On Thu, Oct 07, 2021 at 12:51:25PM +0300, Andy Shevchenko wrote:
> The kernel.h is a set of something which is not related to each other
> and often used in non-crossed compilation units, especially when drivers
> need only one or two macro definitions from it.
>
> Here is the split of container_of(). The goals are the following:
> - untwist the dependency hell a bit
> - drop kernel.h inclusion where it's only used for container_of()
> - speed up C preprocessing.
>
> People, like Greg KH and Miguel Ojeda, were asking about the latter.
> Read below the methodology and test setup with outcome numbers.
>
> The methodology
> ===============
> The question here is how to measure in the more or less clean way
> the C preprocessing time when building a project like Linux kernel.
> To answer it, let's look around and see what tools do we have that
> may help. Aha, here is ccache tool that seems quite plausible to
> be used. Its core idea is to preprocess C file, count hash (MD4)
> and compare to ones that are in the cache. If found, return the
> object file, avoiding compilation stage.
>
> Taking into account the property of the ccache, configure and use
> it in the below steps:
>
> 1. Configure kernel with allyesconfig
>
> 2. Make it with `make` to be sure that the cache is filled with
>    the latest data. I.o.w. warm up the cache.
>
> 3. Run `make -s` (silent mode to reduce the influence of
>    the unrelated things, like console output) 10 times and
>    measure 'real' time spent.
>
> 4. Repeat 1-3 for each patch or patch set to get data sets before
>    and after.
>
> When we get the raw data, calculating median will show us the number.
> Comparing them before and after we will see the difference.
>
> The setup
> =========
> I have used the Intel x86_64 server platform (see partial output of
> `lscpu` below):
>
> $ lscpu
> Architecture:            x86_64
>   CPU op-mode(s):        32-bit, 64-bit
>   Address sizes:         46 bits physical, 48 bits virtual
>   Byte Order:            Little Endian
> CPU(s):                  88
>   On-line CPU(s) list:   0-87
> Vendor ID:               GenuineIntel
>   Model name:            Intel(R) Xeon(R) CPU E5-2699 v4 @ 2.20GHz
>     CPU family:          6
>     Model:               79
>     Thread(s) per core:  2
>     Core(s) per socket:  22
>     Socket(s):           2
>     Stepping:            1
>     CPU max MHz:         3600.0000
>     CPU min MHz:         1200.0000
> ...
> Caches (sum of all):
>   L1d:                   1.4 MiB (44 instances)
>   L1i:                   1.4 MiB (44 instances)
>   L2:                    11 MiB (44 instances)
>   L3:                    110 MiB (2 instances)
> NUMA:
>   NUMA node(s):          2
>   NUMA node0 CPU(s):     0-21,44-65
>   NUMA node1 CPU(s):     22-43,66-87
> Vulnerabilities:
>   Itlb multihit:         KVM: Mitigation: Split huge pages
>   L1tf:                  Mitigation; PTE Inversion; VMX conditional cache flushes, SMT vulnerable
>   Mds:                   Mitigation; Clear CPU buffers; SMT vulnerable
>   Meltdown:              Mitigation; PTI
>   Spec store bypass:     Mitigation; Speculative Store Bypass disabled via prctl and seccomp
>   Spectre v1:            Mitigation; usercopy/swapgs barriers and __user pointer sanitization
>   Spectre v2:            Mitigation; Full generic retpoline, IBPB conditional, IBRS_FW, STIBP conditional, RSB filling
>   Tsx async abort:       Mitigation; Clear CPU buffers; SMT vulnerable
>
> With the following GCC:
>
> $ gcc --version
> gcc (Debian 10.3.0-11) 10.3.0
>
> The commands I have run during the measurement were:
>
> rm -rf $O
> make O=$O allyesconfig
> time make O=$O -s -j64 # this step has been measured
>
> The raw data and median
> =======================
> Before patch 2 (yes, I have measured the only patch 2 effect) in the series
> (the data is sorted by time):
>
> real    2m8.794s
> real    2m11.183s
> real    2m11.235s
> real    2m11.639s
> real    2m11.960s
> real    2m12.014s
> real    2m12.609s
> real    2m13.177s
> real    2m13.462s
> real    2m19.132s
>
> After patch 2 has been applied:
>
> real    2m8.536s
> real    2m8.776s
> real    2m9.071s
> real    2m9.459s
> real    2m9.531s
> real    2m9.610s
> real    2m10.356s
> real    2m10.430s
> real    2m11.117s
> real    2m11.885s
>
> Median values are:
> 131.987s before
> 129.571s after
>
> We see the steady speedup as of 1.83%.

You do know about kcbench:
https://gitlab.com/knurd42/kcbench.git

Try running that to make it such that we know how it was tested :)

thanks,

greg k-h

>
> Andy Shevchenko (4):
>   kernel.h: Drop unneeded <linux/kernel.h> inclusion from other headers
>   kernel.h: Split out container_of() and typeof_member() macros
>   lib/rhashtable: Replace kernel.h with the necessary inclusions
>   kunit: Replace kernel.h with the necessary inclusions
>
>  include/kunit/test.h         | 14 ++++++++++++--
>  include/linux/container_of.h | 37 ++++++++++++++++++++++++++++++++++++
>  include/linux/kernel.h       | 31 +-----------------------------
>  include/linux/kobject.h      |  1 +
>  include/linux/list.h         |  6 ++++--
>  include/linux/llist.h        |  4 +++-
>  include/linux/plist.h        |  5 ++++-
>  include/linux/rwsem.h        |  1 -
>  include/linux/spinlock.h     |  1 -
>  include/media/media-entity.h |  3 ++-
>  lib/radix-tree.c             |  6 +++++-
>  lib/rhashtable.c             |  7 ++++++-
>  12 files changed, 75 insertions(+), 41 deletions(-)
>  create mode 100644 include/linux/container_of.h
>
> --
> 2.33.0
>
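
The median and speedup figures quoted above can be re-derived from the raw `real` values. The following is only a minimal sketch of that arithmetic, not the script actually used for the posting; the helper name `parse_real` is hypothetical, and the speedup is assumed to be taken relative to the "before" median.

```python
import statistics


def parse_real(value: str) -> float:
    """Convert a `time` value such as '2m8.794s' into seconds (hypothetical helper)."""
    minutes, seconds = value.rstrip("s").split("m")
    return int(minutes) * 60 + float(seconds)


# Raw 'real' values copied from the posting (already sorted by time).
before = ["2m8.794s", "2m11.183s", "2m11.235s", "2m11.639s", "2m11.960s",
          "2m12.014s", "2m12.609s", "2m13.177s", "2m13.462s", "2m19.132s"]
after = ["2m8.536s", "2m8.776s", "2m9.071s", "2m9.459s", "2m9.531s",
         "2m9.610s", "2m10.356s", "2m10.430s", "2m11.117s", "2m11.885s"]

median_before = statistics.median(parse_real(v) for v in before)
median_after = statistics.median(parse_real(v) for v in after)

# Assumption: the speedup is expressed relative to the "before" median.
speedup_pct = (median_before - median_after) / median_before * 100

print(f"median before: {median_before:.3f}s")
print(f"median after:  {median_after:.3f}s")
print(f"speedup:       {speedup_pct:.2f}%")
```

With ten samples per set, the median is the mean of the fifth and sixth sorted values, which is where the fractional figures in the posting come from.
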
On Thu, Oct 7, 2021 at 1:34 PM Greg Kroah-Hartman <gregkh@linuxfoundation.org> wrote:
> On Thu, Oct 07, 2021 at 12:51:25PM +0300, Andy Shevchenko wrote:

...

> You do know about kcbench:
> https://gitlab.com/knurd42/kcbench.git
>
> Try running that to make it such that we know how it was tested :)

I'll try it.

Meanwhile, Thorsten, can you have a look at my approach and tell if it
makes sense?
On Thu, Oct 07, 2021 at 02:51:15PM +0300, Andy Shevchenko wrote:
> On Thu, Oct 7, 2021 at 1:34 PM Greg Kroah-Hartman
> <gregkh@linuxfoundation.org> wrote:

...

> > You do know about kcbench:
> > https://gitlab.com/knurd42/kcbench.git
> >
> > Try running that to make it such that we know how it was tested :)
>
> I'll try it.
>
> Meanwhile, Thorsten, can you have a look at my approach and tell if it
> makes sense?

No, do not use ccache when trying to benchmark the speed of kernel
builds, that tests the speed of your disk subsystem...

thanks,

greg k-h
On Thu, Oct 07, 2021 at 03:59:08PM +0200, Greg Kroah-Hartman wrote:
> On Thu, Oct 07, 2021 at 02:51:15PM +0300, Andy Shevchenko wrote:
> > On Thu, Oct 7, 2021 at 1:34 PM Greg Kroah-Hartman
> > <gregkh@linuxfoundation.org> wrote:

...

> > Meanwhile, Thorsten, can you have a look at my approach and tell if it
> > makes sense?
>
> No, do not use ccache when trying to benchmark the speed of kernel
> builds, that tests the speed of your disk subsystem...

First rule of the measurement is to be sure WHAT we are measuring.
And I have pretty much explained WHAT and HOW. On the other hand,
kcbench can't answer the question about C preprocessing speed
without the help of ccache or something similar.

Measuring the complete build is exactly not what we want because of
O(compilation) vs. o(C preprocessing), meaning that any fluctuation
in the former makes it silly to measure anything from the latter.

You see, my theory is proved by practical experiment:

$ kcbench -i 3 -j 64 -o $O -s $PWD --no-download -m
Processor:           Intel(R) Xeon(R) CPU E5-2699 v4 @ 2.20GHz [88 CPUs]
Cpufreq; Memory:     powersave [intel_pstate]; 128823 MiB
Linux running:       5.6.0-2-amd64 [x86_64]
Compiler:            gcc (Debian 10.3.0-11) 10.3.0
Linux compiled:      5.15.0-rc4
Config; Environment: allmodconfig; CCACHE_DISABLE="1"
Build command:       make vmlinux modules
Filling caches:      This might take a while... Done
Run 1 (-j 64):       464.07 seconds / 7.76 kernels/hour [P:6001%]
Run 2 (-j 64):       464.64 seconds / 7.75 kernels/hour [P:6000%]
Run 3 (-j 64):       486.41 seconds / 7.40 kernels/hour [P:5727%]

$ kcbench -i 3 -j 64 -o $O -s $PWD --no-download -m
Processor:           Intel(R) Xeon(R) CPU E5-2699 v4 @ 2.20GHz [88 CPUs]
Cpufreq; Memory:     powersave [intel_pstate]; 128823 MiB
Linux running:       5.6.0-2-amd64 [x86_64]
Compiler:            gcc (Debian 10.3.0-11) 10.3.0
Linux compiled:      5.15.0-rc4
Config; Environment: allmodconfig; CCACHE_DISABLE="1"
Build command:       make vmlinux modules
Filling caches:      This might take a while... Done
Run 1 (-j 64):       462.32 seconds / 7.79 kernels/hour [P:6009%]
Run 2 (-j 64):       462.33 seconds / 7.79 kernels/hour [P:6006%]
Run 3 (-j 64):       465.45 seconds / 7.73 kernels/hour [P:5999%]

In [41]: numpy.median(y1)
Out[41]: 464.64

In [42]: numpy.median(y2)
Out[42]: 462.33

Speedup: +0.5%
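
The argument that fluctuation in the full build hides the preprocessing gain can be checked against the six kcbench runs themselves. This is a rough sketch using only the standard library instead of numpy; `y1` and `y2` are assumed to hold the run times of the first and second kcbench invocation, in that order, and three runs per set is of course a very small sample.

```python
import statistics

# Run times in seconds, copied from the two kcbench invocations above.
y1 = [464.07, 464.64, 486.41]  # first invocation
y2 = [462.32, 462.33, 465.45]  # second invocation

m1 = statistics.median(y1)
m2 = statistics.median(y2)
delta = m1 - m2
speedup_pct = delta / m1 * 100

# Run-to-run spread (max - min) inside each set, as a crude noise estimate.
spread1 = max(y1) - min(y1)
spread2 = max(y2) - min(y2)

print(f"medians: {m1:.2f}s vs {m2:.2f}s, delta {delta:.2f}s ({speedup_pct:.1f}%)")
print(f"spread within set 1: {spread1:.2f}s, within set 2: {spread2:.2f}s")
```

The difference between the medians turns out to be smaller than the spread inside either set, which is the kind of fluctuation the O() vs. o() remark is about.
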
On Thu, Oct 07, 2021 at 05:47:31PM +0300, Andy Shevchenko wrote:

...

> In [41]: numpy.median(y1)
> Out[41]: 464.64
>
> In [42]: numpy.median(y2)
> Out[42]: 462.33
>
> Speedup: +0.5%

Good, you measured what actually matters here, the real compilation of
the code, not just the pre-processing of it.

thanks,

greg k-h
(sorry, sending it a second time with a different mail client, as vger
rejected my earlier mail with the "Content-Policy reject msg: Wrong MIME
labeling on 8-bit character texts." – and as of now I'm unable to figure
out what's wrong :-/ )

On Thu, 7 Oct 2021 14:51:15 +0300
Andy Shevchenko <andy.shevchenko@gmail.com> wrote:
> On Thu, Oct 7, 2021 at 1:34 PM Greg Kroah-Hartman
> <gregkh@linuxfoundation.org> wrote:
> > On Thu, Oct 07, 2021 at 12:51:25PM +0300, Andy Shevchenko wrote:

...

> > > The commands I have run during the measurement were:
> > >
> > > rm -rf $O
> > > make O=$O allyesconfig
> > > time make O=$O -s -j64 # this step has been measured

BTW, what kcbench does in the end is not that different, but it only
builds the config once and then uses it for all further testing.

...

> > You do know about kcbench:
> > https://gitlab.com/knurd42/kcbench.git
> >
> > Try running that to make it such that we know how it was tested :)
>
> I'll try it.
>
> Meanwhile, Thorsten, can you have a look at my approach and tell if it
> makes sense?

I'm not the right person to ask here, I don't know enough about the
inner workings of ccache and C preprocessing. Reminder: I'm not a real
kernel/C developer, but more kind of a parasite that lives on the
fringes of kernel development. ;-) Kcbench in fact originated as a
benchmark for the computer magazine I used to work for – where
I also did quite a few benchmarks. But that knowledge might be helpful
here:

The measurements before and after patch 2 was applied get slower over
time. That is a hint that something is interfering. Is the disk filling
up and making the fs do more work? Or is the machine getting too hot? It
IMHO would be worth investigating and ruling out, as the differences
you are looking for are likely quite small.

Also: the last run of the first measurement cycle is off by quite a
bit, so I wouldn't even include the result, as there likely was something
that disturbed the benchmark.

And I might be missing something, but why were you using "-j 64" on a
machine with 44 cores/88 threads? I wonder if that might lead to
interesting effects due to SMT (some cores will run two threads, others
only one). Using either "-j 44" or "-j 88" might be better. But I
suggest you run kcbench once without specifying "-j", as that will
check which setting is the fastest on this system – and then use that
for all further tests.

HTH, Ciao, Thorsten
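
The idea of first finding the fastest job count and then reusing it for all further runs could be sketched roughly as below. This is not how kcbench is implemented; `median_wall_time`, `fastest_job_count` and the stand-in command are hypothetical names for illustration, and a real measurement would have to start each run from a clean build tree.

```python
import statistics
import subprocess
import sys
import time


def median_wall_time(cmd, runs=3):
    """Run cmd `runs` times and return the median wall-clock time in seconds."""
    samples = []
    for _ in range(runs):
        start = time.perf_counter()
        subprocess.run(cmd, check=True, stdout=subprocess.DEVNULL)
        samples.append(time.perf_counter() - start)
    return statistics.median(samples)


def fastest_job_count(make_cmd, candidates, runs=3):
    """Time make_cmd(jobs) for every candidate; return (best_jobs, timings)."""
    timings = {jobs: median_wall_time(make_cmd(jobs), runs) for jobs in candidates}
    best = min(timings, key=timings.get)
    return best, timings


if __name__ == "__main__":
    # Stand-in command so the sketch runs anywhere. For a kernel build this
    # would be something like ["make", "-s", f"-j{jobs}"] (an assumption).
    def stand_in(jobs):
        return [sys.executable, "-c", "pass"]

    best, timings = fastest_job_count(stand_in, [44, 64, 88])
    print(best, timings)
```

Whichever value wins, the point is to keep it fixed across the "before" and "after" runs so the job count does not become another variable.
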
On Fri, Oct 08, 2021 at 11:37:58AM +0200, Thorsten Leemhuis wrote:
> On Thu, 7 Oct 2021 14:51:15 +0300
> Andy Shevchenko <andy.shevchenko@gmail.com> wrote:
> > On Thu, Oct 7, 2021 at 1:34 PM Greg Kroah-Hartman
> > <gregkh@linuxfoundation.org> wrote:
> > > On Thu, Oct 07, 2021 at 12:51:25PM +0300, Andy Shevchenko wrote:

...

> > > > The commands I have run during the measurement were:
> > > >
> > > > rm -rf $O
> > > > make O=$O allyesconfig
> > > > time make O=$O -s -j64 # this step has been measured
>
> BTW, what kcbench does in the end is not that different, but it only
> builds the config once and then uses it for all further testing.

Since I measure only the third operation, recreating the configuration
file shouldn't affect the result.

...

> The measurements before and after patch 2 was applied get slower over
> time. That is a hint that something is interfering. Is the disk filling
> up and making the fs do more work? Or is the machine getting too hot? It
> IMHO would be worth investigating and ruling out, as the differences
> you are looking for are likely quite small.

I tried to explain, above and in my replies, why my methodology is closer
to what we need to measure. TL;DR: mathematically the O() shadows
o() and, as we know, the CPU and disk usage during compilation is huge
in comparison to the C preprocessing.

I'm not sure what you are referring to by "slower over time" since I
explicitly said that I have _sorted_ the data. Nothing should be done
here, I believe.

> Also: the last run of the first measurement cycle is off by quite a
> bit, so I wouldn't even include the result, as there likely was something
> that disturbed the benchmark.

I believe you missed the very same remark, i.e. that the data is sorted.

> And I might be missing something, but why were you using "-j 64" on a
> machine with 44 cores/88 threads?

Because that machine has more processes being run. And I would like to
minimize fluctuation of the CPU scheduling when some process requires
a resource to perform little work.

> I wonder if that might lead to
> interesting effects due to SMT (some cores will run two threads, others
> only one). Using either "-j 44" or "-j 88" might be better.

How -j64 can be better? Nothing will guarantee that any of the cores will
be half-loaded. But -j88 is worse because any process that wakes up and
requires a resource may affect the measurements.

> But I
> suggest you run kcbench once without specifying "-j", as that will
> check which setting is the fastest on this system – and then use that
> for all further tests.

Next time I will try this approach, thanks for your reply and insights!
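
Two points from this exchange can be illustrated with the "before" data set: a list that was sorted on purpose always looks like it is "getting slower", and the median barely moves when the slowest run is dropped, while the mean moves a lot more. A small sketch, using the values from the first posting converted to seconds; the variable names are only for illustration.

```python
import statistics

# "Before" runs in seconds, as posted (sorted by time, not in run order).
before_sorted = [128.794, 131.183, 131.235, 131.639, 131.960,
                 132.014, 132.609, 133.177, 133.462, 139.132]

# A sorted list is monotonically increasing by construction, so it carries
# no information about whether later runs were slower than earlier ones.
looks_like_slowdown = before_sorted == sorted(before_sorted)

# Effect of dropping the slowest run (2m19.132s) on median and mean.
trimmed = before_sorted[:-1]
median_shift = statistics.median(before_sorted) - statistics.median(trimmed)
mean_shift = statistics.mean(before_sorted) - statistics.mean(trimmed)

print(f"sorted input: {looks_like_slowdown}")
print(f"median shift when dropping the slowest run: {median_shift:.3f}s")
print(f"mean shift when dropping the slowest run:   {mean_shift:.3f}s")
```

To judge whether runs really drift over time, the samples would have to be kept in the order they were measured, which the posted data does not preserve.
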