Message ID | 20231212102857.803984-3-haakon.bugge@oracle.com (mailing list archive) |
---|---|
State | Rejected |
Delegated to: | Paul Moore |
Headers | show |
Series | None | expand |
On Tue, Dec 12, 2023 at 5:29 AM Håkon Bugge <haakon.bugge@oracle.com> wrote: > > For the most time-consuming function, when running a syscall benchmark > with STIG compliant audit rules: > > Overhead Command Shared Object Symbol > ......... ............ ................. ........................ > > 27.62% syscall_lat [kernel.kallsyms] [k] __audit_filter_op > > we apply codegen optimizations, which speeds up the syscall > performance by around 17% on an Intel Cascade Lake system. > > We run "perf stat -d -r 5 ./syscall_lat", where syscall_lat is a C > application that measures average syscall latency from getpid() > running 100 million rounds. > > Between each perf run, we reboot the system and waits until the last > minute load is less than 1.0. > > We boot the kernel, v6.6-rc4, with "mitigations=off", in order to > amplify the changes in the audit system. > > Let the base kernel be v6.6-rc4 with booted with "audit=1" and > "mitigations=off" and with the commit "audit: Vary struct audit_entry > alignment" on an Intel Cascade Lake system. The following three > metrics are reported, nanoseconds per syscall, L1D misses per syscall, > and finally Intructions Per Cycle, ipc. > > Base vs. base + this commit gives: > > ns per call: > min avg max pstdev > - 203 203 209 0.954149 > + 173 173 178 0.884534 > > L1d misses per syscall: > min avg max pstdev > - 0.012 0.103 0.817 0.238352 > + 0.010 0.209 1.235 0.399416 > > ipc: > min avg max pstdev > - 2.320 2.329 2.330 0.003000 > + 2.430 2.436 2.440 0.004899 > > Signed-off-by: Håkon Bugge <haakon.bugge@oracle.com> > --- > kernel/auditsc.c | 2 ++ > 1 file changed, 2 insertions(+) > > diff --git a/kernel/auditsc.c b/kernel/auditsc.c > index 6f0d6fb6523fa..84d0dfe75a4ac 100644 > --- a/kernel/auditsc.c > +++ b/kernel/auditsc.c > @@ -822,6 +822,7 @@ static int audit_in_mask(const struct audit_krule *rule, unsigned long val) > * parameter can be NULL, but all others must be specified. > * Returns 1/true if the filter finds a match, 0/false if none are found. > */ > +#pragma GCC optimize("unswitch-loops", "align-loops=16", "align-jumps=16") The kernel doesn't really make use of #pragma optimization statements like this, at least not in any of the core areas, and I'm not interested in being the first to do so. I appreciate the time and effort that you have spent profiling the audit subsystem, but this isn't a patch I can accept at this point in time, I'm sorry. > static int __audit_filter_op(struct task_struct *tsk, > struct audit_context *ctx, > struct list_head *list, > @@ -841,6 +842,7 @@ static int __audit_filter_op(struct task_struct *tsk, > } > return 0; > } > +#pragma GCC reset_options > > /** > * audit_filter_uring - apply filters to an io_uring operation > -- > 2.39.3
> On 14 Dec 2023, at 00:45, Paul Moore <paul@paul-moore.com> wrote: > > On Tue, Dec 12, 2023 at 5:29 AM Håkon Bugge <haakon.bugge@oracle.com> wrote: >> >> For the most time-consuming function, when running a syscall benchmark >> with STIG compliant audit rules: >> >> Overhead Command Shared Object Symbol >> ......... ............ ................. ........................ >> >> 27.62% syscall_lat [kernel.kallsyms] [k] __audit_filter_op >> >> we apply codegen optimizations, which speeds up the syscall >> performance by around 17% on an Intel Cascade Lake system. >> >> We run "perf stat -d -r 5 ./syscall_lat", where syscall_lat is a C >> application that measures average syscall latency from getpid() >> running 100 million rounds. >> >> Between each perf run, we reboot the system and waits until the last >> minute load is less than 1.0. >> >> We boot the kernel, v6.6-rc4, with "mitigations=off", in order to >> amplify the changes in the audit system. >> >> Let the base kernel be v6.6-rc4 with booted with "audit=1" and >> "mitigations=off" and with the commit "audit: Vary struct audit_entry >> alignment" on an Intel Cascade Lake system. The following three >> metrics are reported, nanoseconds per syscall, L1D misses per syscall, >> and finally Intructions Per Cycle, ipc. >> >> Base vs. base + this commit gives: >> >> ns per call: >> min avg max pstdev >> - 203 203 209 0.954149 >> + 173 173 178 0.884534 >> >> L1d misses per syscall: >> min avg max pstdev >> - 0.012 0.103 0.817 0.238352 >> + 0.010 0.209 1.235 0.399416 >> >> ipc: >> min avg max pstdev >> - 2.320 2.329 2.330 0.003000 >> + 2.430 2.436 2.440 0.004899 >> >> Signed-off-by: Håkon Bugge <haakon.bugge@oracle.com> >> --- >> kernel/auditsc.c | 2 ++ >> 1 file changed, 2 insertions(+) >> >> diff --git a/kernel/auditsc.c b/kernel/auditsc.c >> index 6f0d6fb6523fa..84d0dfe75a4ac 100644 >> --- a/kernel/auditsc.c >> +++ b/kernel/auditsc.c >> @@ -822,6 +822,7 @@ static int audit_in_mask(const struct audit_krule *rule, unsigned long val) >> * parameter can be NULL, but all others must be specified. >> * Returns 1/true if the filter finds a match, 0/false if none are found. >> */ >> +#pragma GCC optimize("unswitch-loops", "align-loops=16", "align-jumps=16") > > The kernel doesn't really make use of #pragma optimization statements > like this, at least not in any of the core areas, and I'm not > interested in being the first to do so. I appreciate the time and > effort that you have spent profiling the audit subsystem, but this > isn't a patch I can accept at this point in time, I'm sorry. Fair enough. Will a function attribute aka: __attribute__((optimize("foo=bar"))) be acceptable for you? Thxs, Håkon > >> static int __audit_filter_op(struct task_struct *tsk, >> struct audit_context *ctx, >> struct list_head *list, >> @@ -841,6 +842,7 @@ static int __audit_filter_op(struct task_struct *tsk, >> } >> return 0; >> } >> +#pragma GCC reset_options >> >> /** >> * audit_filter_uring - apply filters to an io_uring operation >> -- >> 2.39.3 > > -- > paul-moore.com >
On Sat, Dec 16, 2023 at 11:28 AM Haakon Bugge <haakon.bugge@oracle.com> wrote: > > On 14 Dec 2023, at 00:45, Paul Moore <paul@paul-moore.com> wrote: > > On Tue, Dec 12, 2023 at 5:29 AM Håkon Bugge <haakon.bugge@oracle.com> wrote: > >> > >> For the most time-consuming function, when running a syscall benchmark > >> with STIG compliant audit rules: > >> > >> Overhead Command Shared Object Symbol > >> ......... ............ ................. ........................ > >> > >> 27.62% syscall_lat [kernel.kallsyms] [k] __audit_filter_op > >> > >> we apply codegen optimizations, which speeds up the syscall > >> performance by around 17% on an Intel Cascade Lake system. > >> > >> We run "perf stat -d -r 5 ./syscall_lat", where syscall_lat is a C > >> application that measures average syscall latency from getpid() > >> running 100 million rounds. > >> > >> Between each perf run, we reboot the system and waits until the last > >> minute load is less than 1.0. > >> > >> We boot the kernel, v6.6-rc4, with "mitigations=off", in order to > >> amplify the changes in the audit system. > >> > >> Let the base kernel be v6.6-rc4 with booted with "audit=1" and > >> "mitigations=off" and with the commit "audit: Vary struct audit_entry > >> alignment" on an Intel Cascade Lake system. The following three > >> metrics are reported, nanoseconds per syscall, L1D misses per syscall, > >> and finally Intructions Per Cycle, ipc. > >> > >> Base vs. base + this commit gives: > >> > >> ns per call: > >> min avg max pstdev > >> - 203 203 209 0.954149 > >> + 173 173 178 0.884534 > >> > >> L1d misses per syscall: > >> min avg max pstdev > >> - 0.012 0.103 0.817 0.238352 > >> + 0.010 0.209 1.235 0.399416 > >> > >> ipc: > >> min avg max pstdev > >> - 2.320 2.329 2.330 0.003000 > >> + 2.430 2.436 2.440 0.004899 > >> > >> Signed-off-by: Håkon Bugge <haakon.bugge@oracle.com> > >> --- > >> kernel/auditsc.c | 2 ++ > >> 1 file changed, 2 insertions(+) > >> > >> diff --git a/kernel/auditsc.c b/kernel/auditsc.c > >> index 6f0d6fb6523fa..84d0dfe75a4ac 100644 > >> --- a/kernel/auditsc.c > >> +++ b/kernel/auditsc.c > >> @@ -822,6 +822,7 @@ static int audit_in_mask(const struct audit_krule *rule, unsigned long val) > >> * parameter can be NULL, but all others must be specified. > >> * Returns 1/true if the filter finds a match, 0/false if none are found. > >> */ > >> +#pragma GCC optimize("unswitch-loops", "align-loops=16", "align-jumps=16") > > > > The kernel doesn't really make use of #pragma optimization statements > > like this, at least not in any of the core areas, and I'm not > > interested in being the first to do so. I appreciate the time and > > effort that you have spent profiling the audit subsystem, but this > > isn't a patch I can accept at this point in time, I'm sorry. > > Fair enough. Will a function attribute aka: > > __attribute__((optimize("foo=bar"))) > > be acceptable for you? Unless you can show me widespread acceptance of these types of optimizations in core kernel code, my answer is no. I'm sorry.
> On 18 Dec 2023, at 23:09, Paul Moore <paul@paul-moore.com> wrote: > > On Sat, Dec 16, 2023 at 11:28 AM Haakon Bugge <haakon.bugge@oracle.com> wrote: >>> On 14 Dec 2023, at 00:45, Paul Moore <paul@paul-moore.com> wrote: >>> On Tue, Dec 12, 2023 at 5:29 AM Håkon Bugge <haakon.bugge@oracle.com> wrote: >>>> >>>> For the most time-consuming function, when running a syscall benchmark >>>> with STIG compliant audit rules: >>>> >>>> Overhead Command Shared Object Symbol >>>> ......... ............ ................. ........................ >>>> >>>> 27.62% syscall_lat [kernel.kallsyms] [k] __audit_filter_op >>>> >>>> we apply codegen optimizations, which speeds up the syscall >>>> performance by around 17% on an Intel Cascade Lake system. >>>> >>>> We run "perf stat -d -r 5 ./syscall_lat", where syscall_lat is a C >>>> application that measures average syscall latency from getpid() >>>> running 100 million rounds. >>>> >>>> Between each perf run, we reboot the system and waits until the last >>>> minute load is less than 1.0. >>>> >>>> We boot the kernel, v6.6-rc4, with "mitigations=off", in order to >>>> amplify the changes in the audit system. >>>> >>>> Let the base kernel be v6.6-rc4 with booted with "audit=1" and >>>> "mitigations=off" and with the commit "audit: Vary struct audit_entry >>>> alignment" on an Intel Cascade Lake system. The following three >>>> metrics are reported, nanoseconds per syscall, L1D misses per syscall, >>>> and finally Intructions Per Cycle, ipc. >>>> >>>> Base vs. base + this commit gives: >>>> >>>> ns per call: >>>> min avg max pstdev >>>> - 203 203 209 0.954149 >>>> + 173 173 178 0.884534 >>>> >>>> L1d misses per syscall: >>>> min avg max pstdev >>>> - 0.012 0.103 0.817 0.238352 >>>> + 0.010 0.209 1.235 0.399416 >>>> >>>> ipc: >>>> min avg max pstdev >>>> - 2.320 2.329 2.330 0.003000 >>>> + 2.430 2.436 2.440 0.004899 >>>> >>>> Signed-off-by: Håkon Bugge <haakon.bugge@oracle.com> >>>> --- >>>> kernel/auditsc.c | 2 ++ >>>> 1 file changed, 2 insertions(+) >>>> >>>> diff --git a/kernel/auditsc.c b/kernel/auditsc.c >>>> index 6f0d6fb6523fa..84d0dfe75a4ac 100644 >>>> --- a/kernel/auditsc.c >>>> +++ b/kernel/auditsc.c >>>> @@ -822,6 +822,7 @@ static int audit_in_mask(const struct audit_krule *rule, unsigned long val) >>>> * parameter can be NULL, but all others must be specified. >>>> * Returns 1/true if the filter finds a match, 0/false if none are found. >>>> */ >>>> +#pragma GCC optimize("unswitch-loops", "align-loops=16", "align-jumps=16") >>> >>> The kernel doesn't really make use of #pragma optimization statements >>> like this, at least not in any of the core areas, and I'm not >>> interested in being the first to do so. I appreciate the time and >>> effort that you have spent profiling the audit subsystem, but this >>> isn't a patch I can accept at this point in time, I'm sorry. >> >> Fair enough. Will a function attribute aka: >> >> __attribute__((optimize("foo=bar"))) >> >> be acceptable for you? > > Unless you can show me widespread acceptance of these types of > optimizations in core kernel code, my answer is no. I'm sorry. Hi Paul, Well, I guess we both know that I cannot show you any "widespread acceptance" of the sort. Here is what I see: arch/arm/lib/xor-neon.c:#pragma GCC optimize "tree-vectorize" (not what you describe as core kernel) lib/zstd/common/compiler.h:# define DONT_VECTORIZE __attribute__((optimize("no-tree-vectorize"))) (core, but not widespread) Then of course, there are more examples of specific optimizations or disable thereof in Makefiles: kernel/debug/kdb/Makefile:clean-files := gen-kdb_cmds.c kernel/kcsan/Makefile:CFLAGS_core.o := $(call cc-option,-fno-conserve-stack) \ kernel/kcsan/Makefile: -fno-stack-protector -DDISABLE_BRANCH_PROFILING kernel/kcsan/Makefile:CFLAGS_kcsan_test.o := $(CFLAGS_KCSAN) -fno-omit-frame-pointer kernel/Makefile:CFLAGS_kcov.o := $(call cc-option, -fno-conserve-stack) -fno-stack-protector kernel/Makefile:clean-files := kheaders_data.tar.xz kheaders.md5 kernel/sched/Makefile:# According to Alan Modra <alan@linuxcare.com.au>, the -fno-omit-frame-pointer is kernel/sched/Makefile:CFLAGS_core.o := $(PROFILING) -fno-omit-frame-pointer kernel/bpf/Makefile:cflags-nogcse-$(CONFIG_X86)$(CONFIG_CC_IS_GCC) := -fno-gcse kernel/entry/Makefile:CFLAGS_REMOVE_common.o = -fstack-protector -fstack-protector-strong kernel/entry/Makefile:CFLAGS_common.o += -fno-stack-protector kernel/rcu/Makefile:KBUILD_CFLAGS += -g -fno-omit-frame-pointer My view is: The coding of time critical functions, that access structs, that are allocated by k{m,z}alloc and family, in a short proximity in time, is, with all respect, not a good programming practice, as depicted by the example in this patch series. I have to accept, that you will not accept this patch series or a v2 thereof. But, this leaves (210 - 173) = 37 nanosecond unnecessary overhead of a syscall on the floor. It should be unnecessary to quote the number of Cascade Lake systems being deployed, how many of them that run with audit enabled and STIG component rules, to point of the significance of those nanoseconds. I would be very appreciative if you could come up with a better and more acceptable solution, in order to shave off these nanoseconds. Thxs, Håkon
On Thu, Dec 21, 2023 at 2:05 PM Haakon Bugge <haakon.bugge@oracle.com> wrote: > > On 18 Dec 2023, at 23:09, Paul Moore <paul@paul-moore.com> wrote: > > On Sat, Dec 16, 2023 at 11:28 AM Haakon Bugge <haakon.bugge@oracle.com> wrote: > >>> On 14 Dec 2023, at 00:45, Paul Moore <paul@paul-moore.com> wrote: > >>> On Tue, Dec 12, 2023 at 5:29 AM Håkon Bugge <haakon.bugge@oracle.com> wrote: > >>>> > >>>> For the most time-consuming function, when running a syscall benchmark > >>>> with STIG compliant audit rules: > >>>> > >>>> Overhead Command Shared Object Symbol > >>>> ......... ............ ................. ........................ > >>>> > >>>> 27.62% syscall_lat [kernel.kallsyms] [k] __audit_filter_op > >>>> > >>>> we apply codegen optimizations, which speeds up the syscall > >>>> performance by around 17% on an Intel Cascade Lake system. > >>>> > >>>> We run "perf stat -d -r 5 ./syscall_lat", where syscall_lat is a C > >>>> application that measures average syscall latency from getpid() > >>>> running 100 million rounds. > >>>> > >>>> Between each perf run, we reboot the system and waits until the last > >>>> minute load is less than 1.0. > >>>> > >>>> We boot the kernel, v6.6-rc4, with "mitigations=off", in order to > >>>> amplify the changes in the audit system. > >>>> > >>>> Let the base kernel be v6.6-rc4 with booted with "audit=1" and > >>>> "mitigations=off" and with the commit "audit: Vary struct audit_entry > >>>> alignment" on an Intel Cascade Lake system. The following three > >>>> metrics are reported, nanoseconds per syscall, L1D misses per syscall, > >>>> and finally Intructions Per Cycle, ipc. > >>>> > >>>> Base vs. base + this commit gives: > >>>> > >>>> ns per call: > >>>> min avg max pstdev > >>>> - 203 203 209 0.954149 > >>>> + 173 173 178 0.884534 > >>>> > >>>> L1d misses per syscall: > >>>> min avg max pstdev > >>>> - 0.012 0.103 0.817 0.238352 > >>>> + 0.010 0.209 1.235 0.399416 > >>>> > >>>> ipc: > >>>> min avg max pstdev > >>>> - 2.320 2.329 2.330 0.003000 > >>>> + 2.430 2.436 2.440 0.004899 > >>>> > >>>> Signed-off-by: Håkon Bugge <haakon.bugge@oracle.com> > >>>> --- > >>>> kernel/auditsc.c | 2 ++ > >>>> 1 file changed, 2 insertions(+) > >>>> > >>>> diff --git a/kernel/auditsc.c b/kernel/auditsc.c > >>>> index 6f0d6fb6523fa..84d0dfe75a4ac 100644 > >>>> --- a/kernel/auditsc.c > >>>> +++ b/kernel/auditsc.c > >>>> @@ -822,6 +822,7 @@ static int audit_in_mask(const struct audit_krule *rule, unsigned long val) > >>>> * parameter can be NULL, but all others must be specified. > >>>> * Returns 1/true if the filter finds a match, 0/false if none are found. > >>>> */ > >>>> +#pragma GCC optimize("unswitch-loops", "align-loops=16", "align-jumps=16") > >>> > >>> The kernel doesn't really make use of #pragma optimization statements > >>> like this, at least not in any of the core areas, and I'm not > >>> interested in being the first to do so. I appreciate the time and > >>> effort that you have spent profiling the audit subsystem, but this > >>> isn't a patch I can accept at this point in time, I'm sorry. > >> > >> Fair enough. Will a function attribute aka: > >> > >> __attribute__((optimize("foo=bar"))) > >> > >> be acceptable for you? > > > > Unless you can show me widespread acceptance of these types of > > optimizations in core kernel code, my answer is no. I'm sorry. > > Hi Paul, > > Well, I guess we both know that I cannot show you any "widespread acceptance" of the sort. Here is what I see: > > arch/arm/lib/xor-neon.c:#pragma GCC optimize "tree-vectorize" > (not what you describe as core kernel) > > lib/zstd/common/compiler.h:# define DONT_VECTORIZE __attribute__((optimize("no-tree-vectorize"))) > (core, but not widespread) > > Then of course, there are more examples of specific optimizations or disable thereof in Makefiles: > > kernel/debug/kdb/Makefile:clean-files := gen-kdb_cmds.c > kernel/kcsan/Makefile:CFLAGS_core.o := $(call cc-option,-fno-conserve-stack) \ > kernel/kcsan/Makefile: -fno-stack-protector -DDISABLE_BRANCH_PROFILING > kernel/kcsan/Makefile:CFLAGS_kcsan_test.o := $(CFLAGS_KCSAN) -fno-omit-frame-pointer > kernel/Makefile:CFLAGS_kcov.o := $(call cc-option, -fno-conserve-stack) -fno-stack-protector > kernel/Makefile:clean-files := kheaders_data.tar.xz kheaders.md5 > kernel/sched/Makefile:# According to Alan Modra <alan@linuxcare.com.au>, the -fno-omit-frame-pointer is > kernel/sched/Makefile:CFLAGS_core.o := $(PROFILING) -fno-omit-frame-pointer > kernel/bpf/Makefile:cflags-nogcse-$(CONFIG_X86)$(CONFIG_CC_IS_GCC) := -fno-gcse > kernel/entry/Makefile:CFLAGS_REMOVE_common.o = -fstack-protector -fstack-protector-strong > kernel/entry/Makefile:CFLAGS_common.o += -fno-stack-protector > kernel/rcu/Makefile:KBUILD_CFLAGS += -g -fno-omit-frame-pointer > > My view is: > > The coding of time critical functions, that access structs, that are allocated by k{m,z}alloc and family, in a short proximity in time, is, with all respect, not a good programming practice, as depicted by the example in this patch series. > > I have to accept, that you will not accept this patch series or a v2 thereof. > > But, this leaves (210 - 173) = 37 nanosecond unnecessary overhead of a syscall on the floor. > > It should be unnecessary to quote the number of Cascade Lake systems being deployed, how many of them that run with audit enabled and STIG component rules, to point of the significance of those nanoseconds. > > I would be very appreciative if you could come up with a better and more acceptable solution, in order to shave off these nanoseconds. I appreciate and understand your frustration, but sometimes the subjective quality and maintainability of the code needs to take precedence, and in my mind the tradeoff for a 37ns win is not desirable. Thank you again for your time and effort.
diff --git a/kernel/auditsc.c b/kernel/auditsc.c index 6f0d6fb6523fa..84d0dfe75a4ac 100644 --- a/kernel/auditsc.c +++ b/kernel/auditsc.c @@ -822,6 +822,7 @@ static int audit_in_mask(const struct audit_krule *rule, unsigned long val) * parameter can be NULL, but all others must be specified. * Returns 1/true if the filter finds a match, 0/false if none are found. */ +#pragma GCC optimize("unswitch-loops", "align-loops=16", "align-jumps=16") static int __audit_filter_op(struct task_struct *tsk, struct audit_context *ctx, struct list_head *list, @@ -841,6 +842,7 @@ static int __audit_filter_op(struct task_struct *tsk, } return 0; } +#pragma GCC reset_options /** * audit_filter_uring - apply filters to an io_uring operation
For the most time-consuming function, when running a syscall benchmark with STIG compliant audit rules: Overhead Command Shared Object Symbol ......... ............ ................. ........................ 27.62% syscall_lat [kernel.kallsyms] [k] __audit_filter_op we apply codegen optimizations, which speeds up the syscall performance by around 17% on an Intel Cascade Lake system. We run "perf stat -d -r 5 ./syscall_lat", where syscall_lat is a C application that measures average syscall latency from getpid() running 100 million rounds. Between each perf run, we reboot the system and waits until the last minute load is less than 1.0. We boot the kernel, v6.6-rc4, with "mitigations=off", in order to amplify the changes in the audit system. Let the base kernel be v6.6-rc4 with booted with "audit=1" and "mitigations=off" and with the commit "audit: Vary struct audit_entry alignment" on an Intel Cascade Lake system. The following three metrics are reported, nanoseconds per syscall, L1D misses per syscall, and finally Intructions Per Cycle, ipc. Base vs. base + this commit gives: ns per call: min avg max pstdev - 203 203 209 0.954149 + 173 173 178 0.884534 L1d misses per syscall: min avg max pstdev - 0.012 0.103 0.817 0.238352 + 0.010 0.209 1.235 0.399416 ipc: min avg max pstdev - 2.320 2.329 2.330 0.003000 + 2.430 2.436 2.440 0.004899 Signed-off-by: Håkon Bugge <haakon.bugge@oracle.com> --- kernel/auditsc.c | 2 ++ 1 file changed, 2 insertions(+)