[v2] kallsyms: Fix sleeping function called from invalid context when CONFIG_KALLSYMS_SELFTEST=y

Message ID 20221228014511.328-1-thunder.leizhen@huawei.com (mailing list archive)
State New, archived
Series [v2] kallsyms: Fix sleeping function called from invalid context when CONFIG_KALLSYMS_SELFTEST=y

Commit Message

Leizhen (ThunderTown) Dec. 28, 2022, 1:45 a.m. UTC
[T58] BUG: sleeping function called from invalid context at kernel/kallsyms.c:305
[T58] in_atomic(): 0, irqs_disabled(): 128, non_block: 0, pid: 58, name: kallsyms_test
[T58] preempt_count: 0, expected: 0
[T58] RCU nest depth: 0, expected: 0
[T58] no locks held by kallsyms_test/58.
[T58] irq event stamp: 18899904
[T58] hardirqs last enabled at (18899903): finish_task_switch.isra.0 (core.c:?)
[T58] hardirqs last disabled at (18899904): test_perf_kallsyms_on_each_symbol (kallsyms_selftest.c:?)
[T58] softirqs last enabled at (18899886): __do_softirq (??:?)
[T58] softirqs last disabled at (18899879): ____do_softirq (irq.c:?)
[T58] CPU: 0 PID: 58 Comm: kallsyms_test Tainted: G T  6.1.0-next-20221215 #2
[T58] Hardware name: linux,dummy-virt (DT)
[T58] Call trace:
[T58] dump_backtrace (??:?)
[T58] show_stack (??:?)
[T58] dump_stack_lvl (??:?)
[T58] dump_stack (??:?)
[T58] __might_resched (??:?)
[T58] kallsyms_on_each_symbol (??:?)
[T58] test_perf_kallsyms_on_each_symbol (kallsyms_selftest.c:?)
[T58] test_entry (kallsyms_selftest.c:?)
[T58] kthread (kthread.c:?)
[T58] ret_from_fork (??:?)
[T58] kallsyms_selftest: kallsyms_on_each_symbol() traverse all: 5744310840 ns
[T58] kallsyms_selftest: kallsyms_on_each_match_symbol() traverse all: 1164580 ns
[T58] kallsyms_selftest: finish

The execution time of function kallsyms_on_each_match_symbol() is very
short, about ten microseconds, so the probability of this process being
interrupted is very small. And even if it happens, we just have to try
again.

The execution time of function kallsyms_on_each_symbol() is much longer,
tens of milliseconds, so context switches are likely to occur during
this period. If the time obtained by task_cputime() is accurate, it is
preferred. Otherwise, use local_clock() directly; the time taken by
irqs and high-priority tasks is not deducted because they always run
for only a short time.

Fixes: 30f3bb09778d ("kallsyms: Add self-test facility")
Reported-by: Anders Roxell <anders.roxell@linaro.org>
Signed-off-by: Zhen Lei <thunder.leizhen@huawei.com>
---
 kernel/kallsyms_selftest.c | 52 +++++++++++++++++++++++++++-----------
 1 file changed, 37 insertions(+), 15 deletions(-)

v1 --> v2:
1. Keep calling cond_resched() when CONFIG_KALLSYMS_SELFTEST=y. Instead,
   function kallsyms_on_each_match_symbol() and kallsyms_on_each_symbol()
   are not protected by local_irq_save() in kallsyms_selftest.c.

Comments

Luis Chamberlain Jan. 6, 2023, 6:25 p.m. UTC | #1
On Wed, Dec 28, 2022 at 09:45:11AM +0800, Zhen Lei wrote:
> [T58] BUG: sleeping function called from invalid context at kernel/kallsyms.c:305
> [T58] in_atomic(): 0, irqs_disabled(): 128, non_block: 0, pid: 58, name: kallsyms_test
> [T58] preempt_count: 0, expected: 0
> [T58] RCU nest depth: 0, expected: 0
> [T58] no locks held by kallsyms_test/58.
> [T58] irq event stamp: 18899904
> [T58] hardirqs last enabled at (18899903): finish_task_switch.isra.0 (core.c:?)
> [T58] hardirqs last disabled at (18899904): test_perf_kallsyms_on_each_symbol (kallsyms_selftest.c:?)
> [T58] softirqs last enabled at (18899886): __do_softirq (??:?)
> [T58] softirqs last disabled at (18899879): ____do_softirq (irq.c:?)
> [T58] CPU: 0 PID: 58 Comm: kallsyms_test Tainted: G T  6.1.0-next-20221215 #2
> [T58] Hardware name: linux,dummy-virt (DT)
> [T58] Call trace:
> [T58] dump_backtrace (??:?)
> [T58] show_stack (??:?)
> [T58] dump_stack_lvl (??:?)
> [T58] dump_stack (??:?)
> [T58] __might_resched (??:?)
> [T58] kallsyms_on_each_symbol (??:?)
> [T58] test_perf_kallsyms_on_each_symbol (kallsyms_selftest.c:?)
> [T58] test_entry (kallsyms_selftest.c:?)
> [T58] kthread (kthread.c:?)
> [T58] ret_from_fork (??:?)
> [T58] kallsyms_selftest: kallsyms_on_each_symbol() traverse all: 5744310840 ns
> [T58] kallsyms_selftest: kallsyms_on_each_match_symbol() traverse all: 1164580 ns
> [T58] kallsyms_selftest: finish
> 
> The execution time of function kallsyms_on_each_match_symbol() is very
> short, about ten microseconds, so the probability of this process being
> interrupted is very small. And even if it happens, we just have to try
> again.
> 
> The execution time of function kallsyms_on_each_symbol() is much longer,
> tens of milliseconds, so context switches are likely to occur during
> this period. If the time obtained by task_cputime() is accurate, it is
> preferred. Otherwise, use local_clock() directly; the time taken by
> irqs and high-priority tasks is not deducted because they always run
> for only a short time.
> 
> Fixes: 30f3bb09778d ("kallsyms: Add self-test facility")
> Reported-by: Anders Roxell <anders.roxell@linaro.org>
> Signed-off-by: Zhen Lei <thunder.leizhen@huawei.com>
> ---
>  kernel/kallsyms_selftest.c | 52 +++++++++++++++++++++++++++-----------
>  1 file changed, 37 insertions(+), 15 deletions(-)
> 
> v1 --> v2:
> 1. Keep calling cond_resched() when CONFIG_KALLSYMS_SELFTEST=y. Instead,
>    function kallsyms_on_each_match_symbol() and kallsyms_on_each_symbol()
>    are not protected by local_irq_save() in kallsyms_selftest.c.
> 
> diff --git a/kernel/kallsyms_selftest.c b/kernel/kallsyms_selftest.c
> @@ -270,17 +283,26 @@ static int match_symbol(void *data, unsigned long addr)
>  static void test_perf_kallsyms_on_each_match_symbol(void)
>  {
>  	u64 t0, t1;
> -	unsigned long flags;
> +	int cpu = smp_processor_id();
> +	unsigned long nr_irqs;
>  	struct test_stat stat;
>  
>  	memset(&stat, 0, sizeof(stat));
>  	stat.max = INT_MAX;
>  	stat.name = stub_name;
> -	local_irq_save(flags);
> -	t0 = sched_clock();
> -	kallsyms_on_each_match_symbol(match_symbol, stat.name, &stat);
> -	t1 = sched_clock();
> -	local_irq_restore(flags);
> +
> +	/*
> +	 * The test thread has been bound to a fixed CPU in advance. If the
> +	 * number of irqs does not change, no new scheduling request will be
> +	 * generated. That is, the performance test process is atomic.
> +	 */
> +	do {
> +		nr_irqs = kstat_cpu_irqs_sum(cpu);
> +		cond_resched();
> +		t0 = local_clock();
> +		kallsyms_on_each_match_symbol(match_symbol, stat.name, &stat);
> +		t1 = local_clock();
> +	} while (nr_irqs != kstat_cpu_irqs_sum(cpu));
>  	pr_info("kallsyms_on_each_match_symbol() traverse all: %lld ns\n", t1 - t0);

But can't we bump the number of IRQs, preempt, handle the IRQ and come
back to the same CPU with the same IRQ count and end up with a very very
false positive on delta?

  Luis
Leizhen (ThunderTown) Jan. 9, 2023, 3:07 a.m. UTC | #2
On 2023/1/7 2:25, Luis Chamberlain wrote:
> On Wed, Dec 28, 2022 at 09:45:11AM +0800, Zhen Lei wrote:
>> [T58] BUG: sleeping function called from invalid context at kernel/kallsyms.c:305
>> [T58] in_atomic(): 0, irqs_disabled(): 128, non_block: 0, pid: 58, name: kallsyms_test
>> [T58] preempt_count: 0, expected: 0
>> [T58] RCU nest depth: 0, expected: 0
>> [T58] no locks held by kallsyms_test/58.
>> [T58] irq event stamp: 18899904
>> [T58] hardirqs last enabled at (18899903): finish_task_switch.isra.0 (core.c:?)
>> [T58] hardirqs last disabled at (18899904): test_perf_kallsyms_on_each_symbol (kallsyms_selftest.c:?)
>> [T58] softirqs last enabled at (18899886): __do_softirq (??:?)
>> [T58] softirqs last disabled at (18899879): ____do_softirq (irq.c:?)
>> [T58] CPU: 0 PID: 58 Comm: kallsyms_test Tainted: G T  6.1.0-next-20221215 #2
>> [T58] Hardware name: linux,dummy-virt (DT)
>> [T58] Call trace:
>> [T58] dump_backtrace (??:?)
>> [T58] show_stack (??:?)
>> [T58] dump_stack_lvl (??:?)
>> [T58] dump_stack (??:?)
>> [T58] __might_resched (??:?)
>> [T58] kallsyms_on_each_symbol (??:?)
>> [T58] test_perf_kallsyms_on_each_symbol (kallsyms_selftest.c:?)
>> [T58] test_entry (kallsyms_selftest.c:?)
>> [T58] kthread (kthread.c:?)
>> [T58] ret_from_fork (??:?)
>> [T58] kallsyms_selftest: kallsyms_on_each_symbol() traverse all: 5744310840 ns
>> [T58] kallsyms_selftest: kallsyms_on_each_match_symbol() traverse all: 1164580 ns
>> [T58] kallsyms_selftest: finish
>>
>> The execution time of function kallsyms_on_each_match_symbol() is very
>> short, about ten microseconds, so the probability of this process being
>> interrupted is very small. And even if it happens, we just have to try
>> again.
>>
>> The execution time of function kallsyms_on_each_symbol() is much longer,
>> tens of milliseconds, so context switches are likely to occur during
>> this period. If the time obtained by task_cputime() is accurate, it is
>> preferred. Otherwise, use local_clock() directly; the time taken by
>> irqs and high-priority tasks is not deducted because they always run
>> for only a short time.
>>
>> Fixes: 30f3bb09778d ("kallsyms: Add self-test facility")
>> Reported-by: Anders Roxell <anders.roxell@linaro.org>
>> Signed-off-by: Zhen Lei <thunder.leizhen@huawei.com>
>> ---
>>  kernel/kallsyms_selftest.c | 52 +++++++++++++++++++++++++++-----------
>>  1 file changed, 37 insertions(+), 15 deletions(-)
>>
>> v1 --> v2:
>> 1. Keep calling cond_resched() when CONFIG_KALLSYMS_SELFTEST=y. Instead,
>>    function kallsyms_on_each_match_symbol() and kallsyms_on_each_symbol()
>>    are not protected by local_irq_save() in kallsyms_selftest.c.
>>
>> diff --git a/kernel/kallsyms_selftest.c b/kernel/kallsyms_selftest.c
>> @@ -270,17 +283,26 @@ static int match_symbol(void *data, unsigned long addr)
>>  static void test_perf_kallsyms_on_each_match_symbol(void)
>>  {
>>  	u64 t0, t1;
>> -	unsigned long flags;
>> +	int cpu = smp_processor_id();
>> +	unsigned long nr_irqs;
>>  	struct test_stat stat;
>>  
>>  	memset(&stat, 0, sizeof(stat));
>>  	stat.max = INT_MAX;
>>  	stat.name = stub_name;
>> -	local_irq_save(flags);
>> -	t0 = sched_clock();
>> -	kallsyms_on_each_match_symbol(match_symbol, stat.name, &stat);
>> -	t1 = sched_clock();
>> -	local_irq_restore(flags);
>> +
>> +	/*
>> +	 * The test thread has been bound to a fixed CPU in advance. If the
>> +	 * number of irqs does not change, no new scheduling request will be
>> +	 * generated. That is, the performance test process is atomic.
>> +	 */
>> +	do {
>> +		nr_irqs = kstat_cpu_irqs_sum(cpu);
>> +		cond_resched();
>> +		t0 = local_clock();
>> +		kallsyms_on_each_match_symbol(match_symbol, stat.name, &stat);
>> +		t1 = local_clock();
>> +	} while (nr_irqs != kstat_cpu_irqs_sum(cpu));
>>  	pr_info("kallsyms_on_each_match_symbol() traverse all: %lld ns\n", t1 - t0);
> 
> But can't we bump the number of IRQs, preempt, handle the IRQ and come
> back to the same CPU with the same IRQ count and end up with a very very
> false positive on delta?

Do you mean the total time minus the processing time of interrupts? That
seems a little more complicated to implement, because soft interrupts and
task switches may also happen. Although the current implementation contains
a while loop, it actually retries at most one more time. And I think it's
easier to understand, without worrying about side effects in the corner
cases.

> 
>   Luis
> .
>
Petr Mladek Jan. 9, 2023, 1:40 p.m. UTC | #3
On Wed 2022-12-28 09:45:11, Zhen Lei wrote:
> [T58] BUG: sleeping function called from invalid context at kernel/kallsyms.c:305
> [T58] in_atomic(): 0, irqs_disabled(): 128, non_block: 0, pid: 58, name: kallsyms_test
> [T58] preempt_count: 0, expected: 0
> [T58] RCU nest depth: 0, expected: 0
> [T58] no locks held by kallsyms_test/58.
> [T58] irq event stamp: 18899904
> [T58] hardirqs last enabled at (18899903): finish_task_switch.isra.0 (core.c:?)
> [T58] hardirqs last disabled at (18899904): test_perf_kallsyms_on_each_symbol (kallsyms_selftest.c:?)
> [T58] softirqs last enabled at (18899886): __do_softirq (??:?)
> [T58] softirqs last disabled at (18899879): ____do_softirq (irq.c:?)
> [T58] CPU: 0 PID: 58 Comm: kallsyms_test Tainted: G T  6.1.0-next-20221215 #2
> [T58] Hardware name: linux,dummy-virt (DT)
> [T58] Call trace:
> [T58] dump_backtrace (??:?)
> [T58] show_stack (??:?)
> [T58] dump_stack_lvl (??:?)
> [T58] dump_stack (??:?)
> [T58] __might_resched (??:?)
> [T58] kallsyms_on_each_symbol (??:?)
> [T58] test_perf_kallsyms_on_each_symbol (kallsyms_selftest.c:?)
> [T58] test_entry (kallsyms_selftest.c:?)
> [T58] kthread (kthread.c:?)
> [T58] ret_from_fork (??:?)
> [T58] kallsyms_selftest: kallsyms_on_each_symbol() traverse all: 5744310840 ns
> [T58] kallsyms_selftest: kallsyms_on_each_match_symbol() traverse all: 1164580 ns
> [T58] kallsyms_selftest: finish
> 
> The execution time of function kallsyms_on_each_match_symbol() is very
> short, about ten microseconds, so the probability of this process being
> interrupted is very small. And even if it happens, we just have to try
> again.
> 
> The execution time of function kallsyms_on_each_symbol() is much longer,
> tens of milliseconds, so context switches are likely to occur during
> this period. If the time obtained by task_cputime() is accurate, it is
> preferred. Otherwise, use local_clock() directly; the time taken by
> irqs and high-priority tasks is not deducted because they always run
> for only a short time.
> 
> Fixes: 30f3bb09778d ("kallsyms: Add self-test facility")
> Reported-by: Anders Roxell <anders.roxell@linaro.org>
> Signed-off-by: Zhen Lei <thunder.leizhen@huawei.com>
> ---
>  kernel/kallsyms_selftest.c | 52 +++++++++++++++++++++++++++-----------
>  1 file changed, 37 insertions(+), 15 deletions(-)
> 
> v1 --> v2:
> 1. Keep calling cond_resched() when CONFIG_KALLSYMS_SELFTEST=y. Instead,
>    function kallsyms_on_each_match_symbol() and kallsyms_on_each_symbol()
>    are not protected by local_irq_save() in kallsyms_selftest.c.
> 
> diff --git a/kernel/kallsyms_selftest.c b/kernel/kallsyms_selftest.c
> index f35d9cc1aab1544..9c94f06aa951971 100644
> --- a/kernel/kallsyms_selftest.c
> +++ b/kernel/kallsyms_selftest.c
> @@ -12,9 +12,11 @@
>  #include <linux/init.h>
>  #include <linux/module.h>
>  #include <linux/kallsyms.h>
> +#include <linux/kernel_stat.h>
> +#include <linux/kthread.h>
>  #include <linux/random.h>
>  #include <linux/sched/clock.h>
> -#include <linux/kthread.h>
> +#include <linux/sched/cputime.h>
>  #include <linux/vmalloc.h>
>  
>  #include "kallsyms_internal.h"
> @@ -161,9 +163,9 @@ static int lookup_name(void *data, const char *name, struct module *mod, unsigne
>  	struct test_stat *stat = (struct test_stat *)data;
>  
>  	local_irq_save(flags);
> -	t0 = sched_clock();
> +	t0 = local_clock();
>  	(void)kallsyms_lookup_name(name);
> -	t1 = sched_clock();
> +	t1 = local_clock();
>  	local_irq_restore(flags);
>  
>  	t = t1 - t0;
> @@ -233,19 +235,30 @@ static int find_symbol(void *data, const char *name, struct module *mod, unsigne
>  
>  static void test_perf_kallsyms_on_each_symbol(void)
>  {
> -	u64 t0, t1;
> -	unsigned long flags;
> +	bool accurate;
> +	u64 utime, t0, t1;
>  	struct test_stat stat;
>  
>  	memset(&stat, 0, sizeof(stat));
>  	stat.max = INT_MAX;
>  	stat.name = stub_name;
>  	stat.perf = 1;
> -	local_irq_save(flags);
> -	t0 = sched_clock();
> +
> +	/*
> +	 * This test process takes tens of milliseconds, context switches may
> +	 * occur during this period. If task_cputime() returns true, it is
> +	 * accurate enough. Otherwise, use local_clock() directly. This is
> +	 * based on the assumption that irqs and high-priority tasks are always
> +	 * running for a short time, they may cause less error.
> +	 */
> +	accurate = task_cputime(current, &utime, &t0);
> +	if (!accurate)
> +		t0 = local_clock();
>  	kallsyms_on_each_symbol(find_symbol, &stat);
> -	t1 = sched_clock();
> -	local_irq_restore(flags);
> +	if (accurate)
> +		task_cputime(current, &utime, &t1);
> +	else
> +		t1 = local_clock();
>  	pr_info("kallsyms_on_each_symbol() traverse all: %lld ns\n", t1 - t0);
>  }
>  
> @@ -270,17 +283,26 @@ static int match_symbol(void *data, unsigned long addr)
>  static void test_perf_kallsyms_on_each_match_symbol(void)
>  {
>  	u64 t0, t1;
> -	unsigned long flags;
> +	int cpu = smp_processor_id();
> +	unsigned long nr_irqs;
>  	struct test_stat stat;
>  
>  	memset(&stat, 0, sizeof(stat));
>  	stat.max = INT_MAX;
>  	stat.name = stub_name;
> -	local_irq_save(flags);
> -	t0 = sched_clock();
> -	kallsyms_on_each_match_symbol(match_symbol, stat.name, &stat);
> -	t1 = sched_clock();
> -	local_irq_restore(flags);
> +
> +	/*
> +	 * The test thread has been bound to a fixed CPU in advance. If the
> +	 * number of irqs does not change, no new scheduling request will be
> +	 * generated. That is, the performance test process is atomic.
> +	 */
> +	do {
> +		nr_irqs = kstat_cpu_irqs_sum(cpu);
> +		cond_resched();
> +		t0 = local_clock();
> +		kallsyms_on_each_match_symbol(match_symbol, stat.name, &stat);
> +		t1 = local_clock();
> +	} while (nr_irqs != kstat_cpu_irqs_sum(cpu));

Huh, is this guaranteed to ever finish?

What if there is a regression and kallsyms_on_each_match_symbol()
never finishes without rescheduling?

This is yet another unreliable hack.


Use standard solution:

I did the homework for you and checked how the "time" command
measures the time spent in the system. It actually uses several methods.

One is times() syscall. It uses thread_group_cputime_adjusted(), see
do_sys_times() in kernel/sys.c

Or it uses wait4() syscall to get struct rusage that provides this
information.

Please, stop inventing crazy hacks, and use these standard methods.
If the "time" tool is enough for userspace performance tests
then it must be enough in this case as well.
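For reference, the userspace face of one of these standard methods can be sketched as follows; times() is the syscall backed by thread_group_cputime_adjusted(), and the helper name here is illustrative:

```c
#include <sys/times.h>
#include <unistd.h>

/*
 * Process CPU time (user + system) in seconds, obtained the way the
 * "time" command does it: the times() syscall, which the kernel
 * serves from thread_group_cputime_adjusted().
 */
static double cpu_seconds(void)
{
	struct tms t;

	times(&t);
	return (double)(t.tms_utime + t.tms_stime) / sysconf(_SC_CLK_TCK);
}
```

A benchmark would take cpu_seconds() before and after the workload; the difference excludes time stolen by other tasks, which is what makes the numbers comparable across runs.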


Or remove this test:

Seriously, what is the value of this test?
Is anyone going to use it at all in the future?

The code was useful when developing the optimization.
But it is really questionable as a selftest.

Selftests usually check if the code works as expected.
This test provides some number that is hardly comparable.


Why are they hardly comparable?

1. The speed depends on the number of loaded modules
   and number of symbols. It highly depends on the configuration
   that was used to build the kernel.

2. The test runs only once. As a result it is hard to judge
   how big the noise is.

3. The noise might depend on the size and state of CPU caches.


I personally vote for removing this selftest!

Best Regards,
Petr
Luis Chamberlain Jan. 10, 2023, 12:12 a.m. UTC | #4
On Mon, Jan 09, 2023 at 02:40:27PM +0100, Petr Mladek wrote:
> Why are they hardly comparable?
> 
> 1. The speed depends on the number of loaded modules
>    and number of symbols. It highly depends on the configuration
>    that was used to build the kernel.
> 
> 2. The test runs only once. As a result it is hard to judge
>    how big the noise is.
> 
> 3. The noise might depend on the size and state of CPU caches.
> 
> 
> I personally vote for removing this selftest!

Even so, just as with testing a filesystem with different types of
configurations, at least testing a few configs helps, and it's what
we do. Also, if anyone ever wanted to try to increase the performance
of symbol lookup today, they would have no easy way to measure things. How
would they go about comparing performance without this selftest?

This selftest helps generically with that *and* lets us peg onto it any sanity
checks you may wish to add to those APIs that we just don't want to do
upstream.

That was the rationale behind it, just as with any other selftest.

However, if measuring the time is not possible, that's a separate topic.
But to say that measuring time on some config is not valuable is, I think,
not a fair statement. Yes, the points about noise are good ones, but
the test can be enhanced for that too. And any patch anyone would
propose in the future for new enhancements would likely
be looking at obvious gains beyond any kind of noise.

  Luis
Leizhen (ThunderTown) Jan. 10, 2023, 4:05 a.m. UTC | #5
On 2023/1/9 21:40, Petr Mladek wrote:
> On Wed 2022-12-28 09:45:11, Zhen Lei wrote:
>> [T58] BUG: sleeping function called from invalid context at kernel/kallsyms.c:305
>> [T58] in_atomic(): 0, irqs_disabled(): 128, non_block: 0, pid: 58, name: kallsyms_test
>> [T58] preempt_count: 0, expected: 0
>> [T58] RCU nest depth: 0, expected: 0
>> [T58] no locks held by kallsyms_test/58.
>> [T58] irq event stamp: 18899904
>> [T58] hardirqs last enabled at (18899903): finish_task_switch.isra.0 (core.c:?)
>> [T58] hardirqs last disabled at (18899904): test_perf_kallsyms_on_each_symbol (kallsyms_selftest.c:?)
>> [T58] softirqs last enabled at (18899886): __do_softirq (??:?)
>> [T58] softirqs last disabled at (18899879): ____do_softirq (irq.c:?)
>> [T58] CPU: 0 PID: 58 Comm: kallsyms_test Tainted: G T  6.1.0-next-20221215 #2
>> [T58] Hardware name: linux,dummy-virt (DT)
>> [T58] Call trace:
>> [T58] dump_backtrace (??:?)
>> [T58] show_stack (??:?)
>> [T58] dump_stack_lvl (??:?)
>> [T58] dump_stack (??:?)
>> [T58] __might_resched (??:?)
>> [T58] kallsyms_on_each_symbol (??:?)
>> [T58] test_perf_kallsyms_on_each_symbol (kallsyms_selftest.c:?)
>> [T58] test_entry (kallsyms_selftest.c:?)
>> [T58] kthread (kthread.c:?)
>> [T58] ret_from_fork (??:?)
>> [T58] kallsyms_selftest: kallsyms_on_each_symbol() traverse all: 5744310840 ns
>> [T58] kallsyms_selftest: kallsyms_on_each_match_symbol() traverse all: 1164580 ns
>> [T58] kallsyms_selftest: finish
>>
>> The execution time of function kallsyms_on_each_match_symbol() is very
>> short, about ten microseconds, so the probability of this process being
>> interrupted is very small. And even if it happens, we just have to try
>> again.
>>
>> The execution time of function kallsyms_on_each_symbol() is much longer,
>> tens of milliseconds, so context switches are likely to occur during
>> this period. If the time obtained by task_cputime() is accurate, it is
>> preferred. Otherwise, use local_clock() directly; the time taken by
>> irqs and high-priority tasks is not deducted because they always run
>> for only a short time.
>>
>> Fixes: 30f3bb09778d ("kallsyms: Add self-test facility")
>> Reported-by: Anders Roxell <anders.roxell@linaro.org>
>> Signed-off-by: Zhen Lei <thunder.leizhen@huawei.com>
>> ---
>>  kernel/kallsyms_selftest.c | 52 +++++++++++++++++++++++++++-----------
>>  1 file changed, 37 insertions(+), 15 deletions(-)
>>
>> v1 --> v2:
>> 1. Keep calling cond_resched() when CONFIG_KALLSYMS_SELFTEST=y. Instead,
>>    function kallsyms_on_each_match_symbol() and kallsyms_on_each_symbol()
>>    are not protected by local_irq_save() in kallsyms_selftest.c.
>>
>> diff --git a/kernel/kallsyms_selftest.c b/kernel/kallsyms_selftest.c
>> index f35d9cc1aab1544..9c94f06aa951971 100644
>> --- a/kernel/kallsyms_selftest.c
>> +++ b/kernel/kallsyms_selftest.c
>> @@ -12,9 +12,11 @@
>>  #include <linux/init.h>
>>  #include <linux/module.h>
>>  #include <linux/kallsyms.h>
>> +#include <linux/kernel_stat.h>
>> +#include <linux/kthread.h>
>>  #include <linux/random.h>
>>  #include <linux/sched/clock.h>
>> -#include <linux/kthread.h>
>> +#include <linux/sched/cputime.h>
>>  #include <linux/vmalloc.h>
>>  
>>  #include "kallsyms_internal.h"
>> @@ -161,9 +163,9 @@ static int lookup_name(void *data, const char *name, struct module *mod, unsigne
>>  	struct test_stat *stat = (struct test_stat *)data;
>>  
>>  	local_irq_save(flags);
>> -	t0 = sched_clock();
>> +	t0 = local_clock();
>>  	(void)kallsyms_lookup_name(name);
>> -	t1 = sched_clock();
>> +	t1 = local_clock();
>>  	local_irq_restore(flags);
>>  
>>  	t = t1 - t0;
>> @@ -233,19 +235,30 @@ static int find_symbol(void *data, const char *name, struct module *mod, unsigne
>>  
>>  static void test_perf_kallsyms_on_each_symbol(void)
>>  {
>> -	u64 t0, t1;
>> -	unsigned long flags;
>> +	bool accurate;
>> +	u64 utime, t0, t1;
>>  	struct test_stat stat;
>>  
>>  	memset(&stat, 0, sizeof(stat));
>>  	stat.max = INT_MAX;
>>  	stat.name = stub_name;
>>  	stat.perf = 1;
>> -	local_irq_save(flags);
>> -	t0 = sched_clock();
>> +
>> +	/*
>> +	 * This test process takes tens of milliseconds, context switches may
>> +	 * occur during this period. If task_cputime() returns true, it is
>> +	 * accurate enough. Otherwise, use local_clock() directly. This is
>> +	 * based on the assumption that irqs and high-priority tasks are always
>> +	 * running for a short time, they may cause less error.
>> +	 */
>> +	accurate = task_cputime(current, &utime, &t0);
>> +	if (!accurate)
>> +		t0 = local_clock();
>>  	kallsyms_on_each_symbol(find_symbol, &stat);
>> -	t1 = sched_clock();
>> -	local_irq_restore(flags);
>> +	if (accurate)
>> +		task_cputime(current, &utime, &t1);
>> +	else
>> +		t1 = local_clock();
>>  	pr_info("kallsyms_on_each_symbol() traverse all: %lld ns\n", t1 - t0);
>>  }
>>  
>> @@ -270,17 +283,26 @@ static int match_symbol(void *data, unsigned long addr)
>>  static void test_perf_kallsyms_on_each_match_symbol(void)
>>  {
>>  	u64 t0, t1;
>> -	unsigned long flags;
>> +	int cpu = smp_processor_id();
>> +	unsigned long nr_irqs;
>>  	struct test_stat stat;
>>  
>>  	memset(&stat, 0, sizeof(stat));
>>  	stat.max = INT_MAX;
>>  	stat.name = stub_name;
>> -	local_irq_save(flags);
>> -	t0 = sched_clock();
>> -	kallsyms_on_each_match_symbol(match_symbol, stat.name, &stat);
>> -	t1 = sched_clock();
>> -	local_irq_restore(flags);
>> +
>> +	/*
>> +	 * The test thread has been bound to a fixed CPU in advance. If the
>> +	 * number of irqs does not change, no new scheduling request will be
>> +	 * generated. That is, the performance test process is atomic.
>> +	 */
>> +	do {
>> +		nr_irqs = kstat_cpu_irqs_sum(cpu);
>> +		cond_resched();
>> +		t0 = local_clock();
>> +		kallsyms_on_each_match_symbol(match_symbol, stat.name, &stat);
>> +		t1 = local_clock();
>> +	} while (nr_irqs != kstat_cpu_irqs_sum(cpu));
> 
> Huh, is this guaranteed to ever finish?
> 
> What if there is a regression and kallsyms_on_each_match_symbol()
> never finishes without rescheduling?

The execution time of kallsyms_on_each_match_symbol() does not exceed 10 us.
Assume that an interrupt is generated every 100 us (10000 interrupts per
second, which is very high). In that case, interrupts and high-priority tasks
would need to run for more than 90 us each time to prevent the loop from
exiting normally. That would mean a CPU usage above 90%, which is
unreasonable.

Function kallsyms_on_each_symbol() takes a long time, about 20 milliseconds, and
I'm already using task_cputime().

> 
> This is yet another unreliable hack.
> 
> 
> Use standard solution:
> 
> I did the homework for you and checked how the "time" command
> measures the time spent in the system. It actually uses several methods.
> 
> One is times() syscall. It uses thread_group_cputime_adjusted(), see
> do_sys_times() in kernel/sys.c
> 
> Or it uses wait4() syscall to get struct rusage that provides this
> information.
> 
> Please, stop inventing crazy hacks, and use these standard methods.
> If the "time" tool is enough for userspace performance tests
> then it must be enough in this case as well.

Okay, I'll study it in depth; I'm just worried about the different
precision requirements.

By the way, we still have to actively embrace new things.

In fact, I've thought of another way: measure nine times, sort the
results in ascending order, and take the middle one. By basic
statistics, that median is reliable.
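That idea is easy to sketch; the sample count (nine) and the use of the middle element come straight from the suggestion above, while the function names are hypothetical:

```c
#include <stdlib.h>

static int cmp_u64(const void *a, const void *b)
{
	unsigned long long x = *(const unsigned long long *)a;
	unsigned long long y = *(const unsigned long long *)b;

	return (x > y) - (x < y);
}

/*
 * Sort nine timing samples in ascending order and return the middle
 * one. A few interrupted runs show up as outliers at the top of the
 * sorted array and cannot affect the median.
 */
static unsigned long long median_of_nine(unsigned long long s[9])
{
	qsort(s, 9, sizeof(s[0]), cmp_u64);
	return s[4];
}
```

For example, two samples inflated by interrupts would land in positions 7 and 8 after sorting, leaving the reported median untouched.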

> 
> 
> Or remove this test:
> 
> Seriously, what is the value of this test?
> Is anyone going to use it at all in the future?

There's really someone interested in performance data.
https://lkml.org/lkml/2022/12/15/134

> 
> The code was useful when developing the optimization.
> But it is really questionable as a selftest.
> 
> Selftests usually check if the code works as expected.
> This test provides some number that is hardly comparable.
> 
> 
> > Why are they hardly comparable?
> 
> 1. The speed depends on the number of loaded modules
>    and number of symbols. It highly depends on the configuration
>    that was used to build the kernel.
> 
> > 2. The test runs only once. As a result it is hard to judge
> >    how big the noise is.
> 
> 3. The noise might depend on the size and state of CPU caches.
> 
> 
> I personally vote for removing this selftest!
> 
> Best Regards,
> Petr
> .
>
Petr Mladek Jan. 10, 2023, 9:46 a.m. UTC | #6
On Tue 2023-01-10 12:05:20, Leizhen (ThunderTown) wrote:
> 
> 
> On 2023/1/9 21:40, Petr Mladek wrote:
> > On Wed 2022-12-28 09:45:11, Zhen Lei wrote:
> >> [T58] BUG: sleeping function called from invalid context at kernel/kallsyms.c:305
> >> The execution time of function kallsyms_on_each_match_symbol() is very
> >> short, about ten microseconds, so the probability of this process being
> >> interrupted is very small. And even if it happens, we just have to try
> >> again.
> >>
> >> The execution time of function kallsyms_on_each_symbol() is much longer,
> >> tens of milliseconds, so context switches are likely to occur during
> >> this period. If the time obtained by task_cputime() is accurate, it is
> >> preferred. Otherwise, use local_clock() directly; the time taken by
> >> irqs and high-priority tasks is not deducted because they always run
> >> for only a short time.
> >>
> >> --- a/kernel/kallsyms_selftest.c
> >> +++ b/kernel/kallsyms_selftest.c
> >> +	/*
> >> +	 * The test thread has been bound to a fixed CPU in advance. If the
> >> +	 * number of irqs does not change, no new scheduling request will be
> >> +	 * generated. That is, the performance test process is atomic.
> >> +	 */
> >> +	do {
> >> +		nr_irqs = kstat_cpu_irqs_sum(cpu);
> >> +		cond_resched();
> >> +		t0 = local_clock();
> >> +		kallsyms_on_each_match_symbol(match_symbol, stat.name, &stat);
> >> +		t1 = local_clock();
> >> +	} while (nr_irqs != kstat_cpu_irqs_sum(cpu));
> > 
> > Huh, is this guaranteed to ever finish?
> > 
> > What if there is a regression and kallsyms_on_each_match_symbol()
> > never finishes without rescheduling?
> 
> The execution time of kallsyms_on_each_match_symbol() does not exceed 10 us.
> Assume that an interrupt is generated every 100 us (10000 interrupts per
> second, which is very high). In that case, interrupts and high-priority tasks
> would need to run for more than 90 us each time to prevent the loop from
> exiting normally. That would mean a CPU usage above 90%, which is
> unreasonable.
> 
> Function kallsyms_on_each_symbol() takes a long time, about 20 milliseconds, and
> I'm already using task_cputime().

IMHO, this is a wrong mind set.

After all, this tests a function that was optimized because it took
too long. The function even contains cond_resched() because it caused
problems. I know that kallsyms_on_each_match_symbol() is
faster because it searches only one module, but still.

The cond_resched() called before taking t0 is just a horrible
non-reliable hack.

I have seen many problematic, non-reliable tests that relied
on timing. They just hoped for the best. I am sure that we could
do better.

> > This is yet another unreliable hack.
> > 
> > 
> > Use standard solution:
> > 
> > I did the homework for you and checked how the "time" command
> > measures the time spent in the system. It actually uses several methods.
> > 
> > One is times() syscall. It uses thread_group_cputime_adjusted(), see
> > do_sys_times() in kernel/sys.c
> > 
> > Or it uses wait4() syscall to get struct rusage that provides this
> > information.
> > 
> > Please, stop inventing crazy hacks, and use these standard methods.
> > If the "time" tool is enough for userspace performance tests
> > then it must be enough in this case as well.
> 
> Okay, I'll study in depth, just worried about the different precision
> requirements.

If the problem is timer precision, then the solution would be
to repeat the operation more times or use a better clock source.
It is actually good practice to repeat the action many times
in performance tests because it helps to reduce noise.

Well, thread_group_cputime_adjusted() works with
struct task_cputime. It seems to store the time in nanoseconds,
see include/linux/sched/types.h. I might be wrong, but I would expect
that it is accurate and precise enough.

> By the way, we still have to actively embrace new things.
> 
> In fact, I've thought of another way: measure nine times, sort the
> results in ascending order, and take the middle one. Based on
> probability and statistics, the intermediate result is reliable.
> > 
> > Or remove this test:
> > 
> > Seriously, what is the value of this test?
> > Is anyone going to use it at all in the future?
> 
> There's really someone interested in performance data.
> https://lkml.org/lkml/2022/12/15/134

I know. But this code had very bad performance for years
and nobody cared. I am not sure if anyone would care
about the info printed by this selftest.

I am not sure if it is worth the effort. We have already
spent quite some time with attempts to make it usable
and are still not there.

Best Regards,
Petr
Petr Mladek Jan. 10, 2023, 9:57 a.m. UTC | #7
On Mon 2023-01-09 16:12:53, Luis Chamberlain wrote:
> On Mon, Jan 09, 2023 at 02:40:27PM +0100, Petr Mladek wrote:
> > Why are they hardly comparable?
> > 
> > 1. The speed depends on the number of loaded modules
> >    and number of symbols. It highly depends on the configuration
> >    that was used to build the kernel.
> > 
> > 2. The test runs only once. As a result it is hard to judge
> >    how big is the noise.
> > 
> > 3. The noise might depend on the size and state of CPU caches.
> > 
> > 
> > I personally vote for removing this selftest!
> 
> Even so, just as with testing a filesystem with different types of
> configurations, at least testing a few configs helps and it's what
> we do. Then, if anyone ever wanted to try to increase performance
> on symbol lookup today they have no easy way to measure things. How
> would they go about comparing performance without this selftest?

How many people care about kallsyms performance, please?
Is it worth spending time on implementing and maintaining such a
selftest?

Yes, Zhen wanted to make it faster. But how likely will anyone else
try to make it even better? Do we need to spend time on this
in the meantime?


> This selftest helps generically with that *and* helps peg on to it any sanity
> checks you may wish to add to those APIs which we just don't want to do
> upstream.

From my POV, it would be much more important to check if the API
works as expected. I mean that it gives the right results.

I am not sure that performance is that important to spend more time
on this one.

Also I am not sure if selftests are the right location for performance
tests. My understanding is that it is a framework for functional
testing: it shows whether the tests passed or not. But performance
tests do not give "pass or not" results.

Best Regards,
Petr
Leizhen (ThunderTown) Jan. 10, 2023, 12:20 p.m. UTC | #8
On 2023/1/10 17:57, Petr Mladek wrote:
> On Mon 2023-01-09 16:12:53, Luis Chamberlain wrote:
>> On Mon, Jan 09, 2023 at 02:40:27PM +0100, Petr Mladek wrote:
>>> Why are they hardly comparable?
>>>
>>> 1. The speed depends on the number of loaded modules
>>>    and number of symbols. It highly depends on the configuration
>>>    that was used to build the kernel.
>>>
>>> 2. The test runs only once. As a result it is hard to judge
>>>    how big is the noise.
>>>
>>> 3. The noise might depend on the size and state of CPU caches.
>>>
>>>
>>> I personally vote for removing this selftest!
>>
>> Even so, just as with testing a filesystem with different types of
>> configurations, at least testing a few configs helps and it's what
>> we do. Then, if anyone ever wanted to try to increase performance
>> on symbol lookup today they have no easy way to measure things. How
>> would they go about comparing performance without this selftest?
> 
> How many people care about kallsyms performance, please?
> Is it worth spending time on implementing and maintaining such a
> selftest?
> 
> Yes, Zhen wanted to make it faster. But how likely will anyone else
> try to make it even better? Do we need to spend time on this
> in the meantime?
> 
> 
>> This selftest helps generically with that *and* helps peg on to it any sanity
>> checks you may wish to add to those APIs which we just don't want to do
>> upstream.
> 
> From my POV, it would be much more important to check if the API
> works as expected. I mean that it gives the right results.
> 
> I am not sure that performance is that important to spend more time
> on this one.
> 
> Also I am not sure if selftests are the right location for performance
> tests. My understanding is that it is a framework for functional
> testing: it shows whether the tests passed or not. But performance
> tests do not give "pass or not" results.

I think both of you have a point. In the absence of a better way to
improve it, perhaps deleting the performance test is the best option
for now. OK, I'll do that first.

> 
> Best Regards,
> Petr
> .
>
Luis Chamberlain Jan. 10, 2023, 5:12 p.m. UTC | #9
On Tue, Jan 10, 2023 at 10:57:21AM +0100, Petr Mladek wrote:
> On Mon 2023-01-09 16:12:53, Luis Chamberlain wrote:
> > On Mon, Jan 09, 2023 at 02:40:27PM +0100, Petr Mladek wrote:
> > > Why are they hardly comparable?
> > > 
> > > 1. The speed depends on the number of loaded modules
> > >    and number of symbols. It highly depends on the configuration
> > >    that was used to build the kernel.
> > > 
> > > 2. The test runs only once. As a result it is hard to judge
> > >    how big is the noise.
> > > 
> > > 3. The noise might depend on the size and state of CPU caches.
> > > 
> > > 
> > > I personally vote for removing this selftest!
> > 
> > Even so, just as with testing a filesystem with different types of
> > configurations, at least testing a few configs helps and it's what
> > we do. Then, if anyone ever wanted to try to increase performance
> > on symbol lookup today they have no easy way to measure things. How
> > would they go about comparing performance without this selftest?
> 
> How many people care about kallsyms performance, please?
> Is it worth spending time on implementing and maintaining such a
> selftest?

If someone is willing to *try* to put in the effort to do it as they are
optimizing it, then by all means it is a welcome effort.

> Yes, Zhen wanted to make it faster. But how likely will anyone else
> try to make it even better? Do we need to spend time on this
> in the meantime?

I can't say.

> > This selftest helps generically with that *and* helps peg on to it any sanity
> > checks you may wish to add to those APIs which we just don't want to do
> > upstream.
> 
> From my POV, it would be much more important to check if the API
> works as expected. I mean that it gives the right results.

Sure, but that's just one aspect of it. And before the selftests we
didn't have that either.

> I am not sure that performance is that important to spend more time
> on this one.
> 
> Also I am not sure if selftests are the right location for performance
> tests. My understanding is that it is a framework for functional
> testing: it shows whether the tests passed or not. But performance
> tests do not give "pass or not" results.

Selftests have no rules; you do what you want, for your own use. It
is up to you for your own subsystem.

But I do agree that if we want performance data, it should be reliable,
and so if that cannot be done today then it is best to remove the test
until it can be done.

  Luis
diff mbox series

Patch

diff --git a/kernel/kallsyms_selftest.c b/kernel/kallsyms_selftest.c
index f35d9cc1aab1544..9c94f06aa951971 100644
--- a/kernel/kallsyms_selftest.c
+++ b/kernel/kallsyms_selftest.c
@@ -12,9 +12,11 @@ 
 #include <linux/init.h>
 #include <linux/module.h>
 #include <linux/kallsyms.h>
+#include <linux/kernel_stat.h>
+#include <linux/kthread.h>
 #include <linux/random.h>
 #include <linux/sched/clock.h>
-#include <linux/kthread.h>
+#include <linux/sched/cputime.h>
 #include <linux/vmalloc.h>
 
 #include "kallsyms_internal.h"
@@ -161,9 +163,9 @@  static int lookup_name(void *data, const char *name, struct module *mod, unsigne
 	struct test_stat *stat = (struct test_stat *)data;
 
 	local_irq_save(flags);
-	t0 = sched_clock();
+	t0 = local_clock();
 	(void)kallsyms_lookup_name(name);
-	t1 = sched_clock();
+	t1 = local_clock();
 	local_irq_restore(flags);
 
 	t = t1 - t0;
@@ -233,19 +235,30 @@  static int find_symbol(void *data, const char *name, struct module *mod, unsigne
 
 static void test_perf_kallsyms_on_each_symbol(void)
 {
-	u64 t0, t1;
-	unsigned long flags;
+	bool accurate;
+	u64 utime, t0, t1;
 	struct test_stat stat;
 
 	memset(&stat, 0, sizeof(stat));
 	stat.max = INT_MAX;
 	stat.name = stub_name;
 	stat.perf = 1;
-	local_irq_save(flags);
-	t0 = sched_clock();
+
+	/*
+	 * This test takes tens of milliseconds, so context switches may
+	 * occur during this period. If task_cputime() returns true, the
+	 * result is accurate enough. Otherwise, use local_clock() directly,
+	 * on the assumption that irqs and high-priority tasks usually run
+	 * only briefly and therefore introduce little error.
+	 */
+	accurate = task_cputime(current, &utime, &t0);
+	if (!accurate)
+		t0 = local_clock();
 	kallsyms_on_each_symbol(find_symbol, &stat);
-	t1 = sched_clock();
-	local_irq_restore(flags);
+	if (accurate)
+		task_cputime(current, &utime, &t1);
+	else
+		t1 = local_clock();
 	pr_info("kallsyms_on_each_symbol() traverse all: %lld ns\n", t1 - t0);
 }
 
@@ -270,17 +283,26 @@  static int match_symbol(void *data, unsigned long addr)
 static void test_perf_kallsyms_on_each_match_symbol(void)
 {
 	u64 t0, t1;
-	unsigned long flags;
+	int cpu = smp_processor_id();
+	unsigned long nr_irqs;
 	struct test_stat stat;
 
 	memset(&stat, 0, sizeof(stat));
 	stat.max = INT_MAX;
 	stat.name = stub_name;
-	local_irq_save(flags);
-	t0 = sched_clock();
-	kallsyms_on_each_match_symbol(match_symbol, stat.name, &stat);
-	t1 = sched_clock();
-	local_irq_restore(flags);
+
+	/*
+	 * The test thread has been bound to a fixed CPU in advance. If the
+	 * irq count of that CPU does not change across the measurement, no
+	 * new scheduling request was generated, i.e. the measured section
+	 * ran atomically.
+	 */
+	do {
+		nr_irqs = kstat_cpu_irqs_sum(cpu);
+		cond_resched();
+		t0 = local_clock();
+		kallsyms_on_each_match_symbol(match_symbol, stat.name, &stat);
+		t1 = local_clock();
+	} while (nr_irqs != kstat_cpu_irqs_sum(cpu));
 	pr_info("kallsyms_on_each_match_symbol() traverse all: %lld ns\n", t1 - t0);
 }