[linux-next,RFC] powerpc: fix HOTPLUG error in rcutorture

Message ID 20221010023315.98396-1-zhouzhouyi@gmail.com (mailing list archive)
State New, archived

Commit Message

Zhouyi Zhou Oct. 10, 2022, 2:33 a.m. UTC
I think the torture tests should avoid offlining the CPU that handles
the tick timer when nohz_full is running.

Tested on a PPC VM at the Open Source Lab of Oregon State University.
The test results show that the success rate of rcutorture improves
after the fix.
After:
Successes: 40 Failures: 9
Before:
Successes: 38 Failures: 11

I examined the console.log and Make.out files one by one; the above fix
introduces no new compile errors or test errors.

Signed-off-by: Zhouyi Zhou <zhouzhouyi@gmail.com>
---
Dear PPC developers

I found this bug while running rcutorture tests in a ppc VM at the
Open Source Lab of Oregon State University:

ubuntu@ubuntu:~/linux-next/tools/testing/selftests/rcutorture/res/2022.09.30-01.06.22-torture$ find . -name "console.log.diags"|xargs grep HOTPLUG
./results-scftorture/NOPREEMPT/console.log.diags:WARNING: HOTPLUG FAILURES NOPREEMPT
./results-rcutorture/TASKS03/console.log.diags:WARNING: HOTPLUG FAILURES TASKS03
./results-rcutorture/TREE04/console.log.diags:WARNING: HOTPLUG FAILURES TREE04
./results-scftorture-kasan/NOPREEMPT/console.log.diags:WARNING: HOTPLUG FAILURES NOPREEMPT
./results-rcutorture-kasan/TASKS03/console.log.diags:WARNING: HOTPLUG FAILURES TASKS03
./results-rcutorture-kasan/TREE04/console.log.diags:WARNING: HOTPLUG FAILURES TREE04

I tried to fix this bug.

Thanks for your patience and guidance ;-)

Thanks 
Zhouyi
--
 arch/powerpc/kernel/sysfs.c | 8 +++++++-
 1 file changed, 7 insertions(+), 1 deletion(-)

Comments

Michael Ellerman Oct. 10, 2022, 11:21 a.m. UTC | #1
Zhouyi Zhou <zhouzhouyi@gmail.com> writes:
> I think the torture tests should avoid offlining the CPU that handles
> the tick timer when nohz_full is running.

Can you tell us what the bug you're fixing is?

Did you see a crash/oops/hang etc? Or are you just proposing this as
something that would be a good idea?

> diff --git a/arch/powerpc/kernel/sysfs.c b/arch/powerpc/kernel/sysfs.c
> index ef9a61718940..be9c0e45337e 100644
> --- a/arch/powerpc/kernel/sysfs.c
> +++ b/arch/powerpc/kernel/sysfs.c
> @@ -4,6 +4,7 @@
>  #include <linux/smp.h>
>  #include <linux/percpu.h>
>  #include <linux/init.h>
> +#include <linux/tick.h>
>  #include <linux/sched.h>
>  #include <linux/export.h>
>  #include <linux/nodemask.h>
> @@ -21,6 +22,7 @@
>  #include <asm/firmware.h>
>  #include <asm/idle.h>
>  #include <asm/svm.h>
> +#include "../../../kernel/time/tick-internal.h"
  
Needing to include this internal header is a sign that we are using the
wrong API or otherwise using time keeping internals we shouldn't be.

>  #include "cacheinfo.h"
>  #include "setup.h"
> @@ -1151,7 +1153,11 @@ static int __init topology_init(void)
>  		 * CPU.  For instance, the boot cpu might never be valid
>  		 * for hotplugging.
>  		 */
> -		if (smp_ops && smp_ops->cpu_offline_self)
> +		if (smp_ops && smp_ops->cpu_offline_self
> +#ifdef CONFIG_NO_HZ_FULL
> +		    && !(tick_nohz_full_running && tick_do_timer_cpu == cpu)
> +#endif
> +		    )

I can't see any other arches doing anything like this. I don't think
it's the arch's responsibility.

If the time keeping core needs a CPU to stay online to run the timer
then it needs to organise that itself IMHO :)

cheers

>  			c->hotpluggable = 1;
>  #endif
>  
> -- 
> 2.25.1
Zhouyi Zhou Oct. 11, 2022, 1:59 a.m. UTC | #2
Thanks Michael for reviewing my patch

On Mon, Oct 10, 2022 at 7:21 PM Michael Ellerman <mpe@ellerman.id.au> wrote:
>
> Zhouyi Zhou <zhouzhouyi@gmail.com> writes:
> > I think the torture tests should avoid offlining the CPU that handles
> > the tick timer when nohz_full is running.
>
> Can you tell us what the bug you're fixing is?
>
> Did you see a crash/oops/hang etc? Or are you just proposing this as
> something that would be a good idea?
Sorry for the trouble and inconvenience I have caused the community.
I did not make myself clear in my patch.
The full story is as follows:
1) cd linux-next
2) ./tools/testing/selftests/rcutorture/bin/torture.sh
after 19 hours ;-)
3) tail  ./tools/testing/selftests/rcutorture/res/2022.09.30-01.06.22-torture/results-scftorture/NOPREEMPT/console.log

[  121.449268][   T57] scftorture:  scf_invoked_count VER: 2415215
resched: 697463 single: 619512/619760 single_ofl: 255751/256554
single_rpc: 620692 single_rpc_ofl: 0 many: 155476/154658 all:
77282/76988 onoff: 3/3:5/6 18,25:9,28 63:93 (HZ=100) ste: 0 stnmie: 0
stnmoe: 0 staf: 0
[  121.454485][   T57] scftorture: --- End of test: LOCK_HOTPLUG:
verbose=1 holdoff=10 longwait=0 nthreads=4 onoff_holdoff=30
onoff_interval=1000 shutdown_secs=1 stat_interval=15 stutter=5
use_cpus_read_lock=0, weight_resched=-1, weight_single=-1,
weight_single_rpc=-1, weight_single_wait=-1, weight_many=-1,
weight_many_wait=-1, weight_all=-1, weight_all_wait=-1
[  121.469305][   T57] reboot: Power down

I see "End of test: LOCK_HOTPLUG", which means that the function
torture_offline() in kernel/torture.c failed to bring down the CPU.
4) I then traced the cause down to tick_nohz_cpu_down():
if (tick_nohz_full_running && tick_do_timer_cpu == cpu)
      return -EBUSY;
5) I created the above patch.
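
The failure path in steps 3) and 4) can be modelled in user space as
follows. This is only a sketch, not kernel source: the two variables
reuse the kernel names quoted above with assumed values, while
model_cpu_down(), model_torture_offline() and struct onoff_stats are
hypothetical stand-ins for the hotplug core and the torture code.

```c
#include <assert.h>
#include <errno.h>
#include <stdbool.h>

/* Stand-ins for the kernel state named above (values assumed for the demo). */
static bool tick_nohz_full_running = true;
static int tick_do_timer_cpu = 0;

/* Mirrors the quoted check: the CPU that owns the tick duty refuses to go down. */
int model_tick_nohz_cpu_down(int cpu)
{
	if (tick_nohz_full_running && tick_do_timer_cpu == cpu)
		return -EBUSY;
	return 0;
}

/* Hypothetical hotplug entry point: a callback error aborts the offline. */
int model_cpu_down(int cpu)
{
	return model_tick_nohz_cpu_down(cpu);
}

/* Hypothetical torture-side bookkeeping of offline attempts and successes. */
struct onoff_stats {
	long attempts;
	long successes;
};

/* Hypothetical wrapper: returns true only if the CPU really went offline. */
bool model_torture_offline(int cpu, struct onoff_stats *st)
{
	st->attempts++;
	if (model_cpu_down(cpu) != 0)
		return false;
	st->successes++;
	return true;
}

/* Small self-check: CPU 0 is refused, CPU 1 goes down, so the counts diverge. */
void model_selfcheck(void)
{
	struct onoff_stats st = { 0, 0 };

	assert(!model_torture_offline(0, &st));
	assert(model_torture_offline(1, &st));
	assert(st.attempts == 2 && st.successes == 1);
}
```

With the assumed values, an offline request for CPU 0 is refused with
-EBUSY while CPU 1 goes down normally, so attempts and successes no
longer match. That mismatch is presumably what the LOCK_HOTPLUG line
reports.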
>
> > diff --git a/arch/powerpc/kernel/sysfs.c b/arch/powerpc/kernel/sysfs.c
> > index ef9a61718940..be9c0e45337e 100644
> > --- a/arch/powerpc/kernel/sysfs.c
> > +++ b/arch/powerpc/kernel/sysfs.c
> > @@ -4,6 +4,7 @@
> >  #include <linux/smp.h>
> >  #include <linux/percpu.h>
> >  #include <linux/init.h>
> > +#include <linux/tick.h>
> >  #include <linux/sched.h>
> >  #include <linux/export.h>
> >  #include <linux/nodemask.h>
> > @@ -21,6 +22,7 @@
> >  #include <asm/firmware.h>
> >  #include <asm/idle.h>
> >  #include <asm/svm.h>
> > +#include "../../../kernel/time/tick-internal.h"
>
> Needing to include this internal header is a sign that we are using the
> wrong API or otherwise using time keeping internals we shouldn't be.
Yes, when I did this, I suspected there was something wrong with my patch.
>
> >  #include "cacheinfo.h"
> >  #include "setup.h"
> > @@ -1151,7 +1153,11 @@ static int __init topology_init(void)
> >                * CPU.  For instance, the boot cpu might never be valid
> >                * for hotplugging.
> >                */
> > -             if (smp_ops && smp_ops->cpu_offline_self)
> > +             if (smp_ops && smp_ops->cpu_offline_self
> > +#ifdef CONFIG_NO_HZ_FULL
> > +                 && !(tick_nohz_full_running && tick_do_timer_cpu == cpu)
> > +#endif
> > +                 )
>
> I can't see any other arches doing anything like this. I don't think
> it's the arch's responsibility.
Agree!

x86 seems to disable hotplug of CPU0 by default, while
tick_do_timer_cpu has a default value of 0.

#ifdef CONFIG_BOOTPARAM_HOTPLUG_CPU0
static int cpu0_hotpluggable = 1;
#else
static int cpu0_hotpluggable;
static int __init enable_cpu0_hotplug(char *str)
{
        cpu0_hotpluggable = 1;
        return 1;
}

__setup("cpu0_hotplug", enable_cpu0_hotplug);
#endif

I need more time to understand the relationship between x86's
cpu0_hotpluggable and tick_do_timer_cpu, but
I also tend to think that this is the timekeeping mechanism's responsibility.
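
To illustrate the x86 observation, here is a user-space sketch. It
assumes (not verified here) that x86 marks a CPU hotpluggable when its
number is nonzero or cpu0_hotpluggable is set, and it models
tick_do_timer_cpu with the default value 0 stated above. The
model_*() functions are hypothetical names, not kernel APIs.

```c
#include <assert.h>
#include <stdbool.h>

/* Default when CONFIG_BOOTPARAM_HOTPLUG_CPU0 is off and "cpu0_hotplug" is absent. */
static int cpu0_hotpluggable;

/* Modelled owner of the tick duty, using the default value stated in the text. */
static int tick_do_timer_cpu;

/* Assumed x86 rule: every CPU except CPU0 is hotpluggable unless CPU0 opts in. */
bool model_hotpluggable(int cpu)
{
	return cpu != 0 || cpu0_hotpluggable;
}

/* True if an offline request could ever target the CPU holding the tick duty. */
bool model_offline_can_hit_tick_cpu(void)
{
	return model_hotpluggable(tick_do_timer_cpu);
}

/* Self-check: only opting CPU0 in exposes the tick CPU to offline requests. */
void model_x86_selfcheck(void)
{
	assert(!model_offline_can_hit_tick_cpu());
	cpu0_hotpluggable = 1;
	assert(model_offline_can_hit_tick_cpu());
	cpu0_hotpluggable = 0;
}
```

Under these assumptions the torture test on x86 would never be offered
CPU0 for offlining by default, which may be why the -EBUSY path is not
hit there.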


>
> If the time keeping core needs a CPU to stay online to run the timer
> then it needs to organise that itself IMHO :)

I am going to submit a patch to the timekeeping community sometime
next month ;-)

Thanks again
Cheers
Zhouyi
>
> cheers
>
> >                       c->hotpluggable = 1;
> >  #endif
> >
> > --
> > 2.25.1
Zhouyi Zhou Nov. 13, 2022, 2:35 a.m. UTC | #3
Hi,
I have also reproduced the same phenomenon on RISC-V:
[  120.156380] scftorture: --- End of test: LOCK_HOTPLUG

So I guess it is not the arch's responsibility.
I am very interested in it ;-)

Thank you both for your guidance!
Cheers
Zhouyi

Patch

diff --git a/arch/powerpc/kernel/sysfs.c b/arch/powerpc/kernel/sysfs.c
index ef9a61718940..be9c0e45337e 100644
--- a/arch/powerpc/kernel/sysfs.c
+++ b/arch/powerpc/kernel/sysfs.c
@@ -4,6 +4,7 @@ 
 #include <linux/smp.h>
 #include <linux/percpu.h>
 #include <linux/init.h>
+#include <linux/tick.h>
 #include <linux/sched.h>
 #include <linux/export.h>
 #include <linux/nodemask.h>
@@ -21,6 +22,7 @@ 
 #include <asm/firmware.h>
 #include <asm/idle.h>
 #include <asm/svm.h>
+#include "../../../kernel/time/tick-internal.h"
 
 #include "cacheinfo.h"
 #include "setup.h"
@@ -1151,7 +1153,11 @@  static int __init topology_init(void)
 		 * CPU.  For instance, the boot cpu might never be valid
 		 * for hotplugging.
 		 */
-		if (smp_ops && smp_ops->cpu_offline_self)
+		if (smp_ops && smp_ops->cpu_offline_self
+#ifdef CONFIG_NO_HZ_FULL
+		    && !(tick_nohz_full_running && tick_do_timer_cpu == cpu)
+#endif
+		    )
 			c->hotpluggable = 1;
 #endif