Message ID | CANOLnOODjTaBcL1QzAm7o4YOB=_P-s7JYovu6fhNSqJSV2Bq+Q@mail.gmail.com (mailing list archive) |
---|---|
State | New, archived |
* Grazvydas Ignotas <notasas@gmail.com> [150908 13:44]:
> On Tue, Sep 8, 2015 at 4:38 PM, Tony Lindgren <tony@atomide.com> wrote:
> > * Grazvydas Ignotas <notasas@gmail.com> [150908 05:50]:
> >> Hi,
> >>
> >> this is a longstanding problem I'm seeing since the very beginning,
> >> which was around 3.12 or so (when I've first got the hardware) and it
> >> seems 4.2 is affected by it still. Basically what happens is Xorg
> >> randomly segfaults at some "impossible" location. I don't have the
> >> details at the moment (could get them is needed), but from what I
> >> examined with gdb some time ago the situation did not make any sense.
> >>
> >> There are 2 workarounds that I know which make the problem go away
> >> (one is enough):
> >> - recompile Xorg with -marm (I'm using Debian armhf so it's thumb2 by default)
> >> - disable ARCH_MULTI_V6 in the kernel config
> >>
> >> Because of the above workarounds I have forgotten about it several
> >> times, but it regularly comes back and bites again. It would look like
> >> some missing erratum workaround, but I have all of them enabled in the
> >> kernel.
> >>
> >> Does anyone know about this? Perhaps some missing erratum workaround
> >> in the bootloader? u-boot isn't too old here (2015.07).
> >
> > Seems like some incorrect handling with CONFIG_CPU_V6 compiled in..
> > Maybe try to narrow it down by commenting out some CONFIG_CPU_V6 and
> > __LINUX_ARM_ARCH__ = 6 ifdefs in the git grep CONFIG_CPU_V6
> > places ignoring uncompress and davinci code.
>
> ok with that it was quite easy to find. On a kernel with ARCH_MULTI_V6
> disabled, it is enough to just do this:
>
> --- a/arch/arm/kernel/signal.c
> +++ b/arch/arm/kernel/signal.c
> @@ -340,13 +340,13 @@ setup_return(struct pt_regs *regs, struct ksignal *ksig,
>          /*
>           * The LSB of the handler determines if we're going to
>           * be using THUMB or ARM mode for this signal handler.
>           */
>          thumb = handler & 1;
>
> -#if __LINUX_ARM_ARCH__ >= 7
> +#if 0 //__LINUX_ARM_ARCH__ >= 7
>          /*
>           * Clear the If-Then Thumb-2 execution state
>           * ARM spec requires this to be all 000s in ARM mode
>           * Snapdragon S4/Krait misbehaves on a Thumb=>ARM
>           * signal transition without this.
>           */
>
> ... and the problem appears, so I guess this needs some real
> multiplatform handling,.

OK nice to hear you found it. Yeah looks like some runtime
capability check is needed.

> > Do you have some easy way to reproduce this issue?
>
> Just moving a browser window around with mouse usually triggers it
> within a minute.

OK good to know.

Regards,

Tony
On 08.09.2015 at 23:07, Tony Lindgren <tony@atomide.com> wrote:

> * Grazvydas Ignotas <notasas@gmail.com> [150908 13:44]:
>> ...
>> ok with that it was quite easy to find. On a kernel with ARCH_MULTI_V6
>> disabled, it is enough to just do this:
>>
>> --- a/arch/arm/kernel/signal.c
>> +++ b/arch/arm/kernel/signal.c
>> @@ -340,13 +340,13 @@ setup_return(struct pt_regs *regs, struct ksignal *ksig,
>> ...
>> -#if __LINUX_ARM_ARCH__ >= 7
>> +#if 0 //__LINUX_ARM_ARCH__ >= 7
>> ...
>> ... and the problem appears, so I guess this needs some real
>> multiplatform handling,.
>
> OK nice to hear you found it. Yeah looks like some runtime
> capability check is needed.
>
>>> Do you have some easy way to reproduce this issue?
>>
>> Just moving a browser window around with mouse usually triggers it
>> within a minute.
>
> OK good to know.

It looks as if this is the solution for the same symptom on our OMAP3 board (gta04).
There, it suffices to draw on the touch screen for ~10 seconds to make the xserver segfault.

[we are using the binary xserver from debian wheezy
ii xserver-xorg-core 2:1.12.4-6+deb7u5 armhf Xorg X server - core server]

We have known about this bug for a while, but so far thought that some touch screen
event bit had changed and that we would have to fix our touch screen driver.

Now, disabling CONFIG_ARCH_MULTI_V6 also makes the bug go away and adding the
>> #if 0 //__LINUX_ARM_ARCH__ >= 7
makes it re-appear.

A while ago I tried to debug by running the x-server under strace and could find that it also has
something to do with SIGALRM.

And that is very consistent with “enable/disable” by modifying arch/arm/kernel/signal.c

BR,
Nikolaus
On Thu, Sep 10, 2015 at 08:42:57AM +0200, Dr. H. Nikolaus Schaller wrote:
> ...
>
> Now, disabling CONFIG_ARCH_MULTI_V6 also makes the bug go away and adding the
> >> #if 0 //__LINUX_ARM_ARCH__ >= 7
> makes it re-appear.
>
> A while ago I tried to debug running the x-server under strace and could find that it also has
> something to do with SIGALRM.
>
> And that is very consistent with “enable/disable” by modifying arch/arm/kernel/signal.c

It would be really nice if someone could diagnose what's going on here.
What exception is causing the X server to be killed (someone said a
segfault)? What is the register state at the point that happens? What
does the code look like? Is it happening inside the SIGALRM handler, or
when the SIGALRM handler has returned?

I'd suggest attaching gdb to the X server, but remember to set gdb to
ignore SIGPIPEs.
On 10.09.2015 at 10:30, Russell King - ARM Linux <linux@arm.linux.org.uk> wrote:

> On Thu, Sep 10, 2015 at 08:42:57AM +0200, Dr. H. Nikolaus Schaller wrote:
>> ...
>> And that is very consistent with “enable/disable” by modifying arch/arm/kernel/signal.c
>
> It would be really nice if someone could diagnose what's going on here.
> What exception is causing the X server to be killed (someone said a
> segfault)? What is the register state at the point that happens? What
> does the code look like Is it happening inside the SIGALRM handler, or
> when the SIGALRM handler has returned?
>
> I'd suggest attaching gdb to the X server, but remember to set gdb to
> ignore SIGPIPEs.

I don’t have a setup to run gdb (with source) on the device and really zero experience
with Xserver sources. But maybe Grazvydas can do that better than me.

Attached is some strace I had recorded during my earlier experiments.

X-Server appears to heavily use not only SIGALRM but also SIGIO.

And it looks as if a SEGFAULT appears inside the SIGIO handler after having
done 3 syscalls (select, read, clock_gettime) but before the sigreturn. At least in
this example.

Xserver then does a graceful shutdown after SEGFAULT. I.e. it prints the segfault
message by itself.

Hope this is a useful piece to solve the puzzle and helps a little.

BR,
Nikolaus

…
--- SIGALRM (Alarm clock) @ 0 (0) ---
--- SIGIO (I/O possible) @ 0 (0) ---
select(12, [9 10 11], NULL, NULL, {0, 0}) = 1 (in [9], left {0, 0})
read(9, ";\230\353T^\351\n\0\3\0\0\0:\4\0\0;\230\353T^\351\n\0\3\0\1\0=\7\0\0"..., 256) = 64
clock_gettime(CLOCK_MONOTONIC, {7330, 494831541}) = 0
sigreturn() = ? (mask now [ILL ABRT KILL USR1 SEGV PIPE TERM STKFLT CHLD STOP TSTP TTIN XFSZ VTALRM PROF IO PWR RTMIN])
sigreturn() = ? (mask now [])
setitimer(ITIMER_REAL, {it_interval={0, 0}, it_value={0, 0}}, NULL) = 0
select(256, [1 3 4 5 12 13 14 15 16 19], NULL, NULL, {0, 0}) = 1 (in [19], left {0, 0})
clock_gettime(CLOCK_MONOTONIC, {7330, 499042967}) = 0
setitimer(ITIMER_REAL, {it_interval={0, 20000}, it_value={0, 20000}}, NULL) = 0
clock_gettime(CLOCK_MONOTONIC, {7330, 500050047}) = 0
clock_gettime(CLOCK_MONOTONIC, {7330, 501911619}) = 0
--- SIGIO (I/O possible) @ 0 (0) ---
select(12, [9 10 11], NULL, NULL, {0, 0}) = 1 (in [9], left {0, 0})
read(9, ";\230\353Tw\20\v\0\3\0\0\0h\4\0\0;\230\353Tw\20\v\0\3\0\1\0\256\7\0\0"..., 256) = 64
clock_gettime(CLOCK_MONOTONIC, {7330, 504536131}) = 0
sigreturn() = ? (mask now [HUP QUIT ILL])
clock_gettime(CLOCK_MONOTONIC, {7330, 506275633}) = 0
clock_gettime(CLOCK_MONOTONIC, {7330, 506855467}) = 0
clock_gettime(CLOCK_MONOTONIC, {7330, 507587889}) = 0
clock_gettime(CLOCK_MONOTONIC, {7330, 508442381}) = 0
clock_gettime(CLOCK_MONOTONIC, {7330, 508961180}) = 0
clock_gettime(CLOCK_MONOTONIC, {7330, 509418943}) = 0
clock_gettime(CLOCK_MONOTONIC, {7330, 509998777}) = 0
clock_gettime(CLOCK_MONOTONIC, {7330, 511860350}) = 0
--- SIGIO (I/O possible) @ 0 (0) ---
select(12, [9 10 11], NULL, NULL, {0, 0}) = 1 (in [9], left {0, 0})
read(9, ";\230\353TT7\v\0\3\0\0\0\242\4\0\0;\230\353TT7\v\0\3\0\1\0\367\7\0\0"..., 256) = 64
clock_gettime(CLOCK_MONOTONIC, {7330, 514484861}) = 0
sigreturn() = ? (mask now [])
clock_gettime(CLOCK_MONOTONIC, {7330, 516224363}) = 0
clock_gettime(CLOCK_MONOTONIC, {7330, 516743162}) = 0
clock_gettime(CLOCK_MONOTONIC, {7330, 517200926}) = 0
clock_gettime(CLOCK_MONOTONIC, {7330, 517719725}) = 0
clock_gettime(CLOCK_MONOTONIC, {7330, 518452147}) = 0
clock_gettime(CLOCK_MONOTONIC, {7330, 519367674}) = 0
clock_gettime(CLOCK_MONOTONIC, {7330, 519947508}) = 0
--- SIGALRM (Alarm clock) @ 0 (0) ---
sigreturn() = ? (mask now [])
--- SIGIO (I/O possible) @ 0 (0) ---
select(12, [9 10 11], NULL, NULL, {0, 0}) = 1 (in [9], left {0, 0})
read(9, ";\230\353Tn^\v\0\3\0\0\0\370\4\0\0;\230\353Tn^\v\0\3\0\1\0y\10\0\0"..., 256) = 64
clock_gettime(CLOCK_MONOTONIC, {7330, 525074461}) = 0
sigreturn() = ? (mask now [])
setitimer(ITIMER_REAL, {it_interval={0, 0}, it_value={0, 0}}, NULL) = 0
select(256, [1 3 4 5 12 13 14 15 16 19], NULL, NULL, {0, 0}) = 1 (in [19], left {0, 0})
clock_gettime(CLOCK_MONOTONIC, {7330, 528400877}) = 0
setitimer(ITIMER_REAL, {it_interval={0, 20000}, it_value={0, 20000}}, NULL) = 0
clock_gettime(CLOCK_MONOTONIC, {7330, 529377440}) = 0
clock_gettime(CLOCK_MONOTONIC, {7330, 530018309}) = 0
clock_gettime(CLOCK_MONOTONIC, {7330, 531910399}) = 0
--- SIGIO (I/O possible) @ 0 (0) ---
select(12, [9 10 11], NULL, NULL, {0, 0}) = 1 (in [9], left {0, 0})
read(9, ";\230\353T\246\205\v\0\3\0\0\0V\5\0\0;\230\353T\246\205\v\0\3\0\1\0\336\10\0\0"..., 256) = 64
clock_gettime(CLOCK_MONOTONIC, {7330, 534534910}) = 0
sigreturn() = ? (mask now [HUP QUIT ILL])
writev(20, [{"\6\0T\3\256\332o\0\345\0\0\0\3\0\0\1\0\0\0\0h\0\377\0h\0\377\0\0\1\1\0"..., 224}], 1) = 224
clock_gettime(CLOCK_MONOTONIC, {7330, 542164305}) = 0
--- SIGIO (I/O possible) @ 0 (0) ---
select(12, [9 10 11], NULL, NULL, {0, 0}) = 1 (in [9], left {0, 0})
read(9, ";\230\353TX\255\v\0\3\0\0\0\317\5\0\0;\230\353TX\255\v\0\3\0\1\0T\t\0\0"..., 256) = 64
clock_gettime(CLOCK_MONOTONIC, {7330, 546253660}) = 0
sigreturn() = ? (mask now [HUP QUIT ILL])
read(20, "5\20\4\0\236\0\0\1\3\0\0\1\33\1\257\0\224\4\6\0\237\0\0\1\236\0\0\1)\0\0\0"..., 4096) = 1088
clock_gettime(CLOCK_MONOTONIC, {7330, 548756102}) = 0
clock_gettime(CLOCK_MONOTONIC, {7330, 549366453}) = 0
--- SIGALRM (Alarm clock) @ 0 (0) ---
sigreturn() = ? (mask now [HUP QUIT ILL])
--- SIGIO (I/O possible) @ 0 (0) ---
select(12, [9 10 11], NULL, NULL, {0, 0}) = 1 (in [9], left {0, 0})
read(9, ";\230\353T\273\323\v\0\3\0\0\0K\6\0\0;\230\353T\273\323\v\0\3\0\1\0\314\t\0\0"..., 256) = 64
clock_gettime(CLOCK_MONOTONIC, {7330, 554707029}) = 0
sigreturn() = ? (mask now [HUP QUIT ILL])
setitimer(ITIMER_REAL, {it_interval={0, 0}, it_value={0, 0}}, NULL) = 0
select(256, [1 3 4 5 12 13 14 15 16 19], NULL, NULL, {0, 0}) = 1 (in [19], left {0, 0})
clock_gettime(CLOCK_MONOTONIC, {7330, 558155516}) = 0
setitimer(ITIMER_REAL, {it_interval={0, 20000}, it_value={0, 20000}}, NULL) = 0
clock_gettime(CLOCK_MONOTONIC, {7330, 559132078}) = 0
clock_gettime(CLOCK_MONOTONIC, {7330, 560749510}) = 0
--- SIGIO (I/O possible) @ 0 (0) ---
select(12, [9 10 11], NULL, NULL, {0, 0}) = 1 (in [9], left {0, 0})
read(9, ";\230\353T\325\372\v\0\3\0\0\0\326\6\0\0;\230\353T\325\372\v\0\3\0\1\0:\n\0\0"..., 256) = 64
clock_gettime(CLOCK_MONOTONIC, {7330, 564564207}) = 0
--- SIGSEGV (Segmentation fault) @ 0 (0) ---
write(2, "\n", 1 ) = 1
clock_gettime(CLOCK_MONOTONIC, {7330, 565968016}) = 0
write(0, "[ 7330.565] ", 13) = 13
write(0, "\n", 1) = 1
write(2, "Backtrace:\n", 11Backtrace: ) = 11
clock_gettime(CLOCK_MONOTONIC, {7330, 568195799}) = 0
write(0, "[ 7330.568] ", 13) = 13
write(0, "Backtrace:\n", 11) = 11
write(2, "\n", 1 ) = 1
clock_gettime(CLOCK_MONOTONIC, {7330, 571125486}) = 0
write(0, "[ 7330.571] ", 13) = 13
write(0, "\n", 1) = 1
futex(0xb6c587d0, FUTEX_WAKE_PRIVATE, 2147483647) = 0
write(2, "Segmentation fault at address (n"..., 36Segmentation fault at address (nil) ) = 36
clock_gettime(CLOCK_MONOTONIC, {7330, 575092772}) = 0
write(0, "[ 7330.575] ", 13) = 13
write(0, "Segmentation fault at address (n"..., 36) = 36
write(2, "\nFatal server error:\n", 21 Fatal server error: ) = 21
clock_gettime(CLOCK_MONOTONIC, {7330, 577412108}) = 0
write(0, "[ 7330.577] ", 13) = 13
write(0, "\nFatal server error:\n", 21) = 21
write(2, "Caught signal 11 (Segmentation f"..., 55Caught signal 11 (Segmentation fault). Server aborting ) = 55
--- SIGALRM (Alarm clock) @ 0 (0) ---
sigreturn() = ? (mask now [ABRT BUS FPE USR1 SEGV USR2 ALRM STKFLT CHLD CONT TTIN TTOU URG XCPU VTALRM PROF WINCH IO PWR RTMIN])
clock_gettime(CLOCK_MONOTONIC, {7330, 582752684}) = 0
write(0, "[ 7330.582] ", 13) = 13
write(0, "Caught signal 11 (Segmentation f"..., 55) = 55
write(2, "\n", 1 ) = 1
clock_gettime(CLOCK_MONOTONIC, {7330, 585041502}) = 0
write(0, "[ 7330.585] ", 13) = 13
write(0, "\n", 1) = 1
write(2, "\nPlease consult the The X.Org Fo"..., 85 Please consult the The X.Org Foundation support at http://wiki.x.org for help. ) = 85
clock_gettime(CLOCK_MONOTONIC, {7330, 587208250}) = 0
write(0, "[ 7330.587] ", 13) = 13
write(0, "\nPlease consult the The X.Org Fo"..., 85) = 85
write(2, "Please also check the log file a"..., 84Please also check the log file at "/var/log/Xorg.0.log" for additional information. ) = 84
clock_gettime(CLOCK_MONOTONIC, {7330, 589466551}) = 0
write(0, "[ 7330.589] ", 13) = 13
write(0, "Please also check the log file a"..., 84) = 84
write(2, "\n", 1 ) = 1
clock_gettime(CLOCK_MONOTONIC, {7330, 593525389}) = 0
write(0, "[ 7330.593] ", 13) = 13
write(0, "\n", 1) = 1
close(1) = 0
close(3) = 0
close(4) = 0
close(5) = 0
unlink("/tmp/.X11-unix/X0") = 0
unlink("/tmp/.X0-lock") = 0
rt_sigprocmask(SIG_BLOCK, [ALRM CHLD TSTP TTIN TTOU VTALRM WINCH IO], [SEGV IO], 8) = 0
clock_gettime(CLOCK_MONOTONIC, {7330, 599567869}) = 0
clock_gettime(CLOCK_MONOTONIC, {7330, 601948240}) = 0
clock_gettime(CLOCK_MONOTONIC, {7330, 603168943}) = 0
clock_gettime(CLOCK_MONOTONIC, {7330, 604145506}) = 0
fcntl64(9, F_GETFL) = 0x2802 (flags O_RDWR|O_NONBLOCK|O_ASYNC)
fcntl64(9, F_SETFL, O_RDWR|O_NONBLOCK) = 0
fcntl64(9, F_GETFD) = 0
close(9) = 0
clock_gettime(CLOCK_MONOTONIC, {7330, 606983641}) = 0
clock_gettime(CLOCK_MONOTONIC, {7330, 608509520}) = 0
write(0, "[ 7330.608] ", 13) = 13
write(0, "(II) evdev: Touchscreen: Close\n", 31) = 31
clock_gettime(CLOCK_MONOTONIC, {7330, 610798338}) = 0
clock_gettime(CLOCK_MONOTONIC, {7330, 611408690}) = 0
write(0, "[ 7330.611] ", 13) = 13
write(0, "(II) UnloadModule: \"evdev\"\n", 27) = 27
clock_gettime(CLOCK_MONOTONIC, {7330, 613361815}) = 0
clock_gettime(CLOCK_MONOTONIC, {7330, 614368895}) = 0
clock_gettime(CLOCK_MONOTONIC, {7330, 615009764}) = 0
clock_gettime(CLOCK_MONOTONIC, {7330, 615986326}) = 0
fcntl64(10, F_GETFL) = 0x2802 (flags O_RDWR|O_NONBLOCK|O_ASYNC)
fcntl64(10, F_SETFL, O_RDWR|O_NONBLOCK) = 0
fcntl64(10, F_GETFD) = 0
close(10) = 0
clock_gettime(CLOCK_MONOTONIC, {7330, 618336180}) = 0
clock_gettime(CLOCK_MONOTONIC, {7330, 619007567}) = 0
write(0, "[ 7330.619] ", 13) = 13
write(0, "(II) evdev: Power Button: Close\n", 32) = 32
clock_gettime(CLOCK_MONOTONIC, {7330, 621601561}) = 0
clock_gettime(CLOCK_MONOTONIC, {7330, 622181395}) = 0
write(0, "[ 7330.622] ", 13) = 13
write(0, "(II) UnloadModule: \"evdev\"\n", 27) = 27
fcntl64(11, F_GETFL) = 0x2802 (flags O_RDWR|O_NONBLOCK|O_ASYNC)
fcntl64(11, F_SETFL, O_RDWR|O_NONBLOCK) = 0
fcntl64(11, F_GETFD) = 0
rt_sigaction(SIGIO, {SIG_IGN, [IO], 0x4000000 /* SA_??? */}, {0xb6f0d63d, [IO], 0x4000000 /* SA_??? */}, 8) = 0
close(11) = 0
clock_gettime(CLOCK_MONOTONIC, {7330, 626606443}) = 0
clock_gettime(CLOCK_MONOTONIC, {7330, 627308348}) = 0
write(0, "[ 7330.627] ", 13) = 13
write(0, "(II) evdev: AUX Button: Close\n", 30) = 30
clock_gettime(CLOCK_MONOTONIC, {7330, 629261473}) = 0
clock_gettime(CLOCK_MONOTONIC, {7330, 629810789}) = 0
write(0, "[ 7330.629] ", 13) = 13
write(0, "(II) UnloadModule: \"evdev\"\n", 27) = 27
rt_sigprocmask(SIG_SETMASK, [SEGV IO], NULL, 8) = 0
--- SIGALRM (Alarm clock) @ 0 (0) ---
sigreturn() = ? (mask now [])
rt_sigprocmask(SIG_BLOCK, [IO], [SEGV IO], 8) = 0
clock_gettime(CLOCK_MONOTONIC, {7330, 634663084}) = 0
write(0, "[ 7330.634] ", 13) = 13
write(0, "(NI) OMAPFBLeaveVT\n", 19) = 19
ioctl(7, KDSETMODE, 0) = 0
--- SIGALRM (Alarm clock) @ 0 (0) ---
sigreturn() = ? (mask now [])
ioctl(7, KDSKBMODE, 0x3) = 0
ioctl(7, SNDCTL_TMR_TIMEBASE or TCGETS, {B38400 -opost -isig -icanon -echo ...}) = 0
ioctl(7, SNDCTL_TMR_START or TCSETS, {B38400 opost isig icanon echo ...}) = 0
ioctl(7, SNDCTL_TMR_TIMEBASE or TCGETS, {B38400 opost isig icanon echo ...}) = 0
ioctl(7, VIDIOC_RESERVED or VT_GETMODE, 0xbef3b348) = 0
ioctl(7, VIDIOC_ENUM_FMT or VT_SETMODE, 0xbef3b348) = 0
ioctl(7, VT_ACTIVATE, 0x1) = 0
ioctl(7, VT_WAITACTIVE, 0x1) = 0
close(7) = 0
write(2, "Server terminated with error (1)"..., 52Server terminated with error (1). Closing log file. ) = 52
clock_gettime(CLOCK_MONOTONIC, {7330, 655903318}) = 0
write(0, "[ 7330.655] ", 13) = 13
write(0, "Server terminated with error (1)"..., 52) = 52
close(0) = 0
rt_sigprocmask(SIG_BLOCK, [ALRM CHLD TSTP TTIN TTOU VTALRM WINCH IO], [SEGV IO], 8) = 0
rt_sigprocmask(SIG_UNBLOCK, [ABRT], NULL, 8) = 0
tgkill(4586, 4586, SIGABRT) = 0
--- SIGABRT (Aborted) @ 0 (0) ---
root@gta04:~#
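The trace shows ordinary code being interrupted by a 20 ms ITIMER_REAL. A minimal user-space sketch of that pattern follows. It is not a confirmed reproducer: `stress_signals()`, `sat_add()` and `on_alarm()` are hypothetical names, and whether the clamp is compiled into a Thumb-2 IT block depends on the compiler and flags.

```c
#define _XOPEN_SOURCE 700
#include <signal.h>
#include <stddef.h>
#include <stdint.h>
#include <string.h>
#include <sys/time.h>

/* Counts SIGALRM deliveries; written only by the handler. */
static volatile sig_atomic_t alarms_seen;

static void on_alarm(int sig)
{
    (void)sig;
    alarms_seen++;
}

/* Saturating 8-bit add. When built for Thumb-2 a compiler may (but need
 * not) turn the clamp into an IT block. */
static uint8_t sat_add(uint8_t a, uint8_t b)
{
    unsigned int sum = (unsigned int)a + b;
    return sum > 255 ? 255 : (uint8_t)sum;
}

/*
 * Hypothetical helper: keep clamping while a 20 ms ITIMER_REAL (the
 * interval seen in the trace) delivers SIGALRM, until `want` signals
 * have arrived. Returns 0 on success, -1 if setup failed.
 */
int stress_signals(int want)
{
    struct sigaction sa;
    struct itimerval tv;
    volatile uint8_t sink = 0;

    memset(&sa, 0, sizeof(sa));
    sa.sa_handler = on_alarm;
    sigemptyset(&sa.sa_mask);
    if (sigaction(SIGALRM, &sa, NULL) != 0)
        return -1;

    alarms_seen = 0;
    memset(&tv, 0, sizeof(tv));
    tv.it_interval.tv_usec = 20000;
    tv.it_value.tv_usec = 20000;
    if (setitimer(ITIMER_REAL, &tv, NULL) != 0)
        return -1;

    while (alarms_seen < want)
        for (unsigned int i = 0; i < 256; i++)
            sink = sat_add(sink, (uint8_t)i);

    /* Disarm the timer again. */
    memset(&tv, 0, sizeof(tv));
    setitimer(ITIMER_REAL, &tv, NULL);
    return 0;
}
```

On an unaffected system `stress_signals(3)` should simply return 0 after roughly 60 ms.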
> From: linux-arm-kernel [mailto:linux-arm-kernel-bounces@lists.infradead.org]
> On Behalf Of Russell King - ARM Linux

> > >>>> There are 2 workarounds that I know which make the problem go
> > >>>> away (one is enough):
> > >>>> - recompile Xorg with -marm (I'm using Debian armhf so it's
> > >>>> thumb2 by default)
> > >>>> - disable ARCH_MULTI_V6 in the kernel config

This reminds me of a customer crash I saw quite a while ago relating to thumb2. I thought it was fixed but maybe not.

In a couple spots the PSR_IT_MASK was not conditionally handled well in ARCH_MULTI_V6 flow. Some stack sanity check failed and a BUG() was triggered. Compiling the app for v6 or pulling MULTI from the kernel build solved the issue.

Additionally it was not handled correctly in GDB. The old build of GDB didn't do MULTI and needed a hack to be useable on thumb2 code.

Regards,
Richard W.
On Thu, Sep 10, 2015 at 10:30 AM, Russell King - ARM Linux
<linux@arm.linux.org.uk> wrote:
> On Thu, Sep 10, 2015 at 08:42:57AM +0200, Dr. H. Nikolaus Schaller wrote:
>> ...
>>
>> Now, disabling CONFIG_ARCH_MULTI_V6 also makes the bug go away and adding the
>> >> #if 0 //__LINUX_ARM_ARCH__ >= 7
>> makes it re-appear.
>>
>> A while ago I tried to debug running the x-server under strace and could find that it also has
>> something to do with SIGALRM.
>>
>> And that is very consistent with “enable/disable” by modifying arch/arm/kernel/signal.c
>
> It would be really nice if someone could diagnose what's going on here.
> What exception is causing the X server to be killed (someone said a
> segfault)? What is the register state at the point that happens? What
> does the code look like Is it happening inside the SIGALRM handler, or
> when the SIGALRM handler has returned?
>
> I'd suggest attaching gdb to the X server, but remember to set gdb to
> ignore SIGPIPEs.

It's actually pretty random, see some debug sessions in [1].
The first one is the most useful one, but I haven't thought of checking
what pixman_rasterize_edges() was doing when the signal arrived, and
most often the "less useful" segfaults occur. However from the
disassembly (see debug1_libpixman.gz) it can be seen that the signal
arrived right after IT.

[1] http://notaz.gp2x.de/tmp/thumb_segfault/

Gražvydas
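As background to "the signal arrived right after IT": the 8-bit If-Then state lives in the CPSR (IT[7:2] in bits 15:10, IT[1:0] in bits 26:25) and is stepped once per instruction inside the block. The sketch below models this in plain C, using an `ite le` block (ITSTATE 0xd4) as the example. The helper names are hypothetical, and `it_advance()` follows the architecture manual's ITAdvance() pseudocode as I understand it.

```c
#include <stdint.h>

/* IT[7:2] sit in CPSR[15:10], IT[1:0] in CPSR[26:25]. */
#define PSR_IT_MASK 0x0600fc00u

/* Pull the 8-bit ITSTATE out of a CPSR value. */
uint8_t itstate_from_cpsr(uint32_t cpsr)
{
    return (uint8_t)((((cpsr >> 10) & 0x3fu) << 2) | ((cpsr >> 25) & 0x3u));
}

/* Place an 8-bit ITSTATE into the corresponding CPSR bit positions. */
uint32_t itstate_to_cpsr(uint8_t it)
{
    return ((uint32_t)(it >> 2) << 10) | ((uint32_t)(it & 0x3u) << 25);
}

/* Condition code applied to the next instruction (valid if it != 0). */
uint8_t it_current_cond(uint8_t it)
{
    return (uint8_t)(it >> 4);
}

/* Step the state after one instruction of the IT block has executed. */
uint8_t it_advance(uint8_t it)
{
    if ((it & 0x07u) == 0)
        return 0;                        /* block finished */
    return (uint8_t)((it & 0xe0u) | ((it << 1) & 0x1fu));
}
```

For `ite le` the state starts at 0xd4 (condition 0xd, LE), advances to 0xc8 (condition 0xc, GT) and then to 0. If a signal handler is entered with such a state still in the CPSR, its first one or two instructions are executed under those conditions.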
On Fri, Sep 11, 2015 at 03:27:13PM +0200, Grazvydas Ignotas wrote:
> On Thu, Sep 10, 2015 at 10:30 AM, Russell King - ARM Linux
> <linux@arm.linux.org.uk> wrote:
> > ...
> > I'd suggest attaching gdb to the X server, but remember to set gdb to
> > ignore SIGPIPEs.
>
> It's actually pretty random, see some debug sessions in [1].
> The first one is the most useful one, but I haven't thought of checking
> what pixman_rasterize_edges() was doing when the signal arrived, and
> most often the "less useful" segfaults occur. However from the
> disassembly (see debug1_libpixman.gz) it can be seen that the signal
> arrived right after IT.
>
> [1] http://notaz.gp2x.de/tmp/thumb_segfault/

We're not going from ARM -> Thumb or Thumb -> ARM here, but Thumb code in
libpixman is being interrupted calling a Thumb signal handler.
Working through the code:

   0x7f717ec8 <SmartScheduleTimer>:     ldr     r2, [pc, #20]   ; = 0x0004112e
   0x7f717eca <SmartScheduleTimer+2>:   ldr     r1, [pc, #24]   ; = 0x00000c48
   0x7f717ecc <SmartScheduleTimer+4>:   ldr     r3, [pc, #24]   ; = 0x00000e6c
   0x7f717ece <SmartScheduleTimer+6>:   add     r2, pc
   0x7f717ed0 <SmartScheduleTimer+8>:   ldr     r1, [r2, r1]
   0x7f717ed2 <SmartScheduleTimer+10>:  ldr     r3, [r2, r3]
=> 0x7f717ed4 <SmartScheduleTimer+12>:  ldr     r2, [r1, #0]

The instruction at 0x7f717ed4 was trying to access 0xd1242963 which is
in kernel space, and this is the faulting instruction.

At this point, r2 should contain 0x0004112e plus the PC value. r2 in
the register dump was 0x7f717fa0. Let's calculate the value that PC should
be here. 0x7f717fa0 - 0x0004112e = 0x7f6d6e72, which is clearly wrong.

So, I don't think the first instruction here was executed by the CPU.

gdb indicates that the parent context to the signal frame, pc was at
0xb6dd87f8, which works out at 0x297f8 into the libpixman-1 library:

   297f0:       449c            add     ip, r3
   297f2:       f1bc 0fff       cmp.w   ip, #255        ; 0xff
   297f6:       bfd4            ite     le
   297f8:       fa5f fc8c       uxtble.w        ip, ip
   297fc:       f04f 0cff       movgt.w ip, #255        ; 0xff
   29800:       f88a c000       strb.w  ip, [sl]

and as you say, is just after an IT instruction, which would have set
the IT execution state to appropriately skip either the first or the
second instruction.

Unfortunately, the IT instruction's condition is being carried forward
to the signal handler, causing either the first or second instruction
there to be skipped.

Looking back at the history, the original commit introducing the clearing
of the PSR_IT_MASK bits is just wrong:

-       if (thumb)
+       if (thumb) {
                cpsr |= PSR_T_BIT;
-       else
+#if __LINUX_ARM_ARCH__ >= 7
+               /* clear the If-Then Thumb-2 execution state */
+               cpsr &= ~PSR_IT_MASK;
+#endif
+       } else
                cpsr &= ~PSR_T_BIT;

This shouldn't be a compile-time decision at all, and it certainly should
not be dependent on __LINUX_ARM_ARCH__, which marks the _lowest_ supported
architecture.
However, even the idea that it's ARMv7 or later is wrong. According to
the ARM ARM, the IT instruction is present in ARMv6T2 as well, which
means it's ARMv6 too (which would have __LINUX_ARM_ARCH__ = 6).

Looking at the ARM ARM, these bits are "reserved" in previous non-T2
architectures, have an undefined value at reset, and are probably zero
anyway.

Merely changing __LINUX_ARM_ARCH__ >= 7 to >= 6 should fix the problem,
and I doubt there are any ARMv6 non-T2 systems out there that would be
affected by clearing the IT state bits.
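The effect of clearing the IT bits on signal entry can be sketched outside the kernel. This is an illustration, not the kernel's setup_return(): `handler_entry_cpsr()` is a hypothetical helper, and it clears the bits unconditionally, which is an assumption of the sketch rather than the `>= 6` change suggested above. The two mask values match the kernel's ARM ptrace definitions as far as I know.

```c
#include <stdint.h>

#define PSR_T_BIT   0x00000020u  /* Thumb execution state */
#define PSR_IT_MASK 0x0600fc00u  /* If-Then execution state bits */

/*
 * Hypothetical helper: derive the CPSR a signal handler is entered with
 * from the interrupted context's CPSR and the handler address.
 */
uint32_t handler_entry_cpsr(uint32_t cpsr, uintptr_t handler)
{
    /* The LSB of the handler address selects Thumb or ARM state. */
    if (handler & 1)
        cpsr |= PSR_T_BIT;
    else
        cpsr &= ~PSR_T_BIT;

    /*
     * Assumption of this sketch: always drop any IT state inherited from
     * the interrupted code, so the handler's first instructions are not
     * executed conditionally.
     */
    cpsr &= ~PSR_IT_MASK;
    return cpsr;
}
```

With a user-mode Thumb context that was interrupted inside an IT block (for example CPSR 0x0000d430), a Thumb handler is entered with 0x00000030 and an ARM handler with 0x00000010.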
> From: linux-omap-owner@vger.kernel.org [mailto:linux-omap-
> owner@vger.kernel.org] On Behalf Of Russell King - ARM Linux
> Sent: Friday, September 11, 2015 9:03 AM
> To: Grazvydas Ignotas

> However, even the idea that it's ARMv7 or later is wrong. According to
> the ARM ARM, the IT instruction is present in ARMv6T2 as well, which
> means it's ARMv6 too (which would have __LINUX_ARM_ARCH__ = 6).

I recall seeing ARMv6T2 first implemented in the ARM1156 which is a v6 CPU with T2 option added. Cortex-R class was the ARMv7 successor to the 1156 CPU which also uses T2.

> Looking at the ARM ARM, these bits are "reserved" in previous non-T2
> architectures, have an undefined value at reset, and are probably zero
> anyway.
>
> Merely changing __LINUX_ARM_ARCH__ >= 7 to >= 6 should fix the
> problem,
> and I doubt there's any ARMv6 non-T2 systems out there that would be
> affected by clearing the IT state bits.

Probably you already looked, but cpsr.it usage is not restricted to this one spot.

Looking back at old notes I think both debug and signal handler code keyed on bit usage. I see from LXR kernel KVM code also uses in some capacity.

The 1156/Cortex-R are typically MMU-less. They may (or may not) have something else to consider when fixing.

Regards,
Richard W.
On Fri, Sep 11, 2015 at 04:12:21PM +0000, Woodruff, Richard wrote:
> > From: linux-omap-owner@vger.kernel.org [mailto:linux-omap-
> > owner@vger.kernel.org] On Behalf Of Russell King - ARM Linux
> > Sent: Friday, September 11, 2015 9:03 AM
> > To: Grazvydas Ignotas
>
> > However, even the idea that it's ARMv7 or later is wrong. According to
> > the ARM ARM, the IT instruction is present in ARMv6T2 as well, which
> > means it's ARMv6 too (which would have __LINUX_ARM_ARCH__ = 6).
>
> I recall seeing ARMv6T2 first implemented in the ARM1156 which is a
> v6 CPU with T2 option added.

Exactly, which is why we need to be dealing with the IT bits in signal handling for >= ARMv6, not >= ARMv7.

> > Looking at the ARM ARM, these bits are "reserved" in previous non-T2
> > architectures, have an undefined value at reset, and are probably zero
> > anyway.
> >
> > Merely changing __LINUX_ARM_ARCH__ >= 7 to >= 6 should fix the
> > problem,
> > and I doubt there's any ARMv6 non-T2 systems out there that would be
> > affected by clearing the IT state bits.
>
> Probably you already looked, but cpsr.it usage is not restricted to this
> one spot.

Other places:

arch/arm/mm/extable.c-#ifdef CONFIG_THUMB2_KERNEL
arch/arm/mm/extable.c-	/* Clear the IT state to avoid nasty surprises in the fixup */
arch/arm/mm/extable.c:	regs->ARM_cpsr &= ~PSR_IT_MASK;
arch/arm/mm/extable.c-#endif

which is irrelevant here. This code only deals with kernel mode, and the only time that this makes sense is when the kernel is built using Thumb2 instructions. CONFIG_THUMB2_KERNEL covers the case properly.

arch/arm/probes/kprobes/test-core.c-	regs->ARM_lr = val ^ (14 << 8);
arch/arm/probes/kprobes/test-core.c:	regs->ARM_cpsr &= ~(APSR_MASK | PSR_IT_MASK);
arch/arm/probes/kprobes/test-core.c-	regs->ARM_cpsr |= test_context_cpsr(scenario);

From what I can see, this happens unconditionally.

KVM and Xen code... that requires virtualisation support, which is ARMv7.

arch/arm/probes/kprobes/actions-thumb.c... emulating an IT instruction.
arch/arm/probes/decode.h::it_advance... emulating Thumb2.

So really there's no other places that need fixing.

> Looking back at old notes I think both debug and signal handler code
> keyed on bit usage. I see from LXR kernel KVM code also uses in some
> capacity.

Frankly, Richard, you're getting on my nerves in this thread - you seem to know all about this problem, yet you never reported the problem upstream, so people are effectively having to waste time re-doing the work that you've already done.

Nothing annoys me more than having people say "oh yes, I found that problem and worked on it" and nothing coming of it (no report, no patch, no nothing.)

As you have "old notes" you've already investigated this issue, and presumably you came up with a patch. Where is it?
> From: Russell King - ARM Linux [mailto:linux@arm.linux.org.uk]
> Sent: Friday, September 11, 2015 12:49 PM

> Frankly, Richard, you're getting on my nerves in this thread - you seem to
> know all about this problem, yet you never reported the problem upstream,
> so people are effectively having to waste time re-doing the work that you've
> already done.
>
> Nothing annoys me more than having people say "oh yes, I found that
> problem and worked on it" and nothing coming of it (no report, no patch, no
> nothing.)

Yes, when I put out the hint (to help speed resolution) I expected there might be some negative interpretation.

When I originally hit the issue, I did pass along information to folks who work in the area with expectation they would follow through. Probably it got lost. When I noticed this thread, it appeared like the CPSR.IT information didn't make it out, so I directly posted what I recalled.

> As you have "old notes" you've already investigated this issue, and
> presumably you came up with a patch. Where is it?

I didn't generate a comprehensive one. I did a couple of hack versions but was unsure in some of the areas your analysis has cleared... for that issue I ended up advising a reversion of MULTI_V6 for that older kernel.

Regards,
Richard W.
Hi Grazvydas,

* Tony Lindgren <tony@atomide.com> [150908 14:11]:
> * Grazvydas Ignotas <notasas@gmail.com> [150908 13:44]:
> > On Tue, Sep 8, 2015 at 4:38 PM, Tony Lindgren <tony@atomide.com> wrote:
> OK nice to hear you found it. Yeah looks like some runtime
> capability check is needed.
>
> > > Do you have some easy way to reproduce this issue?
> >
> > Just moving a browser window around with mouse usually triggers it
> > within a minute.
>
> OK good to know.

Just FYI, I was now able to reproduce it here too by moving iceweasel around for about a minute. And I can confirm Russell's patch fixes the problem.

I'm using the i3 tiling window manager here and don't usually have any floating windows, which probably explains why I did not run into this issue earlier with my lapdock experiments :)

Regards,

Tony
--- a/arch/arm/kernel/signal.c
+++ b/arch/arm/kernel/signal.c
@@ -340,13 +340,13 @@ setup_return(struct pt_regs *regs, struct ksignal *ksig,
 	/*
 	 * The LSB of the handler determines if we're going to
 	 * be using THUMB or ARM mode for this signal handler.
 	 */
 	thumb = handler & 1;
-#if __LINUX_ARM_ARCH__ >= 7
+#if 0 //__LINUX_ARM_ARCH__ >= 7
 	/*
 	 * Clear the If-Then Thumb-2 execution state
 	 * ARM spec requires this to be all 000s in ARM mode
 	 * Snapdragon S4/Krait misbehaves on a Thumb=>ARM
 	 * signal transition without this.
 	 */