Message ID | 20150311191928.GA14695@morn.localdomain (mailing list archive) |
---|---|
State | New, archived |
* Kevin O'Connor (kevin@koconnor.net) wrote:
> On Wed, Mar 11, 2015 at 02:45:31PM -0400, Kevin O'Connor wrote:
> > On Wed, Mar 11, 2015 at 02:40:39PM -0400, Kevin O'Connor wrote:
> > > For what it's worth, I can't seem to trigger the problem if I move the
> > > cmos read above the SIPI/LAPIC code (see patch below).
> >
> > Ugh!
> >
> > That's a seabios bug.  Main processor modifies the rtc index
> > (rtc_read()) while APs try to clear the NMI bit by modifying the rtc
> > index (romlayout.S:transition32).
> >
> > I'll put together a fix.
>
> The seabios patch below resolves the issue for me.

Thanks! Looks good here.

Andrey, Paolo, Bandan: Does it fix it for you as well?

Dave

> -Kevin
>
>
> --- a/src/romlayout.S
> +++ b/src/romlayout.S
> @@ -22,7 +22,8 @@
>  // %edx = return location (in 32bit mode)
>  // Clobbers: ecx, flags, segment registers, cr0, idt/gdt
>          DECLFUNC transition32
> -transition32_for_smi:
> +transition32_nmi_off:
> +        // transition32 when NMI and A20 are already initialized
>          movl %eax, %ecx
>          jmp 1f
>  transition32:
> @@ -205,7 +206,7 @@ __farcall16:
>  entry_smi:
>          // Transition to 32bit mode.
>          movl $1f + BUILD_BIOS_ADDR, %edx
> -        jmp transition32_for_smi
> +        jmp transition32_nmi_off
>          .code32
>  1:      movl $BUILD_SMM_ADDR + 0x8000, %esp
>          calll _cfunc32flat_handle_smi - BUILD_BIOS_ADDR
> @@ -216,8 +217,10 @@ entry_smi:
>          DECLFUNC entry_smp
>  entry_smp:
>          // Transition to 32bit mode.
> +        cli
> +        cld
>          movl $2f + BUILD_BIOS_ADDR, %edx
> -        jmp transition32
> +        jmp transition32_nmi_off
>          .code32
>          // Acquire lock and take ownership of shared stack
>  1:      rep ; nop
--
Dr. David Alan Gilbert / dgilbert@redhat.com / Manchester, UK
--
To unsubscribe from this list: send the line "unsubscribe kvm" in
the body of a message to majordomo@vger.kernel.org
More majordomo info at http://vger.kernel.org/majordomo-info.html

"Dr. David Alan Gilbert" <dgilbert@redhat.com> writes:

> * Kevin O'Connor (kevin@koconnor.net) wrote:
>> The seabios patch below resolves the issue for me.
>
> Thanks! Looks good here.
>
> Andrey, Paolo, Bandan: Does it fix it for you as well?

Works for me too, thanks Kevin!

Bandan
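The race Kevin describes can be illustrated with a small model. This is only a sketch, not SeaBIOS code: it assumes the usual PC arrangement of a shared CMOS index port (0x70, with bit 7 gating NMI) and a data port (0x71), and it assumes the AP path selects the reset-code register 0x0f. The class and function names, and the register 0x10 used as the BSP's target, are hypothetical and chosen purely for illustration.

```python
NMI_DISABLE_BIT = 0x80  # bit 7 of the index port gates NMI on PC hardware


class Cmos:
    """Toy model of the shared CMOS/RTC index and data ports (hypothetical)."""

    def __init__(self, contents):
        self.contents = dict(contents)
        self.index = 0

    def write_index(self, value):
        # Only the low 7 bits select a register; bit 7 is the NMI gate.
        self.index = value & 0x7F

    def read_data(self):
        return self.contents.get(self.index, 0)


def ap_touch_nmi(cmos):
    """What an AP does in this model: rewrite the index to adjust the NMI bit."""
    cmos.write_index(0x0F | NMI_DISABLE_BIT)  # 0x0f: assumed reset-code register
    cmos.read_data()


def bsp_rtc_read(cmos, reg, interloper=None):
    """Two-step read on the main processor: select the register, then read it.

    `interloper` stands in for another CPU that runs between the two steps.
    """
    cmos.write_index(reg | NMI_DISABLE_BIT)
    if interloper is not None:
        interloper(cmos)  # the AP changes the shared index here
    return cmos.read_data()


cmos = Cmos({0x10: 0xAA, 0x0F: 0x00})
print(hex(bsp_rtc_read(cmos, 0x10)))                # 0xaa: undisturbed read
print(hex(bsp_rtc_read(cmos, 0x10, ap_touch_nmi)))  # 0x0: read hits register 0x0f
```

In this model the patch corresponds to the AP never calling `ap_touch_nmi` at all: `entry_smp` jumps to `transition32_nmi_off`, which skips the CMOS access, so the main processor's select-then-read sequence is no longer interrupted.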
On Wed, Mar 11, 2015 at 10:33 PM, Dr. David Alan Gilbert <dgilbert@redhat.com> wrote:
> * Kevin O'Connor (kevin@koconnor.net) wrote:
>> The seabios patch below resolves the issue for me.
>
> Thanks! Looks good here.
>
> Andrey, Paolo, Bandan: Does it fix it for you as well?

Thanks Kevin, Dave,

I'm afraid that I'm hitting something different, not only because of the
different suberror code but also because of my version of seabios: I am
using 1.7.5, and the corresponding code looks different there; the
SMP-related code that the proposed patch touches does not exist in it.
The devices I mentioned went to production successfully and I'm afraid I
cannot afford to experiment on them anymore; even if I re-trigger the
issue with a patched 1.8.1-rc, there is no way to switch to a different
kernel and retest, due to the specific conditions of this production
suite. I've ordered a pair of new shoes^W 2620v2-s which should arrive
next Monday, so I'll be able to a) test against 1.8.0-release, b) test
against the patched bios code, and c) reproduce the initial error on
master/3.19 (maybe I'll get them before the weekend by going to the
computer shop in person). Until then, I have a very strong feeling that
my issue is not there :) I have also become very curious about how the
lack of the IDT feature could completely eliminate the issue for me; the
only possible explanation is a clock-related race, which is a rather
silly suggestion and unlikely to exist in nature.

Thanks again to everyone for the thorough testing and ideas!
* Andrey Korolyov (andrey@xdel.ru) wrote:
> I've
> ordered a pair of new shoes^W 2620v2-s which should arrive to me next

Well I was testing on a pair of 'E5-2620 v2'; but as you saw my test case
was pretty simple.  If you can suggest any flags I should add etc to the
test I'd be happy to give it a go.

Dave

--
Dr. David Alan Gilbert / dgilbert@redhat.com / Manchester, UK
On Wed, Mar 11, 2015 at 10:59 PM, Dr. David Alan Gilbert <dgilbert@redhat.com> wrote:
> Well I was testing on a pair of 'E5-2620 v2'; but as you saw my test case
> was pretty simple.  If you can suggest any flags I should add etc to the
> test I'd be happy to give it a go.
>
> Dave

Here is my launch string:

qemu-system-x86_64 -enable-kvm -name vmtest -S -machine
pc-i440fx-2.1,accel=kvm,usb=off -cpu SandyBridge,+kvm_pv_eoi -m 512
-realtime mlock=off -smp 12,sockets=1,cores=12,threads=12 -numa
node,nodeid=0,cpus=0-11,mem=512 -nographic -no-user-config -nodefaults
-device sga -rtc base=utc,driftfix=slew -global
kvm-pit.lost_tick_policy=discard -no-hpet -no-shutdown -global
PIIX4_PM.disable_s3=1 -global PIIX4_PM.disable_s4=1 -boot strict=on
-device nec-usb-xhci,id=usb,bus=pci.0,addr=0x4 -device
virtio-serial-pci,id=virtio-serial0,bus=pci.0,addr=0x5 -m
512,slots=31,maxmem=16384M -object
memory-backend-ram,id=mem0,size=512M -device
pc-dimm,id=dimm0,node=0,memdev=mem0

I omitted the disk backend in this example, but there is a chance that my
problem is not reproducible without some calls made explicitly by a
bootloader (I am not sure what to say about the mid-runtime failures).
* Andrey Korolyov (andrey@xdel.ru) wrote:
> I omitted disk backend in this example, but there is a chance that my
> problem is not reproducible without some calls made explicitly by a
> bootloader (not sure what to say for mid-runtime failures).

It seems to survive OK:

while true; do (sleep 1; echo -e '\001cc\n'; sleep 5; echo -e 'q\n')|/opt/qemu-try-world3/bin/qemu-system-x86_64 -enable-kvm -name vmtest -S -machine pc-i440fx-2.1,accel=kvm,usb=off -cpu SandyBridge,+kvm_pv_eoi -m 512 -realtime mlock=off -smp 12,sockets=1,cores=12,threads=12 -numa node,nodeid=0,cpus=0-11,mem=512 -nographic -no-user-config -device sga -rtc base=utc,driftfix=slew -global kvm-pit.lost_tick_policy=discard -no-hpet -no-shutdown -global PIIX4_PM.disable_s3=1 -global PIIX4_PM.disable_s4=1 -boot strict=on -device nec-usb-xhci,id=usb,bus=pci.0,addr=0x4 -device virtio-serial-pci,id=virtio-serial0,bus=pci.0,addr=0x5 -m 512,slots=31,maxmem=16384M -object memory-backend-ram,id=mem0,size=512M -device pc-dimm,id=dimm0,node=0,memdev=mem0 ~/pi.vfd 2>&1 | tee /tmp/qemu.op; grep "internal error" /tmp/qemu.op -q && break; done

Dave

--
Dr. David Alan Gilbert / dgilbert@redhat.com / Manchester, UK
On Thu, Mar 12, 2015 at 12:59 PM, Dr. David Alan Gilbert <dgilbert@redhat.com> wrote:
> It seems to survive OK:

Thanks David, I'll go through the test sequence and report. Unfortunately
my orchestration does not have even hundred-millisecond precision for
libvirt events, so I can't tell whether the immediate start-up failures
happened before bootloader execution or during it; all I have for those
is a less-than-two-second interval between the launch command actually
being issued and the paused-state event. QEMU logging also does not give
me timestamps for emulation errors, even with the appropriate timestamp
argument.
For now, it looks like the bug has a mixed Murphy-Heisenberg nature, as
its appearance is very rare (compared to the number of actual launches)
and most probably bound to the physical characteristics of my production
nodes. As soon as I find a reproducible path in a regular workstation
environment, I'll let everyone know. Also, I am starting to think that
the issue may be tied to a particular motherboard firmware revision,
despite the fact that the CPU microcode is the same everywhere.
* Andrey Korolyov (andrey@xdel.ru) wrote:
> For now, it looks like bug have a mixed Murphy-Heisenberg nature, as
> it appearance is very rare (compared to the number of actual launches)
> and most probably bounded to the physical characteristics of my
> production nodes. As soon as I reach any reproducible path for a
> regular workstation environment, I'll let everyone know. Also I am
> starting to think that issue can belong to the particular motherboard
> firmware revision, despite fact that the CPU microcode is the same
> everywhere.

OK - so you're still seeing it with the new ROM that went in today?
( remotes/kraxel/tags/pull-seabios-1.8.1-20150316-1 )
and it doesn't trigger with my one line script?

Dave

--
Dr. David Alan Gilbert / dgilbert@redhat.com / Manchester, UK
On Mon, Mar 16, 2015 at 10:17 PM, Andrey Korolyov <andrey@xdel.ru> wrote:
> For now, it looks like bug have a mixed Murphy-Heisenberg nature, as
> it appearance is very rare (compared to the number of actual launches)
> and most probably bounded to the physical characteristics of my
> production nodes.

Hello everyone, I've managed to reproduce this issue
*deterministically* with the latest seabios with the smp fix and 3.18.3.
The error occurs just *once* per VM until the hypervisor reboots, at
least in my setup; this is definitely crazy...

- launch two VMs (Centos 7 in my case),
- wait a little while they are booting,
- attach serial console (I am using virsh list for this exact purpose),
- issue acpi reboot or reset, does not matter,
- the VM always hangs at boot, most times with the sgabios initialization
  string printed out [1], but sometimes it hangs a bit later [2],
- no matter how many times I try to relaunch QEMU afterwards, the
  issue does not appear on a VM which experienced the problem once;
- the trace and sample args can be seen in [3] and [4] respectively.

1)
Google, Inc.
Serial Graphics Adapter 06/11/14
SGABIOS $Id: sgabios.S 8 2010-04-22 00:03:40Z nlaredo $ (pbuilder@zorak) Wed Jun 11 05:57:34 UTC 2014
Term: 211x62
4 0

2)
Google, Inc.
Serial Graphics Adapter 06/11/14
SGABIOS $Id: sgabios.S 8 2010-04-22 00:03:40Z nlaredo $ (pbuilder@zorak) Wed Jun 11 05:57:34 UTC 2014
Term: 211x62
4 0
[...empty screen...]
SeaBIOS (version 1.8.1-20150325_230423-testnode)
Machine UUID 3c78721f-7317-4f85-bcbe-f5ad46d293a1

iPXE (http://ipxe.org) 00:02.0 C100 PCI2.10 PnP PMM+3FF95BA0+3FEF5BA0 C10

3)
KVM internal error. Suberror: 2
extra data[0]: 800000ef
extra data[1]: 80000b0d
EAX=00000000 EBX=00000000 ECX=00000000 EDX=00000000
ESI=00000000 EDI=00000000 EBP=00000000 ESP=00006d2c
EIP=0000d331 EFL=00010202 [-------] CPL=0 II=0 A20=1 SMM=0 HLT=0
ES =0000 00000000 0000ffff 00009300
CS =f000 000f0000 0000ffff 00009b00
SS =0000 00000000 0000ffff 00009300
DS =0000 00000000 0000ffff 00009300
FS =0000 00000000 0000ffff 00009300
GS =0000 00000000 0000ffff 00009300
LDT=0000 00000000 0000ffff 00008200
TR =0000 00000000 0000ffff 00008b00
GDT=     000f6cb0 00000037
IDT=     00000000 000003ff
CR0=00000010 CR2=00000000 CR3=00000000 CR4=00000000
DR0=0000000000000000 DR1=0000000000000000 DR2=0000000000000000 DR3=0000000000000000
DR6=00000000ffff0ff0 DR7=0000000000000400
EFER=0000000000000000
Code=66 c3 cd 02 cb cd 10 cb cd 13 cb cd 15 cb cd 16 cb cd 18 cb <cd> 19 cb cd 1c cb cd 4a cb fa fc 66 ba 47 d3 0f 00 e9 ad fe f3 90 f0 0f ba 2d d4 fe fb 3f

4)
/usr/bin/qemu-system-x86_64 -name centos71 -S -machine
pc-i440fx-2.1,accel=kvm,usb=off -cpu SandyBridge,+kvm_pv_eoi -bios
/usr/share/seabios/bios.bin -m 1024 -realtime mlock=off -smp
12,sockets=1,cores=12,threads=12 -uuid
3c78721f-7317-4f85-bcbe-f5ad46d293a1 -nographic -no-user-config
-nodefaults -device sga -chardev
socket,id=charmonitor,path=/var/lib/libvirt/qemu/centos71.monitor,server,nowait
-mon chardev=charmonitor,id=monitor,mode=control -rtc
base=utc,driftfix=slew -global kvm-pit.lost_tick_policy=discard
-no-hpet -no-shutdown -global PIIX4_PM.disable_s3=1 -global
PIIX4_PM.disable_s4=1 -boot strict=on -device
nec-usb-xhci,id=usb,bus=pci.0,addr=0x3 -device
virtio-serial-pci,id=virtio-serial0,bus=pci.0,addr=0x4 -drive
file=rbd:dev-rack2/centos7-1.raw:id=qemukvm:key=XXXXXXXXXXXXXXXXXXXXXXXXXX:auth_supported=cephx\;none:mon_host=10.6.0.1\:6789\;10.6.0.3\:6789\;10.6.0.4\:6789,if=none,id=drive-virtio-disk0,format=raw,cache=writeback,aio=native
-device virtio-blk-pci,scsi=off,bus=pci.0,addr=0x5,drive=drive-virtio-disk0,id=virtio-disk0,bootindex=1
-chardev pty,id=charserial0 -device
isa-serial,chardev=charserial0,id=serial0 -chardev
socket,id=charchannel0,path=/var/lib/libvirt/qemu/centos71.sock,server,nowait
-device virtserialport,bus=virtio-serial0.0,nr=1,chardev=charchannel0,id=channel0,name=org.qemu.guest_agent.1
-msg timestamp=on
> - attach serial console (I am using virsh list for this exact purpose),
virsh console of course, sorry
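
The reproduction steps listed above can be sketched as a small shell helper. This is only an illustration, not the script used in the report: the functions `run`, `repro_start` and `repro_reset`, the variables `DRY_RUN`, `BOOT_WAIT` and `RESET_MODE`, and the second domain name `centos72` in the usage note are assumptions. Only `centos71` appears in the posted command line.

```shell
#!/bin/sh
# Hypothetical helper: run a command, or only print it when DRY_RUN=1.
run() {
    if [ "${DRY_RUN:-0}" = "1" ]; then
        echo "$*"
    else
        "$@"
    fi
}

# Start every libvirt domain given as an argument, then wait a little
# while the guests boot (BOOT_WAIT seconds, 10 by default; an assumption).
repro_start() {
    for dom in "$@"; do
        run virsh start "$dom"
    done
    run sleep "${BOOT_WAIT:-10}"
}

# Reset one domain. The report says ACPI reboot and reset both trigger
# the hang, so RESET_MODE=acpi selects an ACPI reboot instead of a reset.
repro_reset() {
    if [ "${RESET_MODE:-reset}" = "acpi" ]; then
        run virsh reboot "$1" --mode acpi
    else
        run virsh reset "$1"
    fi
}
```

A possible session: `repro_start centos71 centos72`, then `virsh console centos71` in a second terminal, then `repro_reset centos71`. Setting `DRY_RUN=1` prints the commands instead of running them.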
On Wed, Mar 25, 2015 at 11:43:31PM +0300, Andrey Korolyov wrote:
> On Mon, Mar 16, 2015 at 10:17 PM, Andrey Korolyov <andrey@xdel.ru> wrote:
> > For now, it looks like the bug has a mixed Murphy-Heisenberg nature, as
> > its appearance is very rare (compared to the number of actual launches)
> > and most probably bound to the physical characteristics of my
> > production nodes. As soon as I find a reproducible path for a
> > regular workstation environment, I'll let everyone know. Also, I am
> > starting to think that the issue may be tied to a particular motherboard
> > firmware revision, despite the fact that the CPU microcode is the same
> > everywhere.
>
> Hello everyone, I've managed to reproduce this issue
> *deterministically* with the latest seabios with the smp fix and 3.18.3.
> The error occurs just *once* per VM until the hypervisor reboots, at
> least in my setup, this is definitely crazy...
>
> - launch two VMs (CentOS 7 in my case),
> - wait a little while they are booting,
> - attach serial console (I am using virsh list for this exact purpose),
> - issue acpi reboot or reset, does not matter,
> - the VM always hangs at boot, most times with the sgabios initialization
>   string printed out [1], but sometimes it hangs a bit later [2],
> - no matter how many times I try to relaunch QEMU afterwards, the
>   issue does not appear on a VM which experienced the problem once;
> - trace and sample args can be seen in [3] and [4] respectively.

Can you add something like:

  -chardev file,path=seabioslog.`date +%s`,id=seabios -device isa-debugcon,iobase=0x402,chardev=seabios

to the qemu command line and forward the resulting log from both a
successful boot and a failed one?

-Kevin
On Wed, Mar 25, 2015 at 11:54 PM, Kevin O'Connor <kevin@koconnor.net> wrote:
> On Wed, Mar 25, 2015 at 11:43:31PM +0300, Andrey Korolyov wrote:
>> [...]
>> - trace and sample args can be seen in [3] and [4] respectively.
>
> Can you add something like:
>
>   -chardev file,path=seabioslog.`date +%s`,id=seabios -device isa-debugcon,iobase=0x402,chardev=seabios
>
> to the qemu command line and forward the resulting log from both a
> successful boot and a failed one?
>
> -Kevin

Of course, logs are attached.
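
Kevin's request can be wrapped in two small shell helpers. This is a sketch under assumptions: the function names `debugcon_args` and `log_diff` are made up for illustration, and comparing the two logs with `diff` is just one convenient way to look for the point where the failed boot diverges. The chardev and isa-debugcon options are exactly the ones quoted above.

```shell
#!/bin/sh
# Print the extra QEMU arguments quoted above. The log file name gets a
# suffix: the first argument if given, otherwise the current Unix time.
debugcon_args() {
    log="seabioslog.${1:-$(date +%s)}"
    echo "-chardev file,path=$log,id=seabios -device isa-debugcon,iobase=0x402,chardev=seabios"
}

# Show the first lines of the diff between the log of a successful boot ($1)
# and the log of a failed boot ($2). $3 limits the output (20 lines by default).
log_diff() {
    diff "$1" "$2" | head -n "${3:-20}"
}
```

For example, `debugcon_args` prints the arguments to append to the command line, and `log_diff seabioslog.good seabioslog.bad` (hypothetical file names) shows where the two logs start to differ.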
Hi Andrey,

Andrey Korolyov <andrey@xdel.ru> writes:

> On Mon, Mar 16, 2015 at 10:17 PM, Andrey Korolyov <andrey@xdel.ru> wrote:
>> For now, it looks like the bug has a mixed Murphy-Heisenberg nature, as
>> its appearance is very rare (compared to the number of actual launches)
>> and most probably bound to the physical characteristics of my
>> production nodes. As soon as I find a reproducible path for a
>> regular workstation environment, I'll let everyone know. Also, I am
>> starting to think that the issue may be tied to a particular motherboard
>> firmware revision, despite the fact that the CPU microcode is the same
>> everywhere.

I will take the risk and say this - "could it be a processor bug?" :)

> Hello everyone, I've managed to reproduce this issue
> *deterministically* with the latest seabios with the smp fix and 3.18.3.
> The error occurs just *once* per VM until the hypervisor reboots, at
> least in my setup, this is definitely crazy...
>
> - launch two VMs (CentOS 7 in my case),
> - wait a little while they are booting,
> - attach serial console (I am using virsh list for this exact purpose),
> - issue acpi reboot or reset, does not matter,
> - the VM always hangs at boot, most times with the sgabios initialization
>   string printed out [1], but sometimes it hangs a bit later [2],
> - no matter how many times I try to relaunch QEMU afterwards, the
>   issue does not appear on a VM which experienced the problem once;
> - trace and sample args can be seen in [3] and [4] respectively.

My system is a Dell R720 dual socket which has 2620v2s. I tried your
setup but couldn't reproduce (my qemu cmdline isn't exactly the same
as yours), although, if you could simplify your command line a bit,
I can try again.

Bandan

> [...]
On Thu, Mar 26, 2015 at 5:47 AM, Bandan Das <bsd@redhat.com> wrote:
> Hi Andrey,
>
> Andrey Korolyov <andrey@xdel.ru> writes:
>
>> On Mon, Mar 16, 2015 at 10:17 PM, Andrey Korolyov <andrey@xdel.ru> wrote:
>>> [...]
>
> I will take the risk and say this - "could it be a processor bug?" :)
>
>> [...]
>> - trace and sample args can be seen in [3] and [4] respectively.
>
> My system is a Dell R720 dual socket which has 2620v2s. I tried your
> setup but couldn't reproduce (my qemu cmdline isn't exactly the same
> as yours), although, if you could simplify your command line a bit,
> I can try again.
>
> Bandan
>
>> [...]

Hehe, 2.2 works perfectly but 2.1 doesn't. I'll bisect the issue in
the next couple of days and post the right commit (though, as far as I
can remember, none of the commits between 2.1 and 2.2 was meant to fix
a similar issue). I've attached a reference XML to simplify playing
with libvirt, if anyone is willing to do so.
On Thu, Mar 26, 2015 at 12:18 PM, Andrey Korolyov <andrey@xdel.ru> wrote:
> On Thu, Mar 26, 2015 at 5:47 AM, Bandan Das <bsd@redhat.com> wrote:
>> [...]
>
> Hehe, 2.2 works perfectly but 2.1 doesn't. I'll bisect the issue in
> the next couple of days and post the right commit (though, as far as I
> can remember, none of the commits between 2.1 and 2.2 was meant to fix
> a similar issue). I've attached a reference XML to simplify playing
> with libvirt, if anyone is willing to do so.

Sorry, 2.2 hangs as well, but more rarely. It looks like it is important
to conduct the test sequence on a freshly booted host, as the issue
tends not to reappear within the same hypervisor boot cycle. Please let
me know if the host kernel config is needed, for example if nobody is
able to reproduce this.
--- a/src/romlayout.S
+++ b/src/romlayout.S
@@ -22,7 +22,8 @@
 // %edx = return location (in 32bit mode)
 // Clobbers: ecx, flags, segment registers, cr0, idt/gdt
         DECLFUNC transition32
-transition32_for_smi:
+transition32_nmi_off:
+        // transition32 when NMI and A20 are already initialized
         movl %eax, %ecx
         jmp 1f
 transition32:
@@ -205,7 +206,7 @@ __farcall16:
 entry_smi:
         // Transition to 32bit mode.
         movl $1f + BUILD_BIOS_ADDR, %edx
-        jmp transition32_for_smi
+        jmp transition32_nmi_off
         .code32
 1:      movl $BUILD_SMM_ADDR + 0x8000, %esp
         calll _cfunc32flat_handle_smi - BUILD_BIOS_ADDR
@@ -216,8 +217,10 @@ entry_smi:
         DECLFUNC entry_smp
 entry_smp:
         // Transition to 32bit mode.
+        cli
+        cld
         movl $2f + BUILD_BIOS_ADDR, %edx
-        jmp transition32
+        jmp transition32_nmi_off
         .code32
         // Acquire lock and take ownership of shared stack
 1:      rep ; nop