Message ID | 87fuf0w6dp.fsf@abhimanyu.i-did-not-set--mail-host-address--so-tickle-me (mailing list archive) |
---|---|
State | New, archived |
On Fri, Jun 16, 2017 at 04:23:38PM +0530, Nikunj A Dadhania wrote:
> Nikunj A Dadhania <nikunj@linux.vnet.ibm.com> writes:
>
> > Greg Kurz <groug@kaod.org> writes:
> >
> >> On Sun, 11 Jun 2017 17:38:42 +0800
> >> David Gibson <david@gibson.dropbear.id.au> wrote:
> >>
> >>> On Fri, Jun 09, 2017 at 05:09:13PM +0200, Greg Kurz wrote:
> >>> > On Fri, 9 Jun 2017 20:28:32 +1000
> >>> > David Gibson <david@gibson.dropbear.id.au> wrote:
> >>> >
> >>> > > On Fri, Jun 09, 2017 at 11:36:31AM +0200, Greg Kurz wrote:
> >>> > > > On Fri, 9 Jun 2017 12:28:13 +1000
> >>> > > > David Gibson <david@gibson.dropbear.id.au> wrote:
> >>> > > >
> >>> > 1) start guest
> >>> >
> >>> > qemu-system-ppc64 \
> >>> >  -nodefaults -nographic -snapshot -no-shutdown -serial mon:stdio \
> >>> >  -device virtio-net,netdev=netdev0,id=net0 \
> >>> >  -netdev bridge,id=netdev0,br=virbr0,helper=/usr/libexec/qemu-bridge-helper \
> >>> >  -device virtio-blk,drive=drive0,id=blk0 \
> >>> >  -drive file=/home/greg/images/sle12-sp1-ppc64le.qcow2,id=drive0,if=none \
> >>> >  -machine type=pseries,accel=tcg -cpu POWER8
> >
> > Strangely, your command line does not have multiple threads. Need to see
> > what is the side effect of enabling MTTCG by default here.
> >
> >>> >
> >>> > 2) migrate
> >>> >
> >>> > 3) destination crashes (immediately or after very short delay) or
> >>> >    hangs
> >>>
> >>> Ok.  I'll bisect it when I can, but you might well get to it first.
> >>>
> >>>
> >>
> >> Heh, maybe you didn't see in my mail but I did bisect:
> >>
> >> f0b0685d6694a28c66018f438e822596243b1250 is the first bad commit
> >> commit f0b0685d6694a28c66018f438e822596243b1250
> >> Author: Nikunj A Dadhania <nikunj@linux.vnet.ibm.com>
> >> Date:   Thu Apr 27 10:48:23 2017 +0530
> >>
> >>     tcg: enable MTTCG by default for PPC64 on x86
> >
> > Let me have a look at it.
>
> Interesting problem here, I see that when the migration is completed on
> source and there is a crash on destination:
>
> [   56.185314] Unable to handle kernel paging request for data at address 0x5deadbeef0000108
> [   56.185401] Faulting instruction address: 0xc000000000277bc8
>
> 0xc000000000277bb8 <+168>:   ld      r7,8(r4)
> 0xc000000000277bbc <+172>:   ld      r6,0(r4)     <========
> 0xc000000000277bc0 <+176>:   ori     r8,r8,56302
> 0xc000000000277bc4 <+180>:   rldicr  r8,r8,32,31
> 0xc000000000277bc8 <+184>:   std     r7,8(r6)
>
> r4 = 0xf0000000000107a0
> r6 = 0x5deadbeef0000100
>
> Code at 0xc000000000277bbc <+172>, gave junk value in r6, that leads to
> the guest crash. When I inspect the memory on source and destination in
> qemu monitor, I get the following differences:
>
> diff -u s.txt d.txt
> --- s.txt       2017-06-16 10:34:39.657221125 +0530
> +++ d.txt       2017-06-16 10:34:18.452238305 +0530
> @@ -8,8 +8,8 @@
>  f000000000010760: 0x20de0b00 0x000000f0 0x60040100 0x000000f0
>  f000000000010770: 0x00000000 0x00000000 0x0004036d 0x000000c0
>  f000000000010780: 0x6c000100 0xf8ff3f00 0x7817f977 0x000000c0
> -f000000000010790: 0x15000000 0x00000000 0xffffffff 0x01000000
> -f0000000000107a0: 0x3090a96d 0x000000c0 0x3090a96d 0x000000c0
> +f000000000010790: 0x01000000 0x00000000 0xffffffff 0x01000000
> +f0000000000107a0: 0x000100f0 0xeedbea5d 0x000200f0 0xeedbea5d
>  f0000000000107b0: 0x00000000 0x00000000 0x00d0a96d 0x000000c0
>  f0000000000107c0: 0x28000000 0xf8ff3f00 0x8852cc77 0x000000c0
>  f0000000000107d0: 0x00000000 0x00000000 0xffffffff 0x01000000
>
> Source had a valid address at 0xf0000000000107a0, while garbage on the
> destination.
>
> Some observations:
>
> * Source updates the memory location (probably atomic_cmpxchg), but the
>   updated page didnt get transferred to the destination
>
> * Getting rid of atomic_cmpxchg tcg ops in ldarx/stdcx, makes migration
>   work fine. MTTCG running with 1 cpu.
>
> While I continue debugging, any hints would help.
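
The dump words at 0xf0000000000107a0 can be tied back to the r6 value in the crash log with a little byte shuffling. The sketch below assumes (this is not stated in the thread) that each 32-bit word in the dump shows the guest's bytes in memory order, so a little-endian 64-bit load sees them reversed. The helper name `dump_words_to_le64` is made up for illustration.

```python
import struct

def dump_words_to_le64(words):
    """Reassemble little-endian 64-bit values from 32-bit dump words.

    Assumption: each dump word lists its four bytes in memory order,
    i.e. it is the big-endian reading of those bytes.
    """
    raw = b"".join(struct.pack(">I", w) for w in words)
    return [value for (value,) in struct.iter_unpack("<Q", raw)]

# Words taken from the source (s.txt) and destination (d.txt) dumps.
src = dump_words_to_le64([0x3090a96d, 0x000000c0, 0x3090a96d, 0x000000c0])
dst = dump_words_to_le64([0x000100f0, 0xeedbea5d, 0x000200f0, 0xeedbea5d])

for name, values in (("source", src), ("destination", dst)):
    print(name, [hex(v) for v in values])

# The faulting store is std r7,8(r6), so the fault address is r6 + 8.
print("fault address:", hex(dst[0] + 8))
```

Under that assumption the first destination value is 0x5deadbeef0000100, the same as r6, and adding 8 gives the 0x5deadbeef0000108 address in the paging-request message. The source words decode to 0xc00000006da99030, which looks like an ordinary kernel pointer.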
My first guess would be that some or all of the new TCG atomic primitives aren't updating the dirty page bitmap. My second guess would be a race between the atomic TCG ops and the migration / dirty map handling, which means we can lose a memory update and not transfer it to the destination.
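
The first guess can be illustrated with a toy model. This is only a sketch of the general idea of dirty-page tracking during migration, not QEMU's implementation: the class `ToyGuestRam`, its methods, and the `mark_dirty` switch are all hypothetical names invented for this example.

```python
PAGE_SIZE = 4096

class ToyGuestRam:
    """Hypothetical guest RAM with a per-page dirty set (not QEMU code)."""

    def __init__(self, npages):
        self.mem = bytearray(npages * PAGE_SIZE)
        # Every page counts as dirty before the first migration pass.
        self.dirty = set(range(npages))

    def store(self, addr, value, mark_dirty=True):
        # mark_dirty=False models a write path that forgets the bitmap.
        self.mem[addr] = value
        if mark_dirty:
            self.dirty.add(addr // PAGE_SIZE)

    def sync_dirty_pages(self, dest):
        # Copy only the pages recorded as dirty, then clear the record.
        for page in sorted(self.dirty):
            start = page * PAGE_SIZE
            dest[start:start + PAGE_SIZE] = self.mem[start:start + PAGE_SIZE]
        self.dirty.clear()

def migrate(mark_dirty):
    """Return True if the destination ends up identical to the source."""
    ram = ToyGuestRam(npages=4)
    dest = bytearray(len(ram.mem))
    ram.sync_dirty_pages(dest)                      # bulk pass
    ram.store(0x17A0, 0x30, mark_dirty=mark_dirty)  # guest write mid-migration
    ram.sync_dirty_pages(dest)                      # final pass
    return dest == ram.mem

print("write marks page dirty:", migrate(mark_dirty=True))
print("write skips dirty bitmap:", migrate(mark_dirty=False))
```

When the write skips the dirty set, the final pass has nothing to copy and the destination keeps the stale page, which is the kind of source/destination mismatch shown in the diff. The second guess (a race) would need a concurrent model and is not covered by this sketch.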