Message ID | 20230324184129.3119575-1-nsg@linux.ibm.com (mailing list archive) |
---|---|
State | New, archived |
Headers | show |
Series | s390x TCG migration failure | expand |
On 24/03/2023 19.41, Nina Schoetterl-Glausch wrote: > Hi, > > We're seeing failures running s390x migration kvm-unit-tests tests with TCG. > Some initial findings: > What seems to be happening is that after migration a control block header accessed by the test code is all zeros which causes an unexpected exception. > I did a bisection which points to c8df4a7aef ("migration: Split save_live_pending() into state_pending_*") as the culprit. > The migration issue persists after applying the fix e264705012 ("migration: I messed state_pending_exact/estimate") on top of c8df4a7aef. Hi Nina, could you please open a ticket in the QEMU bug tracker and add the "8.0" label there, so that it correctly shows up in the list of things that should be fixed before the 8.0 release? > Applying > > diff --git a/migration/ram.c b/migration/ram.c > index 56ff9cd29d..2dc546cf28 100644 > --- a/migration/ram.c > +++ b/migration/ram.c > @@ -3437,7 +3437,7 @@ static void ram_state_pending_exact(void *opaque, uint64_t max_size, > > uint64_t remaining_size = rs->migration_dirty_pages * TARGET_PAGE_SIZE; > > - if (!migration_in_postcopy()) { > + if (!migration_in_postcopy() && remaining_size < max_size) { > qemu_mutex_lock_iothread(); > WITH_RCU_READ_LOCK_GUARD() { > migration_bitmap_sync_precopy(rs); > > on top fixes or hides the issue. (The comparison was removed by c8df4a7aef.) > I arrived at this by experimentation, I haven't looked into why this makes a difference. > > Any thoughts on the matter appreciated. Juan, could you comment on this, please? Thomas
On Tue, 2023-03-28 at 15:01 +0200, Thomas Huth wrote: > On 24/03/2023 19.41, Nina Schoetterl-Glausch wrote: > > Hi, > > > > We're seeing failures running s390x migration kvm-unit-tests tests with TCG. > > Some initial findings: > > What seems to be happening is that after migration a control block header accessed by the test code is all zeros which causes an unexpected exception. > > I did a bisection which points to c8df4a7aef ("migration: Split save_live_pending() into state_pending_*") as the culprit. > > The migration issue persists after applying the fix e264705012 ("migration: I messed state_pending_exact/estimate") on top of c8df4a7aef. > > Hi Nina, > > could you please open a ticket in the QEMU bug tracker and add the "8.0" > label there, so that it correctly shows up in the list of things that should > be fixed before the 8.0 release? https://gitlab.com/qemu-project/qemu/-/issues/1565 Not sure if I cannot add a label or if I overlooked how to.
On 29/03/2023 00.21, Nina Schoetterl-Glausch wrote: > On Tue, 2023-03-28 at 15:01 +0200, Thomas Huth wrote: >> On 24/03/2023 19.41, Nina Schoetterl-Glausch wrote: >>> Hi, >>> >>> We're seeing failures running s390x migration kvm-unit-tests tests with TCG. >>> Some initial findings: >>> What seems to be happening is that after migration a control block header accessed by the test code is all zeros which causes an unexpected exception. >>> I did a bisection which points to c8df4a7aef ("migration: Split save_live_pending() into state_pending_*") as the culprit. >>> The migration issue persists after applying the fix e264705012 ("migration: I messed state_pending_exact/estimate") on top of c8df4a7aef. >> >> Hi Nina, >> >> could you please open a ticket in the QEMU bug tracker and add the "8.0" >> label there, so that it correctly shows up in the list of things that should >> be fixed before the 8.0 release? > > https://gitlab.com/qemu-project/qemu/-/issues/1565 Thanks! > Not sure if I cannot add a label or if I overlooked how to. Ah, you might need to be at least a "Reviewer" in the qemu-project to be able to add labels and milestones. You can ask one of the owners or maintainers (see https://gitlab.com/qemu-project/qemu/-/project_members ) to add you as a reviewer. Anyway, I added the 8.0 milestone now to the ticket. Thomas
On 29/03/2023 08.36, Thomas Huth wrote: > On 29/03/2023 00.21, Nina Schoetterl-Glausch wrote: >> On Tue, 2023-03-28 at 15:01 +0200, Thomas Huth wrote: >>> On 24/03/2023 19.41, Nina Schoetterl-Glausch wrote: >>>> Hi, >>>> >>>> We're seeing failures running s390x migration kvm-unit-tests tests with >>>> TCG. >>>> Some initial findings: >>>> What seems to be happening is that after migration a control block >>>> header accessed by the test code is all zeros which causes an unexpected >>>> exception. >>>> I did a bisection which points to c8df4a7aef ("migration: Split >>>> save_live_pending() into state_pending_*") as the culprit. >>>> The migration issue persists after applying the fix e264705012 >>>> ("migration: I messed state_pending_exact/estimate") on top of c8df4a7aef. >>> >>> Hi Nina, >>> >>> could you please open a ticket in the QEMU bug tracker and add the "8.0" >>> label there, so that it correctly shows up in the list of things that should >>> be fixed before the 8.0 release? >> >> https://gitlab.com/qemu-project/qemu/-/issues/1565 > > Thanks! Ping! Juan, did you have a chance to look at this issue yet? ... we're getting quite close to the 8.0 release, and IMHO this sounds like a critical bug? Thomas
Nina Schoetterl-Glausch <nsg@linux.ibm.com> wrote: > Hi, > > We're seeing failures running s390x migration kvm-unit-tests tests with TCG. > Some initial findings: > What seems to be happening is that after migration a control block header accessed by the test code is all zeros which causes an unexpected exception. > I did a bisection which points to c8df4a7aef ("migration: Split save_live_pending() into state_pending_*") as the culprit. > The migration issue persists after applying the fix e264705012 ("migration: I messed state_pending_exact/estimate") on top of c8df4a7aef. > > Applying > > diff --git a/migration/ram.c b/migration/ram.c > index 56ff9cd29d..2dc546cf28 100644 > --- a/migration/ram.c > +++ b/migration/ram.c > @@ -3437,7 +3437,7 @@ static void ram_state_pending_exact(void *opaque, uint64_t max_size, > > uint64_t remaining_size = rs->migration_dirty_pages * TARGET_PAGE_SIZE; > > - if (!migration_in_postcopy()) { > + if (!migration_in_postcopy() && remaining_size < max_size) { > qemu_mutex_lock_iothread(); > WITH_RCU_READ_LOCK_GUARD() { > migration_bitmap_sync_precopy(rs); > > on top fixes or hides the issue. (The comparison was removed by c8df4a7aef.) > I arrived at this by experimentation, I haven't looked into why this makes a difference. > > Any thoughts on the matter appreciated. Ouch, you are right. Good catch. Queued the fix. Later, Juan.
Nina Schoetterl-Glausch <nsg@linux.ibm.com> wrote: > Hi, > > We're seeing failures running s390x migration kvm-unit-tests tests with TCG. > Some initial findings: > What seems to be happening is that after migration a control block header accessed by the test code is all zeros which causes an unexpected exception. > I did a bisection which points to c8df4a7aef ("migration: Split save_live_pending() into state_pending_*") as the culprit. > The migration issue persists after applying the fix e264705012 ("migration: I messed state_pending_exact/estimate") on top of c8df4a7aef. > > Applying > > diff --git a/migration/ram.c b/migration/ram.c > index 56ff9cd29d..2dc546cf28 100644 > --- a/migration/ram.c > +++ b/migration/ram.c > @@ -3437,7 +3437,7 @@ static void ram_state_pending_exact(void *opaque, uint64_t max_size, > > uint64_t remaining_size = rs->migration_dirty_pages * TARGET_PAGE_SIZE; > > - if (!migration_in_postcopy()) { > + if (!migration_in_postcopy() && remaining_size < max_size) { > qemu_mutex_lock_iothread(); > WITH_RCU_READ_LOCK_GUARD() { > migration_bitmap_sync_precopy(rs); > > on top fixes or hides the issue. (The comparison was removed by c8df4a7aef.) > I arrived at this by experimentation, I haven't looked into why this makes a difference. > Any thoughts on the matter appreciated. This shouldn't be happening. Famous last words. I am still applying the patch, to get back to old behaviour, but we shouldn't be needing this. Basically when we call ram_state_pending_exact() we know that we want to sync the bitmap. But I guess that dirty block bitmap can be "interesting" to say the less. Later, Juan.
Nina Schoetterl-Glausch <nsg@linux.ibm.com> wrote: > Hi, > > We're seeing failures running s390x migration kvm-unit-tests tests with TCG. As this is tcg, could you tell the exact command that you are running? Does it needs to be in s390x host, rigth? $ time ./tests/qtest/migration-test # random seed: R02S940c4f22abc48b14868566639d3d6c77 # Skipping test: s390x host with KVM is required 1..0 real 0m0.003s user 0m0.002s sys 0m0.001s > Some initial findings: > What seems to be happening is that after migration a control block > header accessed by the test code is all zeros which causes an > unexpected exception. What exception? What do you mean here by control block header? > I did a bisection which points to c8df4a7aef ("migration: Split save_live_pending() into state_pending_*") as the culprit. > The migration issue persists after applying the fix e264705012 ("migration: I messed state_pending_exact/estimate") on top of c8df4a7aef. > > Applying > > diff --git a/migration/ram.c b/migration/ram.c > index 56ff9cd29d..2dc546cf28 100644 > --- a/migration/ram.c > +++ b/migration/ram.c > @@ -3437,7 +3437,7 @@ static void ram_state_pending_exact(void *opaque, uint64_t max_size, > > uint64_t remaining_size = rs->migration_dirty_pages * TARGET_PAGE_SIZE; > > - if (!migration_in_postcopy()) { > + if (!migration_in_postcopy() && remaining_size < max_size) { If block is all zeros, then remaining_size should be zero, so always smaller than max_size. I don't really fully understand what is going here. > qemu_mutex_lock_iothread(); > WITH_RCU_READ_LOCK_GUARD() { > migration_bitmap_sync_precopy(rs); > > on top fixes or hides the issue. (The comparison was removed by c8df4a7aef.) > I arrived at this by experimentation, I haven't looked into why this makes a difference. > > Any thoughts on the matter appreciated. Later, Juan.
On Wed, 2023-04-12 at 23:01 +0200, Juan Quintela wrote: > Nina Schoetterl-Glausch <nsg@linux.ibm.com> wrote: > > Hi, > > > > We're seeing failures running s390x migration kvm-unit-tests tests with TCG. > > As this is tcg, could you tell the exact command that you are running? > Does it needs to be in s390x host, rigth? I've just tried with a cross compile of kvm-unit-tests and that fails, too. git clone https://gitlab.com/kvm-unit-tests/kvm-unit-tests.git cd kvm-unit-tests/ ./configure --cross-prefix=s390x-linux-gnu- --arch=s390x make for i in {0..30}; do echo $i; QEMU=../qemu/build/qemu-system-s390x ACCEL=tcg ./run_tests.sh migration-skey-sequential | grep FAIL && break; done > > $ time ./tests/qtest/migration-test I haven't looked if that test fails at all, we just noticed it with the kvm-unit-tests. > # random seed: R02S940c4f22abc48b14868566639d3d6c77 > # Skipping test: s390x host with KVM is required > 1..0 > > real 0m0.003s > user 0m0.002s > sys 0m0.001s > > > > Some initial findings: > > What seems to be happening is that after migration a control block > > header accessed by the test code is all zeros which causes an > > unexpected exception. > > What exception? > > What do you mean here by control block header? It's all s390x test guest specific stuff, I don't expect it to be too helpful. The guest gets a specification exception program interrupt while executing a SERVC because the SCCB control block is invalid. See https://gitlab.com/qemu-project/qemu/-/issues/1565 for a code snippet. The guest sets a bunch of fields in the SCCB header, but when TCG emulates the SERVC, they are zero which doesn't make sense. > > > I did a bisection which points to c8df4a7aef ("migration: Split save_live_pending() into state_pending_*") as the culprit. > > The migration issue persists after applying the fix e264705012 ("migration: I messed state_pending_exact/estimate") on top of c8df4a7aef. > > > > Applying > > > > diff --git a/migration/ram.c b/migration/ram.c > > index 56ff9cd29d..2dc546cf28 100644 > > --- a/migration/ram.c > > +++ b/migration/ram.c > > @@ -3437,7 +3437,7 @@ static void ram_state_pending_exact(void *opaque, uint64_t max_size, > > > > uint64_t remaining_size = rs->migration_dirty_pages * TARGET_PAGE_SIZE; > > > > - if (!migration_in_postcopy()) { > > + if (!migration_in_postcopy() && remaining_size < max_size) { > > If block is all zeros, then remaining_size should be zero, so always > smaller than max_size. > > I don't really fully understand what is going here. > > > qemu_mutex_lock_iothread(); > > WITH_RCU_READ_LOCK_GUARD() { > > migration_bitmap_sync_precopy(rs); > > > > on top fixes or hides the issue. (The comparison was removed by c8df4a7aef.) > > I arrived at this by experimentation, I haven't looked into why this makes a difference. > > > > Any thoughts on the matter appreciated. > > Later, Juan. >
diff --git a/migration/ram.c b/migration/ram.c index 56ff9cd29d..2dc546cf28 100644 --- a/migration/ram.c +++ b/migration/ram.c @@ -3437,7 +3437,7 @@ static void ram_state_pending_exact(void *opaque, uint64_t max_size, uint64_t remaining_size = rs->migration_dirty_pages * TARGET_PAGE_SIZE; - if (!migration_in_postcopy()) { + if (!migration_in_postcopy() && remaining_size < max_size) { qemu_mutex_lock_iothread(); WITH_RCU_READ_LOCK_GUARD() { migration_bitmap_sync_precopy(rs);