Message ID | 20240523201922.28007-5-farosas@suse.de (mailing list archive)
---|---
State | New, archived
Series | migration-test: Device migration smoke tests
On Thu, May 23, 2024 at 05:19:22PM -0300, Fabiano Rosas wrote:
> We have two new migration tests that check cross version
> compatibility. One uses the vmstate-static-checker.py script to
> compare the vmstate structures from two different QEMU versions. The
> other runs a simple migration with a few devices present in the VM, to
> catch obvious breakages.
>
> Add both tests to the migration-compat-common job.
>
> Signed-off-by: Fabiano Rosas <farosas@suse.de>
> ---
>  .gitlab-ci.d/buildtest.yml | 43 +++++++++++++++++++++++++++++++-------
>  1 file changed, 36 insertions(+), 7 deletions(-)
>
> diff --git a/.gitlab-ci.d/buildtest.yml b/.gitlab-ci.d/buildtest.yml
> index 91c57efded..bc7ac35983 100644
> --- a/.gitlab-ci.d/buildtest.yml
> +++ b/.gitlab-ci.d/buildtest.yml
> @@ -202,18 +202,47 @@ build-previous-qemu:
>    needs:
>      - job: build-previous-qemu
>      - job: build-system-opensuse
> -  # The old QEMU could have bugs unrelated to migration that are
> -  # already fixed in the current development branch, so this test
> -  # might fail.
> +  # This test is allowed to fail because:
> +  #
> +  # - The old QEMU could have bugs unrelated to migration that are
> +  #   already fixed in the current development branch.

Did you ever hit a real failure with this?  I'm wondering whether we can
remove this allow_failure thing.

> +  #
> +  # - The vmstate-static-checker script trips on renames and other
> +  #   backward-compatible changes to the vmstate structs.

I think I keep my preference per last time we talked on this. :)

I still think it's too early to involve a test that can report false
negative.  I'd still keep running this before soft-freeze like I used to
do, throw issues to others and urge them to fix before release.  Per my
previous experience that doesn't consume me a lot of time, and it's not
common to see issues either.

So I want people to really pay attention when someone sees a migration CI
test failed, rather than we help people form the habit in "oh migration CI
failed again?  I think that's fine, it allows failing anyway".

So far I still don't see as much benefit to adding this if we need to pay
for the other false negative issue.  I'll fully support it if e.g. we can
fix the tool to avoid reporting false negatives, but that may take effort
that I didn't check.

>    allow_failure: true
>    variables:
>      IMAGE: opensuse-leap
>      MAKE_CHECK_ARGS: check-build
>    script:
> -    # Use the migration-tests from the older QEMU tree. This avoids
> -    # testing an old QEMU against new features/tests that it is not
> -    # compatible with.
> -    - cd build-previous
> +    - cd build
> +    # device state static test: Tests the vmstate structures for
> +    # compatibility across QEMU versions. Uses the latest version of
> +    # the tests.
> +    # old to new
> +    - PYTHON=pyvenv/bin/python3
> +      QTEST_QEMU_BINARY_SRC=../build-previous/qemu-system-${TARGET}
> +      QTEST_QEMU_BINARY=./qemu-system-${TARGET}
> +      ./tests/qtest/migration-test -p /${TARGET}/migration/vmstate-checker-script
> +    # new to old skipped because vmstate version bumps are always
> +    # backward incompatible.
> +
> +    # device state runtime test: Performs a cross-version migration
> +    # with a select list of devices (see DEFAULT_DEVICES in
> +    # migration-test.c). Using the multifd tcp test here, but any will
> +    # do.
> +    # old to new
> +    - QTEST_DEVICE_OPTS=all QTEST_QEMU_BINARY_SRC=../build-previous/qemu-system-${TARGET}
> +      QTEST_QEMU_BINARY=./qemu-system-${TARGET} ./tests/qtest/migration-test
> +      -p /${TARGET}/migration/multifd/tcp/channels/plain/none
> +    # new to old
> +    - QTEST_DEVICE_OPTS=all QTEST_QEMU_BINARY_DST=../build-previous/qemu-system-${TARGET}
> +      QTEST_QEMU_BINARY=./qemu-system-${TARGET} ./tests/qtest/migration-test
> +      -p /${TARGET}/migration/multifd/tcp/channels/plain/none
> +
> +    # migration core tests: Use the migration-tests from the older
> +    # QEMU tree. This avoids testing an old QEMU against new
> +    # features/tests that it is not compatible with.
> +    - cd ../build-previous
>      # old to new
>      - QTEST_QEMU_BINARY_SRC=./qemu-system-${TARGET}
>        QTEST_QEMU_BINARY=../build/qemu-system-${TARGET} ./tests/qtest/migration-test
> --
> 2.35.3
>
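To make the job's invocation pattern above easier to follow: migration-test falls back to QTEST_QEMU_BINARY whenever the SRC/DST override is unset, so each cross-version direction only needs one extra variable. A minimal sketch of that convention (the binary paths are placeholders, not a real checkout):

```shell
# Sketch of the QTEST_QEMU_BINARY_SRC/_DST convention used by the job.
# "old to new" sets only the SRC override; "new to old" only the DST one.
QTEST_QEMU_BINARY=./qemu-system-x86_64
QTEST_QEMU_BINARY_SRC=../build-previous/qemu-system-x86_64

# Each side falls back to QTEST_QEMU_BINARY when its override is unset.
src=${QTEST_QEMU_BINARY_SRC:-$QTEST_QEMU_BINARY}
dst=${QTEST_QEMU_BINARY_DST:-$QTEST_QEMU_BINARY}
echo "migrate from: $src"
echo "migrate to:   $dst"
```

With only SRC set, this prints the previous-version build as the migration source and the current build as the destination, which is the "old to new" direction.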
Peter Xu <peterx@redhat.com> writes: > On Thu, May 23, 2024 at 05:19:22PM -0300, Fabiano Rosas wrote: >> We have two new migration tests that check cross version >> compatibility. One uses the vmstate-static-checker.py script to >> compare the vmstate structures from two different QEMU versions. The >> other runs a simple migration with a few devices present in the VM, to >> catch obvious breakages. >> >> Add both tests to the migration-compat-common job. >> >> Signed-off-by: Fabiano Rosas <farosas@suse.de> >> --- >> .gitlab-ci.d/buildtest.yml | 43 +++++++++++++++++++++++++++++++------- >> 1 file changed, 36 insertions(+), 7 deletions(-) >> >> diff --git a/.gitlab-ci.d/buildtest.yml b/.gitlab-ci.d/buildtest.yml >> index 91c57efded..bc7ac35983 100644 >> --- a/.gitlab-ci.d/buildtest.yml >> +++ b/.gitlab-ci.d/buildtest.yml >> @@ -202,18 +202,47 @@ build-previous-qemu: >> needs: >> - job: build-previous-qemu >> - job: build-system-opensuse >> - # The old QEMU could have bugs unrelated to migration that are >> - # already fixed in the current development branch, so this test >> - # might fail. >> + # This test is allowed to fail because: >> + # >> + # - The old QEMU could have bugs unrelated to migration that are >> + # already fixed in the current development branch. > > Did you ever hit a real failure with this? I'm wondering whether we can > remove this allow_failure thing. > I haven't. But when it fails we'll go through an entire release cycle with this thing showing red for every person that runs the CI. Remember, this is a CI failure to which there's no fix aside from waiting for the release to happen. Even if we're quick to react and disable the job, I feel it might create some confusion already. >> + # >> + # - The vmstate-static-checker script trips on renames and other >> + # backward-compatible changes to the vmstate structs. > > I think I keep my preference per last time we talked on this. 
> :)

Sorry, I'm not trying to force this in any way, I just wrote these to
use in the pull-request and thought I'd put it out there. At the very
least we can have your concerns documented. =)

> I still think it's too early to involve a test that can report false
> negative.

(1)
Well, we haven't seen any false negatives, we've seen fields being
renamed. If that happens, then we'll ask the person to update the
script. Is that not acceptable to you? Or are you thinking about other
sorts of issues?

> I'd still keep running this before soft-freeze like I used to
> do, throw issues to others and urge them to fix before release.

Having hidden procedures that maintainers run before a release is bad
IMHO, it just delays the catching of bugs and frustrates
contributors. Imagine working on a series: everything goes well with
reviews, CI passes, the patch gets queued and merged, and a month later
you get a ping about something you should have done to avoid breaking
migration. Right during freeze.

> Per my
> previous experience that doesn't consume me a lot of time, and it's not
> common to see issues either.
>
> So I want people to really pay attention when someone sees a migration CI
> test failed, rather than we help people form the habit in "oh migration CI
> failed again? I think that's fine, it allows failing anyway".

That's a good point. I don't think it applies here though. See my point
in (1).

> So far I still don't see as much benefit to adding this if we need to pay
> for the other false negative issue. I'll fully support it if e.g. we can
> fix the tool to avoid reporting false negatives, but that may take effort
> that I didn't check.
>
On Mon, May 27, 2024 at 08:59:00PM -0300, Fabiano Rosas wrote: > Peter Xu <peterx@redhat.com> writes: > > > On Thu, May 23, 2024 at 05:19:22PM -0300, Fabiano Rosas wrote: > >> We have two new migration tests that check cross version > >> compatibility. One uses the vmstate-static-checker.py script to > >> compare the vmstate structures from two different QEMU versions. The > >> other runs a simple migration with a few devices present in the VM, to > >> catch obvious breakages. > >> > >> Add both tests to the migration-compat-common job. > >> > >> Signed-off-by: Fabiano Rosas <farosas@suse.de> > >> --- > >> .gitlab-ci.d/buildtest.yml | 43 +++++++++++++++++++++++++++++++------- > >> 1 file changed, 36 insertions(+), 7 deletions(-) > >> > >> diff --git a/.gitlab-ci.d/buildtest.yml b/.gitlab-ci.d/buildtest.yml > >> index 91c57efded..bc7ac35983 100644 > >> --- a/.gitlab-ci.d/buildtest.yml > >> +++ b/.gitlab-ci.d/buildtest.yml > >> @@ -202,18 +202,47 @@ build-previous-qemu: > >> needs: > >> - job: build-previous-qemu > >> - job: build-system-opensuse > >> - # The old QEMU could have bugs unrelated to migration that are > >> - # already fixed in the current development branch, so this test > >> - # might fail. > >> + # This test is allowed to fail because: > >> + # > >> + # - The old QEMU could have bugs unrelated to migration that are > >> + # already fixed in the current development branch. > > > > Did you ever hit a real failure with this? I'm wondering whether we can > > remove this allow_failure thing. > > > > I haven't. But when it fails we'll go through an entire release cycle > with this thing showing red for every person that runs the CI. Remember, > this is a CI failure to which there's no fix aside from waiting for the > release to happen. Even if we're quick to react and disable the job, I > feel it might create some confusion already. 
My imagination was: if needed we'll get complaints, then we add the
allow_failure for that broken release only, and remove it again in the
next release.

>
> >> +  #
> >> +  # - The vmstate-static-checker script trips on renames and other
> >> +  #   backward-compatible changes to the vmstate structs.
> >
> > I think I keep my preference per last time we talked on this. :)
>
> Sorry, I'm not trying to force this in any way, I just wrote these to
> use in the pull-request and thought I'd put it out there. At the very
> least we can have your concerns documented. =)

Yep that's fine.  I think we should keep such discussion on the list,
especially since we have different opinions, while none of us got
convinced yet so far. :)

>
> > I still think it's too early to involve a test that can report false
> > negative.
>
> (1)
> Well, we haven't seen any false negatives, we've seen fields being
> renamed. If that happens, then we'll ask the person to update the
> script. Is that not acceptable to you? Or are you thinking about other
> sorts of issues?

Then the question is how to update the script.  So far a rename is
treated as a failure, even if it's benign.  Right now we have this:

            print("Section \"" + sec + "\",", end=' ')
            print("Description \"" + desc + "\":", end=' ')
            print("expected field \"" + s_item["field"] + "\",", end=' ')
            print("got \"" + d_item["field"] + "\"; skipping rest")
            bump_taint()
            break

Do you want to introduce a list of renamed vmsd fields in this script
and maintain that?  IMHO it's overkill and an unnecessary burden on
other developers.

>
> > I'd still keep running this before soft-freeze like I used to
> > do, throw issues to others and urge them to fix before release.
>
> Having hidden procedures that maintainers run before a release is bad
> IMHO, it just delays the catching of bugs and frustrates
> contributors. Imagine working on a series, everything goes well with
> reviews, CI passes, patch gets queued and merged and a month later you
> get a ping about something you should have done to avoid breaking
> migration. Right during freeze.

I understand your point, however I don't yet see a way CI could cover
everything.  CI won't cover performance tests, and I still run multifd
tests to at least smoke-test them before soft-freeze.

If something is done wrong, the sooner we notify the better.  Now it
looks to me the best trade-off is this: we notify at soft-freeze once
per release, considering that's pretty rare too, e.g. 9.0 has no such
outlier.

Again, I'm happy if we have a solution to not report false negatives;
that's the only concern I have.

>
> > Per my
> > previous experience that doesn't consume me a lot of time, and it's not
> > common to see issues either.
> >
> > So I want people to really pay attention when someone sees a migration CI
> > test failed, rather than we help people form the habit in "oh migration CI
> > failed again? I think that's fine, it allows failing anyway".
>
> That's a good point. I don't think it applies here though. See my point
> in (1).

Yes, if you have an answer to (1) I'm ok too.  Maybe I misunderstood
your plan there.

Thanks,

>
> > So far I still don't see as much benefit to adding this if we need to pay
> > for the other false negative issue. I'll fully support it if e.g. we can
> > fix the tool to avoid reporting false negatives, but that may take effort
> > that I didn't check.
> >
Peter Xu <peterx@redhat.com> writes: > On Mon, May 27, 2024 at 08:59:00PM -0300, Fabiano Rosas wrote: >> Peter Xu <peterx@redhat.com> writes: >> >> > On Thu, May 23, 2024 at 05:19:22PM -0300, Fabiano Rosas wrote: >> >> We have two new migration tests that check cross version >> >> compatibility. One uses the vmstate-static-checker.py script to >> >> compare the vmstate structures from two different QEMU versions. The >> >> other runs a simple migration with a few devices present in the VM, to >> >> catch obvious breakages. >> >> >> >> Add both tests to the migration-compat-common job. >> >> >> >> Signed-off-by: Fabiano Rosas <farosas@suse.de> >> >> --- >> >> .gitlab-ci.d/buildtest.yml | 43 +++++++++++++++++++++++++++++++------- >> >> 1 file changed, 36 insertions(+), 7 deletions(-) >> >> >> >> diff --git a/.gitlab-ci.d/buildtest.yml b/.gitlab-ci.d/buildtest.yml >> >> index 91c57efded..bc7ac35983 100644 >> >> --- a/.gitlab-ci.d/buildtest.yml >> >> +++ b/.gitlab-ci.d/buildtest.yml >> >> @@ -202,18 +202,47 @@ build-previous-qemu: >> >> needs: >> >> - job: build-previous-qemu >> >> - job: build-system-opensuse >> >> - # The old QEMU could have bugs unrelated to migration that are >> >> - # already fixed in the current development branch, so this test >> >> - # might fail. >> >> + # This test is allowed to fail because: >> >> + # >> >> + # - The old QEMU could have bugs unrelated to migration that are >> >> + # already fixed in the current development branch. >> > >> > Did you ever hit a real failure with this? I'm wondering whether we can >> > remove this allow_failure thing. >> > >> >> I haven't. But when it fails we'll go through an entire release cycle >> with this thing showing red for every person that runs the CI. Remember, >> this is a CI failure to which there's no fix aside from waiting for the >> release to happen. Even if we're quick to react and disable the job, I >> feel it might create some confusion already. 
> > My imagination was if needed we'll get complains and we add that until > then for that broken release only, and remove in the next release again. > >> >> >> + # >> >> + # - The vmstate-static-checker script trips on renames and other >> >> + # backward-compatible changes to the vmstate structs. >> > >> > I think I keep my preference per last time we talked on this. :) >> >> Sorry, I'm not trying to force this in any way, I just wrote these to >> use in the pull-request and thought I'd put it out there. At the very >> least we can have your concerns documented. =) > > Yep that's fine. I think we should keep such discussion on the list, > especially we have different opinions, while none of us got convinced yet > so far. :) > >> >> > I still think it's too early to involve a test that can report false >> > negative. >> >> (1) >> Well, we haven't seen any false negatives, we've seen fields being >> renamed. If that happens, then we'll ask the person to update the >> script. Is that not acceptable to you? Or are you thinking about other >> sorts of issues? > > Then question is how to update the script. So far it's treated as failure > on rename, even if it's benign. Right now we have this: > > print("Section \"" + sec + "\",", end=' ') > print("Description \"" + desc + "\":", end=' ') > print("expected field \"" + s_item["field"] + "\",", end=' ') > print("got \"" + d_item["field"] + "\"; skipping rest") > bump_taint() > break > > Do you want to introduce a list of renamed vmsd fields in this script and > maintain that? IMHO it's an overkill and unnecessary burden to other > developers. > That's not _my_ idea, we already have that (see below). There's not much reason to rename fields like that, the vmstate is obviously something that should be kept stable, so having to do a rename in a script is way better than having to figure out the fix for the compatibility break. 
def check_fields_match(name, s_field, d_field):
    if s_field == d_field:
        return True

    # Some fields changed names between qemu versions.  This list
    # is used to allow such changes in each section / description.
    changed_names = {
        'apic': ['timer', 'timer_expiry'],
        'e1000': ['dev', 'parent_obj'],
        'ehci': ['dev', 'pcidev'],
        'I440FX': ['dev', 'parent_obj'],
        'ich9_ahci': ['card', 'parent_obj'],
        'ich9-ahci': ['ahci', 'ich9_ahci'],
        'ioh3420': ['PCIDevice', 'PCIEDevice'],
        'ioh-3240-express-root-port': ['port.br.dev',
                                       'parent_obj.parent_obj.parent_obj',
                                       'port.br.dev.exp.aer_log',
                                       'parent_obj.parent_obj.parent_obj.exp.aer_log'],
        'cirrus_vga': ['hw_cursor_x', 'vga.hw_cursor_x',
                       'hw_cursor_y', 'vga.hw_cursor_y'],
        'lsiscsi': ['dev', 'parent_obj'],
        'mch': ['d', 'parent_obj'],
        'pci_bridge': ['bridge.dev', 'parent_obj', 'bridge.dev.shpc', 'shpc'],
        'pcnet': ['pci_dev', 'parent_obj'],
        'PIIX3': ['pci_irq_levels', 'pci_irq_levels_vmstate'],
        'piix4_pm': ['dev', 'parent_obj', 'pci0_status',
                     'acpi_pci_hotplug.acpi_pcihp_pci_status[0x0]',
                     'pm1a.sts', 'ar.pm1.evt.sts', 'pm1a.en', 'ar.pm1.evt.en',
                     'pm1_cnt.cnt', 'ar.pm1.cnt.cnt',
                     'tmr.timer', 'ar.tmr.timer',
                     'tmr.overflow_time', 'ar.tmr.overflow_time',
                     'gpe', 'ar.gpe'],
        'rtl8139': ['dev', 'parent_obj'],
        'qxl': ['num_surfaces', 'ssd.num_surfaces'],
        'usb-ccid': ['abProtocolDataStructure', 'abProtocolDataStructure.data'],
        'usb-host': ['dev', 'parent_obj'],
        'usb-mouse': ['usb-ptr-queue', 'HIDPointerEventQueue'],
        'usb-tablet': ['usb-ptr-queue', 'HIDPointerEventQueue'],
        'vmware_vga': ['card', 'parent_obj'],
        'vmware_vga_internal': ['depth', 'new_depth'],
        'xhci': ['pci_dev', 'parent_obj'],
        'x3130-upstream': ['PCIDevice', 'PCIEDevice'],
        'xio3130-express-downstream-port': ['port.br.dev',
                                            'parent_obj.parent_obj.parent_obj',
                                            'port.br.dev.exp.aer_log',
                                            'parent_obj.parent_obj.parent_obj.exp.aer_log'],
        'xio3130-downstream': ['PCIDevice', 'PCIEDevice'],
        'xio3130-express-upstream-port': ['br.dev', 'parent_obj.parent_obj',
                                          'br.dev.exp.aer_log',
                                          'parent_obj.parent_obj.exp.aer_log'],
        'spapr_pci': ['dma_liobn[0]', 'mig_liobn',
                      'mem_win_addr', 'mig_mem_win_addr',
                      'mem_win_size', 'mig_mem_win_size',
                      'io_win_addr', 'mig_io_win_addr',
                      'io_win_size', 'mig_io_win_size'],
    }

    if not name in changed_names:
        return False

    if s_field in changed_names[name] and d_field in changed_names[name]:
        return True

    return False

>>
>> > I'd still keep running this before soft-freeze like I used to
>> > do, throw issues to others and urge them to fix before release.
>>
>> Having hidden procedures that maintainers run before a release is bad
>> IMHO, it just delays the catching of bugs and frustrates
>> contributors. Imagine working on a series, everything goes well with
>> reviews, CI passes, patch gets queued and merged and a month later you
>> get a ping about something you should have done to avoid breaking
>> migration. Right during freeze.
>
> I understand your point, however I don't yet see a way CI could cover
> everything. CI won't cover performance test, and I still ran multifd tests
> at least smoke it too before soft-freeze.
>
> If there's something done wrong, we notify the sooner the better. Now it
> looks to me the best trade-off is like that - we notify at soft-freeze once
> per release considering that's pretty rare too, e.g. 9.0 has no such outlier.

We should notify before merging any code. If there is a tool that can
catch the bug, there's no point in wasting everyone's time carrying that
patch forward. That's the purpose of a CI setup. These tests are very
light in terms of resources.

Now, you're making the point that these issues are rare, which they may
very well be; I haven't run this long enough to know. I also don't know
how to measure that; I believe our CI already has a large amount of
tests that never catch anything, so I'm not sure how much we should take
that into consideration when adding tests to the CI.

About the possibly false results, we'll have to hold merging this for a
few releases and gather more data, and see how much of a bother it is to
keep updating these fields. I expect more bugs to be caught by the
script than renames to happen.

>
> Again I'm happy if we have a solution to not report false negatives; that's
> the only concern I have.

>>
>> > Per my
>> > previous experience that doesn't consume me a lot of time, and it's not
>> > common to see issues either.
>> >
>> > So I want people to really pay attention when someone sees a migration CI
>> > test failed, rather than we help people form the habit in "oh migration CI
>> > failed again? I think that's fine, it allows failing anyway".
>>
>> That's a good point. I don't think it applies here though. See my point
>> in (1).
>
> Yes, if you have an answer to (1) I'm ok too. Maybe I misunderstood your
> plan there.
>
> Thanks,
>
>> > So far I still don't see as much benefit to adding this if we need to pay
>> > for the other false negative issue. I'll fully support it if e.g. we can
>> > fix the tool to avoid reporting false negatives, but that may take effort
>> > that I didn't check.
>> >
>>
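For readers of the archive, the whitelist mechanism being discussed can be exercised in isolation. A trimmed-down sketch, keeping only two entries of the real table purely for illustration:

```python
# Trimmed-down sketch of the rename whitelist in vmstate-static-checker.py:
# a field rename is tolerated only when both the old and the new name are
# listed for that section.  Only two real entries are kept here.
changed_names = {
    'apic': ['timer', 'timer_expiry'],
    'e1000': ['dev', 'parent_obj'],
}

def check_fields_match(name, s_field, d_field):
    # Identical names always match.
    if s_field == d_field:
        return True
    if name not in changed_names:
        return False
    # Benign rename: both names must appear in the section's whitelist.
    return s_field in changed_names[name] and d_field in changed_names[name]

# Whitelisted rename: accepted instead of bumping the taint counter.
print(check_fields_match('apic', 'timer', 'timer_expiry'))  # True
# Unknown rename: still reported as a mismatch.
print(check_fields_match('apic', 'timer', 'tmr'))           # False
```

This is the behavior at the center of the disagreement: a rename only stops being a "failure" once someone adds it to the table.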
On Tue, May 28, 2024 at 03:10:48PM -0300, Fabiano Rosas wrote: > Peter Xu <peterx@redhat.com> writes: > > > On Mon, May 27, 2024 at 08:59:00PM -0300, Fabiano Rosas wrote: > >> Peter Xu <peterx@redhat.com> writes: > >> > >> > On Thu, May 23, 2024 at 05:19:22PM -0300, Fabiano Rosas wrote: > >> >> We have two new migration tests that check cross version > >> >> compatibility. One uses the vmstate-static-checker.py script to > >> >> compare the vmstate structures from two different QEMU versions. The > >> >> other runs a simple migration with a few devices present in the VM, to > >> >> catch obvious breakages. > >> >> > >> >> Add both tests to the migration-compat-common job. > >> >> > >> >> Signed-off-by: Fabiano Rosas <farosas@suse.de> > >> >> --- > >> >> .gitlab-ci.d/buildtest.yml | 43 +++++++++++++++++++++++++++++++------- > >> >> 1 file changed, 36 insertions(+), 7 deletions(-) > >> >> > >> >> diff --git a/.gitlab-ci.d/buildtest.yml b/.gitlab-ci.d/buildtest.yml > >> >> index 91c57efded..bc7ac35983 100644 > >> >> --- a/.gitlab-ci.d/buildtest.yml > >> >> +++ b/.gitlab-ci.d/buildtest.yml > >> >> @@ -202,18 +202,47 @@ build-previous-qemu: > >> >> needs: > >> >> - job: build-previous-qemu > >> >> - job: build-system-opensuse > >> >> - # The old QEMU could have bugs unrelated to migration that are > >> >> - # already fixed in the current development branch, so this test > >> >> - # might fail. > >> >> + # This test is allowed to fail because: > >> >> + # > >> >> + # - The old QEMU could have bugs unrelated to migration that are > >> >> + # already fixed in the current development branch. > >> > > >> > Did you ever hit a real failure with this? I'm wondering whether we can > >> > remove this allow_failure thing. > >> > > >> > >> I haven't. But when it fails we'll go through an entire release cycle > >> with this thing showing red for every person that runs the CI. 
Remember, > >> this is a CI failure to which there's no fix aside from waiting for the > >> release to happen. Even if we're quick to react and disable the job, I > >> feel it might create some confusion already. > > > > My imagination was if needed we'll get complains and we add that until > > then for that broken release only, and remove in the next release again. > > > >> > >> >> + # > >> >> + # - The vmstate-static-checker script trips on renames and other > >> >> + # backward-compatible changes to the vmstate structs. > >> > > >> > I think I keep my preference per last time we talked on this. :) > >> > >> Sorry, I'm not trying to force this in any way, I just wrote these to > >> use in the pull-request and thought I'd put it out there. At the very > >> least we can have your concerns documented. =) > > > > Yep that's fine. I think we should keep such discussion on the list, > > especially we have different opinions, while none of us got convinced yet > > so far. :) > > > >> > >> > I still think it's too early to involve a test that can report false > >> > negative. > >> > >> (1) > >> Well, we haven't seen any false negatives, we've seen fields being > >> renamed. If that happens, then we'll ask the person to update the > >> script. Is that not acceptable to you? Or are you thinking about other > >> sorts of issues? > > > > Then question is how to update the script. So far it's treated as failure > > on rename, even if it's benign. Right now we have this: > > > > print("Section \"" + sec + "\",", end=' ') > > print("Description \"" + desc + "\":", end=' ') > > print("expected field \"" + s_item["field"] + "\",", end=' ') > > print("got \"" + d_item["field"] + "\"; skipping rest") > > bump_taint() > > break > > > > Do you want to introduce a list of renamed vmsd fields in this script and > > maintain that? IMHO it's an overkill and unnecessary burden to other > > developers. > > > > That's not _my_ idea, we already have that (see below). 
There's not much > reason to rename fields like that, the vmstate is obviously something > that should be kept stable, so having to do a rename in a script is way > better than having to figure out the fix for the compatibility break. > > def check_fields_match(name, s_field, d_field): > if s_field == d_field: > return True > > # Some fields changed names between qemu versions. This list > # is used to allow such changes in each section / description. > changed_names = { > 'apic': ['timer', 'timer_expiry'], > 'e1000': ['dev', 'parent_obj'], > 'ehci': ['dev', 'pcidev'], > 'I440FX': ['dev', 'parent_obj'], > 'ich9_ahci': ['card', 'parent_obj'], > 'ich9-ahci': ['ahci', 'ich9_ahci'], > 'ioh3420': ['PCIDevice', 'PCIEDevice'], > 'ioh-3240-express-root-port': ['port.br.dev', > 'parent_obj.parent_obj.parent_obj', > 'port.br.dev.exp.aer_log', > 'parent_obj.parent_obj.parent_obj.exp.aer_log'], > 'cirrus_vga': ['hw_cursor_x', 'vga.hw_cursor_x', > 'hw_cursor_y', 'vga.hw_cursor_y'], > 'lsiscsi': ['dev', 'parent_obj'], > 'mch': ['d', 'parent_obj'], > 'pci_bridge': ['bridge.dev', 'parent_obj', 'bridge.dev.shpc', 'shpc'], > 'pcnet': ['pci_dev', 'parent_obj'], > 'PIIX3': ['pci_irq_levels', 'pci_irq_levels_vmstate'], > 'piix4_pm': ['dev', 'parent_obj', 'pci0_status', > 'acpi_pci_hotplug.acpi_pcihp_pci_status[0x0]', > 'pm1a.sts', 'ar.pm1.evt.sts', 'pm1a.en', 'ar.pm1.evt.en', > 'pm1_cnt.cnt', 'ar.pm1.cnt.cnt', > 'tmr.timer', 'ar.tmr.timer', > 'tmr.overflow_time', 'ar.tmr.overflow_time', > 'gpe', 'ar.gpe'], > 'rtl8139': ['dev', 'parent_obj'], > 'qxl': ['num_surfaces', 'ssd.num_surfaces'], > 'usb-ccid': ['abProtocolDataStructure', 'abProtocolDataStructure.data'], > 'usb-host': ['dev', 'parent_obj'], > 'usb-mouse': ['usb-ptr-queue', 'HIDPointerEventQueue'], > 'usb-tablet': ['usb-ptr-queue', 'HIDPointerEventQueue'], > 'vmware_vga': ['card', 'parent_obj'], > 'vmware_vga_internal': ['depth', 'new_depth'], > 'xhci': ['pci_dev', 'parent_obj'], > 'x3130-upstream': ['PCIDevice', 'PCIEDevice'], > 
'xio3130-express-downstream-port': ['port.br.dev', > 'parent_obj.parent_obj.parent_obj', > 'port.br.dev.exp.aer_log', > 'parent_obj.parent_obj.parent_obj.exp.aer_log'], > 'xio3130-downstream': ['PCIDevice', 'PCIEDevice'], > 'xio3130-express-upstream-port': ['br.dev', 'parent_obj.parent_obj', > 'br.dev.exp.aer_log', > 'parent_obj.parent_obj.exp.aer_log'], > 'spapr_pci': ['dma_liobn[0]', 'mig_liobn', > 'mem_win_addr', 'mig_mem_win_addr', > 'mem_win_size', 'mig_mem_win_size', > 'io_win_addr', 'mig_io_win_addr', > 'io_win_size', 'mig_io_win_size'], > } > > if not name in changed_names: > return False > > if s_field in changed_names[name] and d_field in changed_names[name]: > return True > > return False Ouch, I didn't notice that.. Ok that's a fair request then! > > > >> > >> > I'd still keep running this before soft-freeze like I used to > >> > do, throw issues to others and urge them to fix before release. > >> > >> Having hidden procedures that maintainers run before a release is bad > >> IMHO, it just delays the catching of bugs and frustrates > >> contributors. Imagine working on a series, everything goes well with > >> reviews, CI passes, patch gets queued and merged and a month later you > >> get a ping about something you should have done to avoid breaking > >> migration. Right during freeze. > > > > I understand your point, however I don't yet see a way CI could cover > > everything. CI won't cover performance test, and I still ran multifd tests > > at least smoke it too before soft-freeze. > > > > If there's something done wrong, we notify the sooner the better. Now it > > looks to me the best trade-off is like that - we notify at soft-freeze once > > per release considering that's pretty rare too, e.g. 9.0 has no such outlier. > > We should notify before merging any code. If there is a tool that can > catch the bug, there's no point in wasting everyone's time carrying that > patch forward. That's the purpose of a CI setup. 
These tests are very
> light in terms of resources.
>
> Now, you're making the point that these issues are rare, which they may
> very well be, I haven't ran this long enough to know. I also don't know
> how to measure that, I believe our CI already has a large amount of
> tests that never catch anything, so I'm not sure how much we should take
> it in consideration when adding tests to the CI.
>
> About the possibly false results, we'll have to hold merging this for a
> few releases and gather more data, see how much of a bother it is to
> keep updating these fields. I expect to have more bugs caught by the
> script than any renames happening.

When looking (trying to fill up the whitelist for the 8.2 machine type
on "esp"), I found one more thing.  See:

    commit b46a43a2241476d13486e016a0809b690b65f90e
    Author: Mark Cave-Ayland <mark.cave-ayland@ilande.co.uk>
    Date:   Fri Jan 12 12:53:48 2024 +0000

        esp.c: remove unused PDMA callback implementation

        Note that this is a migration break for the q800 machine because the
        extra PDMA information is no longer included.

        Signed-off-by: Mark Cave-Ayland <mark.cave-ayland@ilande.co.uk>
        Tested-by: Helge Deller <deller@gmx.de>
        Tested-by: Thomas Huth <thuth@redhat.com>
        Message-Id: <20240112125420.514425-57-mark.cave-ayland@ilande.co.uk>
        Signed-off-by: Mark Cave-Ayland <mark.cave-ayland@ilande.co.uk>

It deliberately wants to break migration.  That might be a problem now,
since it means it'll reliably fail the vmstate checker CI test, and then
people will say "this is expected".  Should the maintainer merge such a
change?  Once merged, everything running the vmstate checker will always
fail on it.  Maybe we'll need to add another whitelist to ignore some
devices?  It gets a bit hairy..
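For what it's worth, the "another whitelist" idea floated above could be as small as a per-section list of releases where a break was accepted on purpose. This is a hypothetical sketch only — nothing like it exists in vmstate-static-checker.py, and the section and release names are illustrative:

```python
# Hypothetical sketch (not part of vmstate-static-checker.py): vmstate
# sections whose incompatibility in a given release is a known,
# deliberate break, like "esp" on q800 in the commit quoted above.
# The release number is illustrative.
KNOWN_DELIBERATE_BREAKS = {
    'esp': {'9.0'},
}

def is_accepted_break(section, release):
    """Return True if a mismatch in this section should not taint the result."""
    return release in KNOWN_DELIBERATE_BREAKS.get(section, set())

print(is_accepted_break('esp', '9.0'))    # True: known break, skip the taint
print(is_accepted_break('e1000', '9.0'))  # False: still a real failure
```

The maintenance cost Peter worries about is visible here: every deliberate break adds an entry that somebody has to remember to remove.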
diff --git a/.gitlab-ci.d/buildtest.yml b/.gitlab-ci.d/buildtest.yml
index 91c57efded..bc7ac35983 100644
--- a/.gitlab-ci.d/buildtest.yml
+++ b/.gitlab-ci.d/buildtest.yml
@@ -202,18 +202,47 @@ build-previous-qemu:
   needs:
     - job: build-previous-qemu
     - job: build-system-opensuse
-  # The old QEMU could have bugs unrelated to migration that are
-  # already fixed in the current development branch, so this test
-  # might fail.
+  # This test is allowed to fail because:
+  #
+  # - The old QEMU could have bugs unrelated to migration that are
+  #   already fixed in the current development branch.
+  #
+  # - The vmstate-static-checker script trips on renames and other
+  #   backward-compatible changes to the vmstate structs.
   allow_failure: true
   variables:
     IMAGE: opensuse-leap
     MAKE_CHECK_ARGS: check-build
   script:
-    # Use the migration-tests from the older QEMU tree. This avoids
-    # testing an old QEMU against new features/tests that it is not
-    # compatible with.
-    - cd build-previous
+    - cd build
+    # device state static test: Tests the vmstate structures for
+    # compatibility across QEMU versions. Uses the latest version of
+    # the tests.
+    # old to new
+    - PYTHON=pyvenv/bin/python3
+      QTEST_QEMU_BINARY_SRC=../build-previous/qemu-system-${TARGET}
+      QTEST_QEMU_BINARY=./qemu-system-${TARGET}
+      ./tests/qtest/migration-test -p /${TARGET}/migration/vmstate-checker-script
+    # new to old skipped because vmstate version bumps are always
+    # backward incompatible.
+
+    # device state runtime test: Performs a cross-version migration
+    # with a select list of devices (see DEFAULT_DEVICES in
+    # migration-test.c). Using the multifd tcp test here, but any will
+    # do.
+    # old to new
+    - QTEST_DEVICE_OPTS=all QTEST_QEMU_BINARY_SRC=../build-previous/qemu-system-${TARGET}
+      QTEST_QEMU_BINARY=./qemu-system-${TARGET} ./tests/qtest/migration-test
+      -p /${TARGET}/migration/multifd/tcp/channels/plain/none
+    # new to old
+    - QTEST_DEVICE_OPTS=all QTEST_QEMU_BINARY_DST=../build-previous/qemu-system-${TARGET}
+      QTEST_QEMU_BINARY=./qemu-system-${TARGET} ./tests/qtest/migration-test
+      -p /${TARGET}/migration/multifd/tcp/channels/plain/none
+
+    # migration core tests: Use the migration-tests from the older
+    # QEMU tree. This avoids testing an old QEMU against new
+    # features/tests that it is not compatible with.
+    - cd ../build-previous
     # old to new
     - QTEST_QEMU_BINARY_SRC=./qemu-system-${TARGET}
       QTEST_QEMU_BINARY=../build/qemu-system-${TARGET} ./tests/qtest/migration-test
We have two new migration tests that check cross version
compatibility. One uses the vmstate-static-checker.py script to
compare the vmstate structures from two different QEMU versions. The
other runs a simple migration with a few devices present in the VM, to
catch obvious breakages.

Add both tests to the migration-compat-common job.

Signed-off-by: Fabiano Rosas <farosas@suse.de>
---
 .gitlab-ci.d/buildtest.yml | 43 +++++++++++++++++++++++++++++++-------
 1 file changed, 36 insertions(+), 7 deletions(-)