Message ID: cover.1574356137.git.lukasstraub2@web.de
Series: colo: Introduce resource agent and high-level test
* Lukas Straub (lukasstraub2@web.de) wrote:
> Hello Everyone,
> These patches introduce a resource agent for use with the Pacemaker CRM and a
> high-level test utilizing it for testing qemu COLO.
>
> The resource agent manages qemu COLO including continuous replication.
>
> Currently the second test case (where the peer qemu is frozen) fails on primary
> failover, because qemu hangs while removing the replication related block nodes.
> Note that this also happens in real-world tests when cutting power to the peer
> host, so this needs to be fixed.

Do you understand why that happens? Is it trying to finish a
read/write to the dead partner?

Dave

> Based-on: <cover.1571925699.git.lukasstraub2@web.de>
> ([PATCH v7 0/4] colo: Add support for continuous replication)
>
> Lukas Straub (4):
>   block/quorum.c: stable children names
>   colo: Introduce resource agent
>   colo: Introduce high-level test
>   MAINTAINERS: Add myself as maintainer for COLO resource agent
>
>  MAINTAINERS                            |    6 +
>  block/quorum.c                         |    6 +
>  scripts/colo-resource-agent/colo       | 1026 ++++++++++++++++++++++++
>  scripts/colo-resource-agent/crm_master |   44 +
>  tests/acceptance/colo.py               |  444 ++++++++++
>  5 files changed, 1526 insertions(+)
>  create mode 100755 scripts/colo-resource-agent/colo
>  create mode 100755 scripts/colo-resource-agent/crm_master
>  create mode 100644 tests/acceptance/colo.py
>
> --
> 2.20.1
--
Dr. David Alan Gilbert / dgilbert@redhat.com / Manchester, UK
On Fri, 22 Nov 2019 09:46:46 +0000
"Dr. David Alan Gilbert" <dgilbert@redhat.com> wrote:

> * Lukas Straub (lukasstraub2@web.de) wrote:
> > Currently the second test case (where the peer qemu is frozen) fails on primary
> > failover, because qemu hangs while removing the replication related block nodes.
> > Note that this also happens in real-world tests when cutting power to the peer
> > host, so this needs to be fixed.
>
> Do you understand why that happens? Is it trying to finish a
> read/write to the dead partner?
>
> Dave

I haven't looked into it too closely yet, but it's often hanging in bdrv_flush()
while removing the replication blockdev, and that's probably because the
nbd client is waiting for a reply. So I tried the workaround below, which
actively kills the TCP connection, and with it the test passes, though I haven't
tested it in the real world yet.

A proper solution would probably be a "force" parameter for blockdev-del,
which skips all flushing and aborts all in-flight I/O. Or we could add a
timeout to the nbd client.

Regards,
Lukas Straub

diff --git a/scripts/colo-resource-agent/colo b/scripts/colo-resource-agent/colo
index 5fd9cfc0b5..62210af2a1 100755
--- a/scripts/colo-resource-agent/colo
+++ b/scripts/colo-resource-agent/colo
@@ -935,6 +935,7 @@ def qemu_colo_notify():
             and HOSTNAME == str.strip(OCF_RESKEY_CRM_meta_notify_master_uname):
         fd = qmp_open()
         peer = qmp_get_nbd_remote(fd)
+        os.system("sudo ss -K dst %s dport = %s" % (peer, NBD_PORT))
         if peer == str.strip(OCF_RESKEY_CRM_meta_notify_stop_uname):
             if qmp_check_resync(fd) != None:
                 qmp_cancel_resync(fd)
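[Editor's note: the os.system() call in the workaround above interpolates the peer name into a shell command line. A hardened sketch of the same idea using subprocess is below; the helper names are illustrative, not part of the patch, and `ss -K` still requires a kernel built with CONFIG_INET_DIAG_DESTROY.]

```python
import subprocess

def build_kill_cmd(peer, port):
    # Argv for destroying the NBD TCP connection to the peer.
    # "ss -K" asks the kernel to actively close matching sockets,
    # which unblocks the nbd client waiting on a dead partner.
    return ["sudo", "ss", "-K", "dst", str(peer), "dport", "=", str(port)]

def kill_nbd_connection(peer, port):
    # subprocess.run with an argv list avoids the shell-injection risk
    # of os.system with an interpolated hostname; check=False because
    # "no matching socket" is not an error for this workaround.
    return subprocess.run(build_kill_cmd(peer, port), check=False).returncode
```

Whether killing the socket from the resource agent is safe on a production cluster (as opposed to the test harness) is exactly what the thread leaves untested.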
On Wed, 27 Nov 2019 22:11:34 +0100
Lukas Straub <lukasstraub2@web.de> wrote:

> On Fri, 22 Nov 2019 09:46:46 +0000
> "Dr. David Alan Gilbert" <dgilbert@redhat.com> wrote:
>
> > Do you understand why that happens? Is it trying to finish a
> > read/write to the dead partner?
> >
> > Dave
>
> I haven't looked into it too closely yet, but it's often hanging in bdrv_flush()
> while removing the replication blockdev, and that's probably because the
> nbd client is waiting for a reply. So I tried the workaround below, which
> actively kills the TCP connection, and with it the test passes, though I haven't
> tested it in the real world yet.

In the real cluster, sometimes qemu even hangs while connecting to qmp (after
remote poweroff). But I currently don't have the time to look into it.

Still, a failing test is better than no test. Could we mark this test as
known-bad and fix this issue later? How should I mark it as known-bad? By tag?
Or warn in the log?

Regards,
Lukas Straub
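[Editor's note: the thread leaves the known-bad question open. One possibility, sketched below under the assumption that the acceptance test keeps using Avocado, is a table of known-bad cases that the test consults and cancels on, so the case reports CANCEL rather than FAIL; names here are hypothetical.]

```python
# Hypothetical table of known-bad cases and the reason each one fails.
KNOWN_BAD = {
    "test_peer_frozen_failover":
        "qemu hangs in bdrv_flush() while removing replication block nodes",
}

def known_bad_reason(test_name):
    # Returns the reason string if the named test is known-bad, else None.
    return KNOWN_BAD.get(test_name)

# Inside an avocado.Test method one might then do (sketch):
#     reason = known_bad_reason("test_peer_frozen_failover")
#     if reason:
#         self.cancel("known bad: " + reason)
```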
* Lukas Straub (lukasstraub2@web.de) wrote:
> On Wed, 27 Nov 2019 22:11:34 +0100
> Lukas Straub <lukasstraub2@web.de> wrote:
>
> > I haven't looked into it too closely yet, but it's often hanging in bdrv_flush()
> > while removing the replication blockdev, and that's probably because the
> > nbd client is waiting for a reply. So I tried the workaround below, which
> > actively kills the TCP connection, and with it the test passes, though I haven't
> > tested it in the real world yet.
>
> In the real cluster, sometimes qemu even hangs while connecting to qmp (after
> remote poweroff). But I currently don't have the time to look into it.

That doesn't surprise me too much; QMP is mostly handled in the main
thread, as are a lot of other things, which is why I've assumed for a
while that COLO hangs there. However, there's a way to fix it. A while
ago, Peter Xu added a feature called 'out of band' to QMP; you can open
a QMP connection, set the OOB feature, and then commands that are marked
as OOB are executed off the main thread on that connection.
At the moment we've just got one real OOB command, 'migrate-recover',
which is used for recovering postcopy from a failure similar to the COLO
case. To fix this you'd have to convert colo-lost-heartbeat to be an OOB
command; note it's not that trivial, because you have to make sure the
code that's run as part of the OOB command doesn't take any locks that
could block on something in the main thread. So it can set flags, start
new threads, perhaps call shutdown() on a socket; but it takes some
thinking about.

> Still, a failing test is better than no test. Could we mark this test as
> known-bad and fix this issue later? How should I mark it as known-bad? By tag?
> Or warn in the log?

Not sure about that; cc'ing Thomas, maybe thuth knows?

Dave
--
Dr. David Alan Gilbert / dgilbert@redhat.com / Manchester, UK
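[Editor's note: for reference, the OOB capability described above is negotiated per connection, and OOB commands are sent with the "exec-oob" key instead of "execute". A minimal sketch of the message framing follows; the helper names are illustrative, and a real client would normally use QEMU's own Python QMP module rather than hand-rolled JSON.]

```python
import json

def qmp_enable_oob():
    # Capability negotiation: request out-of-band execution on this
    # connection. The server's greeting must list "oob" under
    # QMP.capabilities for this to succeed.
    return json.dumps({"execute": "qmp_capabilities",
                       "arguments": {"enable": ["oob"]}})

def qmp_oob_command(name, **args):
    # Out-of-band commands use "exec-oob" instead of "execute"; qemu
    # runs them off the main thread, so they can still make progress
    # while the main loop is stuck in e.g. bdrv_flush().
    msg = {"exec-oob": name}
    if args:
        msg["arguments"] = args
    return json.dumps(msg)
```

This is why a converted colo-lost-heartbeat could still respond after the peer dies: the OOB path never waits on the blocked main thread, at the cost of the restrictions on locking that Dave describes.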