Message ID | cover.1589199922.git.lukasstraub2@web.de (mailing list archive) |
---|---|
Series | colo: Introduce resource agent and test suite/CI |
> -----Original Message-----
> From: Lukas Straub <lukasstraub2@web.de>
> Sent: Monday, May 11, 2020 8:27 PM
> To: qemu-devel <qemu-devel@nongnu.org>
> Cc: Alberto Garcia <berto@igalia.com>; Dr. David Alan Gilbert
> <dgilbert@redhat.com>; Zhang, Chen <chen.zhang@intel.com>
> Subject: [PATCH 0/5] colo: Introduce resource agent and test suite/CI
>
> Hello Everyone,
> These patches introduce a resource agent for fully automatic management of
> colo and a test suite building upon the resource agent to extensively test colo.
>
> Test suite features:
> -Tests failover with peer crashing and hanging and failover during checkpoint
> -Tests network using ssh and iperf3
> -Quick test requires no special configuration
> -Network test for testing colo-compare
> -Stress test: failover all the time with network load
>
> Resource agent features:
> -Fully automatic management of colo
> -Handles many failures: hanging/crashing qemu, replication error, disk
> error, ...
> -Recovers from hanging qemu by using the "yank" oob command
> -Tracks which node has up-to-date data
> -Works well in clusters with more than 2 nodes
>
> Run times on my laptop:
> Quick test: 200s
> Network test: 800s (tagged as slow)
> Stress test: 1300s (tagged as slow)
>
> The test suite needs access to a network bridge to properly test the network,
> so some parameters need to be given to the test run. See
> tests/acceptance/colo.py for more information.
>
> I wonder how this integrates in existing CI infrastructure. Is there a common
> CI for qemu where this can run or does every subsystem have to run their
> own CI?

Wow~ Very happy to see this series.
I have checked the "how to" in tests/acceptance/colo.py, but it does not look sufficient for users. Can you write an independent document for this series?
It should include an ASCII diagram of the test infrastructure, the test case design, a detailed how-to, and more information on the pacemaker cluster and resource agent, etc.

Thanks
Zhang Chen

>
> Regards,
> Lukas Straub
>
>
> Lukas Straub (5):
>   block/quorum.c: stable children names
>   colo: Introduce resource agent
>   colo: Introduce high-level test suite
>   configure,Makefile: Install colo resource-agent
>   MAINTAINERS: Add myself as maintainer for COLO resource agent
>
>  MAINTAINERS | 6 +
>  Makefile | 5 +
>  block/quorum.c | 20 +-
>  configure | 10 +
>  scripts/colo-resource-agent/colo | 1429 ++++++++++++++++++++++
>  scripts/colo-resource-agent/crm_master | 44 +
>  scripts/colo-resource-agent/crm_resource | 12 +
>  tests/acceptance/colo.py | 689 +++++++++++
>  8 files changed, 2209 insertions(+), 6 deletions(-)
>  create mode 100755 scripts/colo-resource-agent/colo
>  create mode 100755 scripts/colo-resource-agent/crm_master
>  create mode 100755 scripts/colo-resource-agent/crm_resource
>  create mode 100644 tests/acceptance/colo.py
>
> --
> 2.20.1

On Mon, 18 May 2020 09:38:24 +0000
"Zhang, Chen" <chen.zhang@intel.com> wrote:

> Wow~ Very happy to see this series.
> I have checked the "how to" in tests/acceptance/colo.py, but it does not
> look sufficient for users. Can you write an independent document for this
> series?
> It should include an ASCII diagram of the test infrastructure, the test
> case design, a detailed how-to, and more information on the pacemaker
> cluster and resource agent, etc.

Hi,
I quickly created a more complete howto for configuring a pacemaker cluster and using the resource agent. I hope it helps:
https://wiki.qemu.org/Features/COLO/Managed_HOWTO

Regards,
Lukas Straub

> Thanks
> Zhang Chen

> -----Original Message-----
> From: Lukas Straub <lukasstraub2@web.de>
> Sent: Sunday, June 7, 2020 3:00 AM
> To: Zhang, Chen <chen.zhang@intel.com>
> Cc: qemu-devel <qemu-devel@nongnu.org>; Alberto Garcia
> <berto@igalia.com>; Dr. David Alan Gilbert <dgilbert@redhat.com>; Jason
> Wang <jasowang@redhat.com>
> Subject: Re: [PATCH 0/5] colo: Introduce resource agent and test suite/CI
>
> On Mon, 18 May 2020 09:38:24 +0000
> "Zhang, Chen" <chen.zhang@intel.com> wrote:
>
> > Wow~ Very happy to see this series.
> > I have checked the "how to" in tests/acceptance/colo.py, but it does not
> > look sufficient for users. Can you write an independent document for this
> > series?
> > It should include an ASCII diagram of the test infrastructure, the test
> > case design, a detailed how-to, and more information on the pacemaker
> > cluster and resource agent, etc.
>
> Hi,
> I quickly created a more complete howto for configuring a pacemaker cluster
> and using the resource agent. I hope it helps:
> https://wiki.qemu.org/Features/COLO/Managed_HOWTO

Hi Lukas,

I noticed you contributed some content to the QEMU COLO wiki.
For the Features/COLO/Manual HOWTO
https://wiki.qemu.org/Features/COLO/Manual_HOWTO

Why not keep the secondary-side start command the same as in qemu/docs/COLO-FT.txt?
If I understand correctly, adding the quorum-related command on the secondary will support resuming replication.
Then we can add the primary/secondary resume steps here.

Thanks
Zhang Chen

>
> Regards,
> Lukas Straub

On Tue, 16 Jun 2020 01:42:45 +0000
"Zhang, Chen" <chen.zhang@intel.com> wrote:

> > Hi,
> > I quickly created a more complete howto for configuring a pacemaker cluster
> > and using the resource agent. I hope it helps:
> > https://wiki.qemu.org/Features/COLO/Managed_HOWTO
>
> Hi Lukas,
>
> I noticed you contributed some content to the QEMU COLO wiki.
> For the Features/COLO/Manual HOWTO
> https://wiki.qemu.org/Features/COLO/Manual_HOWTO
>
> Why not keep the secondary-side start command the same as in qemu/docs/COLO-FT.txt?
> If I understand correctly, adding the quorum-related command on the secondary will support resuming replication.
> Then we can add the primary/secondary resume steps here.

I haven't updated the wiki from qemu/docs/COLO-FT.txt yet; I just moved it there from the main page.

Regards,
Lukas Straub

> Thanks
> Zhang Chen