[0/5] colo: Introduce resource agent and test suite/CI

Message ID cover.1589199922.git.lukasstraub2@web.de (mailing list archive)

Message

Lukas Straub May 11, 2020, 12:26 p.m. UTC
Hello Everyone,
These patches introduce a resource agent for fully automatic management of colo
and a test suite building upon the resource agent to extensively test colo.

Test suite features:
-Tests failover with peer crashing and hanging and failover during checkpoint
-Tests network using ssh and iperf3
-Quick test requires no special configuration
-Network test for testing colo-compare
-Stress test: failover all the time with network load

Resource agent features:
-Fully automatic management of colo
-Handles many failures: hanging/crashing qemu, replication error, disk error, ...
-Recovers from hanging qemu by using the "yank" oob command
-Tracks which node has up-to-date data
-Works well in clusters with more than 2 nodes
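
For illustration only, here is a minimal sketch of what requesting out-of-band execution looks like at the QMP message level. The message shapes (enabling the "oob" capability during negotiation and using the "exec-oob" key instead of "execute") follow the QMP specification. The helper names and the "yank-1" id are hypothetical, and the arguments of "yank" are not described in this cover letter, so none are passed here; treat that as an assumption, not as the command's actual signature.

```python
import json


def qmp_capabilities_with_oob():
    """Capability negotiation message that enables out-of-band execution."""
    return {"execute": "qmp_capabilities", "arguments": {"enable": ["oob"]}}


def oob_command(name, arguments=None, cmd_id=None):
    """Build a QMP message requesting out-of-band execution of `name`.

    Hypothetical helper: it only assembles the JSON object, it does not
    talk to a QMP socket.
    """
    msg = {"exec-oob": name}
    if arguments is not None:
        msg["arguments"] = arguments
    if cmd_id is not None:
        msg["id"] = cmd_id
    return msg


def encode(msg):
    """Serialize one QMP message as a single line of UTF-8 JSON."""
    return (json.dumps(msg) + "\r\n").encode("utf-8")


# Assumption: "yank" is sent without arguments; the real command may need some.
negotiation = qmp_capabilities_with_oob()
yank_request = oob_command("yank", cmd_id="yank-1")
wire = encode(yank_request)
```

The point of the out-of-band path is that such a command is handled outside the normal command queue, so it can still be processed when an earlier in-band command is stuck, which is the situation a hanging qemu creates.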

Run times on my laptop:
Quick test: 200s
Network test: 800s (tagged as slow)
Stress test: 1300s (tagged as slow)

The test suite needs access to a network bridge to properly test the network,
so some parameters need to be given to the test run. See
tests/acceptance/colo.py for more information.

I wonder how this integrates into the existing CI infrastructure. Is there a common
CI for qemu where this can run, or does every subsystem have to run its own
CI?

Regards,
Lukas Straub


Lukas Straub (5):
  block/quorum.c: stable children names
  colo: Introduce resource agent
  colo: Introduce high-level test suite
  configure,Makefile: Install colo resource-agent
  MAINTAINERS: Add myself as maintainer for COLO resource agent

 MAINTAINERS                              |    6 +
 Makefile                                 |    5 +
 block/quorum.c                           |   20 +-
 configure                                |   10 +
 scripts/colo-resource-agent/colo         | 1429 ++++++++++++++++++++++
 scripts/colo-resource-agent/crm_master   |   44 +
 scripts/colo-resource-agent/crm_resource |   12 +
 tests/acceptance/colo.py                 |  689 +++++++++++
 8 files changed, 2209 insertions(+), 6 deletions(-)
 create mode 100755 scripts/colo-resource-agent/colo
 create mode 100755 scripts/colo-resource-agent/crm_master
 create mode 100755 scripts/colo-resource-agent/crm_resource
 create mode 100644 tests/acceptance/colo.py

Comments

Zhang Chen May 18, 2020, 9:38 a.m. UTC | #1
Wow~ Very happy to see this series.
I have checked the "how to" in tests/acceptance/colo.py,
but it does not look sufficient for users. Can you write a separate document for this series?
It could include an ASCII diagram of the test infrastructure, the test case design, a detailed how-to, and more information about
the pacemaker cluster and the resource agent.

Thanks
Zhang Chen


Lukas Straub June 6, 2020, 6:59 p.m. UTC | #2
On Mon, 18 May 2020 09:38:24 +0000
"Zhang, Chen" <chen.zhang@intel.com> wrote:

> Wow~ Very happy to see this series.
> I have checked the "how to" in tests/acceptance/colo.py,
> but it does not look sufficient for users. Can you write a separate document for this series?
> It could include an ASCII diagram of the test infrastructure, the test case design, a detailed how-to, and more information about
> the pacemaker cluster and the resource agent.

Hi,
I quickly created a more complete howto for configuring a pacemaker cluster and using the resource agent. I hope it helps:
https://wiki.qemu.org/Features/COLO/Managed_HOWTO

Regards,
Lukas Straub

Zhang Chen June 16, 2020, 1:42 a.m. UTC | #3
> Hi,
> I quickly created a more complete howto for configuring a pacemaker cluster
> and using the resource agent. I hope it helps:
> https://wiki.qemu.org/Features/COLO/Managed_HOWTO

Hi Lukas,

I noticed you contributed some content to the QEMU COLO wiki.
Regarding the Features/COLO/Manual HOWTO:
https://wiki.qemu.org/Features/COLO/Manual_HOWTO

Why not keep the Secondary side start command the same as in qemu/docs/COLO-FT.txt?
If I understand correctly, adding the quorum-related command on the secondary will support resuming replication.
Then we can add the primary/secondary resume steps here.

Thanks
Zhang Chen

Lukas Straub June 19, 2020, 1:55 p.m. UTC | #4
On Tue, 16 Jun 2020 01:42:45 +0000
"Zhang, Chen" <chen.zhang@intel.com> wrote:

> Hi Lukas,
> 
> I noticed you contributed some content to the QEMU COLO wiki.
> Regarding the Features/COLO/Manual HOWTO:
> https://wiki.qemu.org/Features/COLO/Manual_HOWTO
> 
> Why not keep the Secondary side start command the same as in qemu/docs/COLO-FT.txt?
> If I understand correctly, adding the quorum-related command on the secondary will support resuming replication.
> Then we can add the primary/secondary resume steps here.

I haven't updated the wiki from qemu/docs/COLO-FT.txt yet; I just moved it there from the main page.

Regards,
Lukas Straub
