=== event tapping ===
Event tapping is the core component of Kemari: it decides at which event
the primary should synchronize with the secondary. The basic assumption
here is that outgoing I/O operations are idempotent, which is usually true
for disk I/O and reliable network protocols such as TCP.
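For illustration, here is a minimal sketch of what event tapping amounts
to. All names below are hypothetical, not the actual Kemari code; the
point is only that the outgoing event leaves the primary after the
secondary holds the matching state, so replaying it after failover is
safe as long as the operation is idempotent.

#include <stdbool.h>

struct io_request;                       /* an outgoing I/O event */

extern bool io_is_outgoing(struct io_request *req);
extern void vm_pause(void);
extern void vm_resume(void);
extern int  kemari_transfer_state(void); /* ship state to the secondary */
extern int  emit_io(struct io_request *req);

int kemari_tap_event(struct io_request *req)
{
    if (io_is_outgoing(req)) {
        vm_pause();                      /* freeze guest at the sync point */
        if (kemari_transfer_state() < 0) {
            vm_resume();
            return -1;                   /* secondary unreachable */
        }
        vm_resume();
    }
    return emit_io(req);                 /* released only after the sync */
}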
As discussed in the following thread, we may need to reconsider how and when to
start VM synchronization.
http://www.mail-archive.com/kvm@vger.kernel.org/msg31908.html
We would like to get as much feedback as possible on the current
implementation before moving on to the next approach.
TODO:
- virtio polling
- support for asynchronous I/O methods (eventfd)
=== sender / receiver ===
To synchronize virtual machines, all the dirty pages since the last
synchronization point, together with the state of the VCPUs and the
virtual devices, are sent to the fallback node from the user-space QEMU
process.
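As a rough sketch of one synchronization round (hypothetical names; the
flow is deliberately similar to live migration):

#include <stdbool.h>
#include <stdint.h>

#define PAGE_SIZE 4096ULL

extern uint64_t ram_size;
extern bool page_dirty_test_and_clear(uint64_t addr);
extern void send_page(int fd, uint64_t addr);
extern void send_vcpu_state(int fd);
extern void send_device_state(int fd);
extern int  wait_for_ack(int fd);

int kemari_sync(int fd)
{
    uint64_t addr;

    /* 1. RAM pages dirtied since the last synchronization point */
    for (addr = 0; addr < ram_size; addr += PAGE_SIZE) {
        if (page_dirty_test_and_clear(addr))
            send_page(fd, addr);
    }

    /* 2. VCPU registers and virtual device state */
    send_vcpu_state(fd);
    send_device_state(fd);

    /* 3. the tapped event is released only after this ack */
    return wait_for_ack(fd);
}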
TODO:
- Asynchronous VM transfer / pipelining (needed for SMP)
- Zero copy VM transfer
- VM transfer w/ RDMA
=== storage ===
Although Kemari needs some kind of shared storage, many users would
rather avoid it, and expect to use Kemari in conjunction with software
storage replication.
TODO:
- Integration with other non-shared disk cluster storage solutions
such as DRBD (might need changes to guarantee storage data
consistency at Kemari synchronization points).
- Integration with QEMU's block live migration functionality for
non-shared disk configurations.
=== integration with HA stack (Pacemaker/Corosync) ===
The failover process kicks in whenever a failure in the primary node is
detected. For Kemari for Xen, we have already finished an RA for
Heartbeat, and we are planning to integrate Kemari for KVM with the new
HA stacks (Pacemaker, RHCS, etc.).
Ideally, we would like to leverage the hardware failure detection
capabilities of newish x86 hardware to trigger failover, the idea
being that transferring control to the fallback node proactively
when a problem is detected is much faster than relying on the polling
mechanisms used by most HA software.
TODO:
- RA for Pacemaker.
- Consider both HW failure and SW failure scenarios (failover
between Kemari clusters).
- Make the necessary changes to Pacemaker/Corosync to support
event-driven failover (HW failure, etc.).
- Take advantage of the RAS capabilities of newer CPUs/motherboards
such as MCE to trigger failover.
- Detect failures in I/O devices (block I/O errors, etc).
=== clock ===
Since synchronizing the virtual machines every time the TSC is accessed
would be prohibitive, the transmission of the TSC will be done lazily,
which means delaying it until the next non-TSC synchronization point
arrives.
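A minimal sketch of the lazy scheme, with hypothetical fields and
helpers:

#include <stdbool.h>
#include <stdint.h>

struct kemari_clock {
    uint64_t pending_tsc;
    bool     tsc_dirty;
};

extern void send_tsc(int fd, uint64_t tsc);

void kemari_tsc_accessed(struct kemari_clock *c, uint64_t tsc)
{
    c->pending_tsc = tsc;
    c->tsc_dirty = true;        /* no synchronization here: too costly */
}

/* Called from the next non-TSC synchronization point. */
void kemari_flush_lazy_tsc(struct kemari_clock *c, int fd)
{
    if (c->tsc_dirty) {
        send_tsc(fd, c->pending_tsc);
        c->tsc_dirty = false;
    }
}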
TODO:
- Synchronization of clock sources (need to intercept TSC reads, etc).
=== usability ===
These are the items that define how users interact with Kemari.
TODO:
- Kemarid daemon that takes care of the cluster management/monitoring
side of things.
- Some device emulators might need minor modifications to work well
with Kemari. Use whitelisting/blacklisting to take the burden of
choosing the right device model off the users.
=== optimizations ===
Although the big picture can be realized by completing the TODO list
above, we need some optimizations/enhancements to make Kemari useful in
the real world; these are the items that need to be done for that.
TODO:
- SMP (for the sake of performance, we might need to implement a
synchronization protocol that can keep two or more
synchronization points active at any given moment)
- VGA (leverage VNC's sub-tiling mechanism to identify framebuffer pages
that are really dirty).
Any comments/suggestions would be greatly appreciated.
Thanks,
Yoshi
--
Kemari starts synchronizing VMs when QEMU handles I/O requests.
Without this patch, the VCPU state has already been advanced past the
I/O instruction before synchronization, and after failover the VM on
the receiver hangs because of this.
Signed-off-by: Yoshiaki Tamura <tamura.yoshiaki@lab.ntt.co.jp>
---
arch/x86/include/asm/kvm_host.h | 1 +
arch/x86/kvm/svm.c | 11 ++++++++---
arch/x86/kvm/vmx.c | 11 ++++++++---
arch/x86/kvm/x86.c | 4 ++++
4 files changed, 21 insertions(+), 6 deletions(-)
--- a/arch/x86/include/asm/kvm_host.h
+++ b/arch/x86/include/asm/kvm_host.h
@@ -227,6 +227,7 @@ struct kvm_pio_request {
int in;
int port;
int size;
+ bool lazy_skip;
};
/*
--- a/arch/x86/kvm/svm.c
+++ b/arch/x86/kvm/svm.c
@@ -1495,7 +1495,7 @@ static int io_interception(struct vcpu_svm *svm)
{
struct kvm_vcpu *vcpu = &svm->vcpu;
u32 io_info = svm->vmcb->control.exit_info_1; /* address size bug? */
- int size, in, string;
+ int size, in, string, ret;
unsigned port;
++svm->vcpu.stat.io_exits;
@@ -1507,9 +1507,14 @@ static int io_interception(struct vcpu_svm *svm)
port = io_info >> 16;
size = (io_info & SVM_IOIO_SIZE_MASK) >> SVM_IOIO_SIZE_SHIFT;
svm->next_rip = svm->vmcb->control.exit_info_2;
- skip_emulated_instruction(&svm->vcpu);
- return kvm_fast_pio_out(vcpu, size, port);
+ ret = kvm_fast_pio_out(vcpu, size, port);
+ if (ret)
+ skip_emulated_instruction(&svm->vcpu);
+ else
+ vcpu->arch.pio.lazy_skip = true;
+
+ return ret;
}
static int nmi_interception(struct vcpu_svm *svm)
--- a/arch/x86/kvm/vmx.c
+++ b/arch/x86/kvm/vmx.c
@@ -2975,7 +2975,7 @@ static int handle_triple_fault(struct kvm_vcpu *vcpu)
static int handle_io(struct kvm_vcpu *vcpu)
{
unsigned long exit_qualification;
- int size, in, string;
+ int size, in, string, ret;
unsigned port;
exit_qualification = vmcs_readl(EXIT_QUALIFICATION);
@@ -2989,9 +2989,14 @@ static int handle_io(struct kvm_vcpu *vcpu)
port = exit_qualification >> 16;
size = (exit_qualification & 7) + 1;
- skip_emulated_instruction(vcpu);
- return kvm_fast_pio_out(vcpu, size, port);
+ ret = kvm_fast_pio_out(vcpu, size, port);
+ if (ret)
+ skip_emulated_instruction(vcpu);
+ else
+ vcpu->arch.pio.lazy_skip = true;
+
+ return ret;
}
static void
--- a/arch/x86/kvm/x86.c
+++ b/arch/x86/kvm/x86.c
@@ -4544,6 +4544,10 @@ int kvm_arch_vcpu_ioctl_run(struct kvm_vcpu *vcpu, struct kvm_run *kvm_run)
if (!irqchip_in_kernel(vcpu->kvm))
kvm_set_cr8(vcpu, kvm_run->cr8);
+ if (vcpu->arch.pio.lazy_skip)
+ kvm_x86_ops->skip_emulated_instruction(vcpu);
+ vcpu->arch.pio.lazy_skip = false;
+
if (vcpu->arch.pio.count || vcpu->mmio_needed ||
vcpu->arch.emulate_ctxt.restart) {
if (vcpu->mmio_needed) {