From patchwork Sat Feb 27 07:54:59 2016 Content-Type: text/plain; charset="utf-8" MIME-Version: 1.0 Content-Transfer-Encoding: 7bit X-Patchwork-Submitter: Zhanghailiang X-Patchwork-Id: 8443741 Return-Path: X-Original-To: patchwork-qemu-devel@patchwork.kernel.org Delivered-To: patchwork-parsemail@patchwork2.web.kernel.org Received: from mail.kernel.org (mail.kernel.org [198.145.29.136]) by patchwork2.web.kernel.org (Postfix) with ESMTP id B5B33C0553 for ; Sat, 27 Feb 2016 07:56:01 +0000 (UTC) Received: from mail.kernel.org (localhost [127.0.0.1]) by mail.kernel.org (Postfix) with ESMTP id 4EF982042C for ; Sat, 27 Feb 2016 07:56:00 +0000 (UTC) Received: from lists.gnu.org (lists.gnu.org [208.118.235.17]) (using TLSv1 with cipher AES256-SHA (256/256 bits)) (No client certificate requested) by mail.kernel.org (Postfix) with ESMTPS id 974DE203DA for ; Sat, 27 Feb 2016 07:55:57 +0000 (UTC) Received: from localhost ([::1]:53733 helo=lists.gnu.org) by lists.gnu.org with esmtp (Exim 4.71) (envelope-from ) id 1aZZjI-0004lw-Lo for patchwork-qemu-devel@patchwork.kernel.org; Sat, 27 Feb 2016 02:55:56 -0500 Received: from eggs.gnu.org ([2001:4830:134:3::10]:36441) by lists.gnu.org with esmtp (Exim 4.71) (envelope-from ) id 1aZZj8-0004lh-77 for qemu-devel@nongnu.org; Sat, 27 Feb 2016 02:55:50 -0500 Received: from Debian-exim by eggs.gnu.org with spam-scanned (Exim 4.71) (envelope-from ) id 1aZZj4-0008J0-5x for qemu-devel@nongnu.org; Sat, 27 Feb 2016 02:55:46 -0500 Received: from szxga03-in.huawei.com ([119.145.14.66]:14542) by eggs.gnu.org with esmtp (Exim 4.71) (envelope-from ) id 1aZZj2-0008HP-V0 for qemu-devel@nongnu.org; Sat, 27 Feb 2016 02:55:42 -0500 Received: from 172.24.1.49 (EHLO szxeml433-hub.china.huawei.com) ([172.24.1.49]) by szxrg03-dlp.huawei.com (MOS 4.4.3-GA FastPath queued) with ESMTP id BWS21349; Sat, 27 Feb 2016 15:55:16 +0800 (CST) Received: from [127.0.0.1] (10.177.24.212) by szxeml433-hub.china.huawei.com (10.82.67.210) with Microsoft SMTP Server id 14.3.235.1; Sat, 27 Feb 2016 15:55:06 +0800 To: "Dr. David Alan Gilbert" References: <1456108832-24212-1-git-send-email-zhang.zhanghailiang@huawei.com> <20160225195232.GB18374@work-vm> <20160226163602.GM2161@work-vm> From: Hailiang Zhang Message-ID: <56D15653.90406@huawei.com> Date: Sat, 27 Feb 2016 15:54:59 +0800 User-Agent: Mozilla/5.0 (Windows NT 6.1; WOW64; rv:38.0) Gecko/20100101 Thunderbird/38.5.1 MIME-Version: 1.0 In-Reply-To: <20160226163602.GM2161@work-vm> X-Originating-IP: [10.177.24.212] X-CFilter-Loop: Reflected X-Mirapoint-Virus-RAPID-Raw: score=unknown(0), refid=str=0001.0A090205.56D15665.0066, ss=1, re=0.000, recu=0.000, reip=0.000, cl=1, cld=1, fgs=0, ip=0.0.0.0, so=2013-05-26 15:14:31, dmn=2013-03-21 17:37:32 X-Mirapoint-Loop-Id: a27dd8a3d5f4523d5ba805129b8aae39 X-detected-operating-system: by eggs.gnu.org: GNU/Linux 2.4.x-2.6.x [generic] X-Received-From: 119.145.14.66 Cc: xiecl.fnst@cn.fujitsu.com, lizhijian@cn.fujitsu.com, quintela@redhat.com, armbru@redhat.com, yunhong.jiang@intel.com, eddie.dong@intel.com, peter.huangpeng@huawei.com, qemu-devel@nongnu.org, arei.gonglei@huawei.com, stefanha@redhat.com, amit.shah@redhat.com, zhangchen.fnst@cn.fujitsu.com, hongyang.yang@easystack.cn Subject: Re: [Qemu-devel] [PATCH COLO-Frame v15 00/38] COarse-grain LOck-stepping(COLO) Virtual Machines for Non-stop Service (FT) X-BeenThere: qemu-devel@nongnu.org X-Mailman-Version: 2.1.14 Precedence: list List-Id: List-Unsubscribe: , List-Archive: List-Post: List-Help: List-Subscribe: , Errors-To: qemu-devel-bounces+patchwork-qemu-devel=patchwork.kernel.org@nongnu.org Sender: qemu-devel-bounces+patchwork-qemu-devel=patchwork.kernel.org@nongnu.org X-Spam-Status: No, score=-6.9 required=5.0 tests=BAYES_00, RCVD_IN_DNSWL_HI, UNPARSEABLE_RELAY autolearn=unavailable version=3.3.1 X-Spam-Checker-Version: SpamAssassin 3.3.1 (2010-03-16) on mail.kernel.org X-Virus-Scanned: ClamAV using ClamSMTP On 2016/2/27 0:36, Dr. David Alan Gilbert wrote: > * Dr. David Alan Gilbert (dgilbert@redhat.com) wrote: >> * zhanghailiang (zhang.zhanghailiang@huawei.com) wrote: >>> From: root >>> >>> This is the 15th version of COLO (Still only support periodic checkpoint). >>> >>> Here is only COLO frame part, you can get the whole codes from github: >>> https://github.com/coloft/qemu/commits/colo-v2.6-periodic-mode >>> >>> There are little changes for this series except the network releated part. >> >> I was looking at the time the guest is paused during COLO and >> was surprised to find one of the larger chunks was the time to reset >> the guest before loading each checkpoint; I've traced it part way, the >> biggest contributors for my test VM seem to be: >> >> 3.8ms pcibus_reset: VGA >> 1.8ms pcibus_reset: virtio-net-pci >> 1.5ms pcibus_reset: virtio-blk-pci >> 1.5ms qemu_devices_reset: piix4_reset >> 1.1ms pcibus_reset: piix3-ide >> 1.1ms pcibus_reset: virtio-rng-pci >> >> I've not looked deeper yet, but some of these are very silly; >> I'm running with -nographic so why it's taking 3.8ms to reset VGA is >> going to be interesting. >> Also, my only block device is the virtio-blk, so while I understand the >> standard PC machine has the IDE controller, why it takes it over a ms >> to reset an unused device. > > OK, so I've dug a bit deeper, and it appears that it's the changes in > PCI bars that actually take the time; every time we do a reset we > reset all the BARs, this causes it to do a pci_update_mappings and > end up doing a memory_region_del_subregion. > Then we load the config space of the PCI device as we do the vmstate_load, > and this recreates all the mappings again. > > I'm not sure what the fix is, but that sounds like it would > speed up the checkpoints usefully if we can avoid the map/remap when > they're the same. > Interesting, and thanks for your report. We already known qemu_system_reset() is a time-consuming function, we shouldn't call it here, but if we didn't do that, there will be a bug, which we have reported before in the previous COLO series, the bellow is the copy of the related patch comment: COLO VMstate: Load VM state into qsb before restore it We should not destroy the state of secondary until we receive the whole state from the primary, in case the primary fails in the middle of sending the state, so, here we cache the device state in Secondary before restore it. Besides, we should call qemu_system_reset() before load VM state, which can ensure the data is intact. Note: If we discard qemu_system_reset(), there will be some odd error, For exmple, qemu in slave side crashes and reports: KVM: entry failed, hardware error 0x7 EAX=00000000 EBX=0000e000 ECX=00009578 EDX=0000434f ESI=0000fc10 EDI=0000434f EBP=00000000 ESP=00001fca EIP=00009594 EFL=00010246 [---Z-P-] CPL=0 II=0 A20=1 SMM=0 HLT=0 ES =0040 00000400 0000ffff 00009300 CS =f000 000f0000 0000ffff 00009b00 SS =434f 000434f0 0000ffff 00009300 DS =434f 000434f0 0000ffff 00009300 FS =0000 00000000 0000ffff 00009300 GS =0000 00000000 0000ffff 00009300 LDT=0000 00000000 0000ffff 00008200 TR =0000 00000000 0000ffff 00008b00 GDT= 0002dcc8 00000047 IDT= 00000000 0000ffff CR0=00000010 CR2=ffffffff CR3=00000000 CR4=00000000 DR0=0000000000000000 DR1=0000000000000000 DR2=0000000000000000 DR3=0000000000000000 DR6=00000000ffff0ff0 DR7=0000000000000400 EFER=0000000000000000 Code=c0 74 0f 66 b9 78 95 00 00 66 31 d2 66 31 c0 e9 47 e0 fb 90 90 fa fc 66 c3 66 53 66 89 c3 ERROR: invalid runstate transition: 'internal-error' -> 'colo' The reason is, some of the device state will be ignored when saving device state to slave, if the corresponding data is in its initial value, such as 0. But the device state in slave maybe in initialized value, after a loop of checkpoint, there will be inconsistent for the value of device state. This will happen when the PVM reboot or SVM run ahead of PVM in the startup process. Signed-off-by: zhanghailiang Signed-off-by: Yang Hongyang Signed-off-by: Gonglei Reviewed-by: Dr. David Alan Gilbert env.async_pf_en_msr != 0; } Some other VMstate of registers in CPUX86State have the same problem, we can't make sure they won't cause any problems if the values of them are incorrect. So here, we just simply call qemu_system_reset() to avoid the inconsistent problem. Besides, compared with the most time-consuming operation (ram flushed from COLO cache to SVM). The time consuming for qemu_system_reset() seems to be acceptable ;) Another choice to fix the problem is to save the VMstate ignoring the needed() return value, but this method is not so graceful. Thanks, Hailiang > Dave > >> >> I guess reset is normally off anyones radar since it's outside >> the time anyone cares about, but I guess perhaps the guys trying >> to make qemu start really quickly would be interested. >> >> Dave >> >>> >>> Patch status: >>> Unreviewed: patch 21,27,28,29,33,38 >>> Updated: patch 31,34,35,37 >>> >>> TODO: >>> 1. Checkpoint based on proxy in qemu >>> 2. The capability of continuous FT >>> 3. Optimize the VM's downtime during checkpoint >>> >>> v15: >>> - Go on the shutdown process if encounter error while sending shutdown >>> message to SVM. (patch 24) >>> - Rename qemu_need_skip_netfilter to qemu_netfilter_can_skip and Remove >>> some useless comment. (patch 31, Jason) >>> - Call object_new_with_props() directly to add filter in >>> colo_add_buffer_filter. (patch 34, Jason) >>> - Re-implement colo_set_filter_status() based on COLOBufferFilters >>> list. (patch 35) >>> - Re-implement colo_flush_filter_packets() based on COLOBufferFilters >>> list. (patch 37) >>> v14: >>> - Re-implement the network processing based on netfilter (Jason Wang) >>> - Rename 'COLOCommand' to 'COLOMessage'. (Markus's suggestion) >>> - Split two new patches (patch 27/28) from patch 29 >>> - Fix some other comments from Dave and Markus. >>> >>> v13: >>> - Refactor colo_*_cmd helper functions to use 'Error **errp' parameter >>> instead of return value to indicate success or failure. (patch 10) >>> - Remove the optional error message for COLO_EXIT event. (patch 25) >>> - Use semaphore to notify colo/colo incoming loop that failover work is >>> finished. (patch 26) >>> - Move COLO shutdown related codes to colo.c file. (patch 28) >>> - Fix memory leak bug for colo incoming loop. (new patch 31) >>> - Re-use some existed helper functions to realize the process of >>> saving/loading ram and device. (patch 32) >>> - Fix some other comments from Dave and Markus. >>> >>> zhanghailiang (38): >>> configure: Add parameter for configure to enable/disable COLO support >>> migration: Introduce capability 'x-colo' to migration >>> COLO: migrate colo related info to secondary node >>> migration: Integrate COLO checkpoint process into migration >>> migration: Integrate COLO checkpoint process into loadvm >>> COLO/migration: Create a new communication path from destination to >>> source >>> COLO: Implement colo checkpoint protocol >>> COLO: Add a new RunState RUN_STATE_COLO >>> QEMUSizedBuffer: Introduce two help functions for qsb >>> COLO: Save PVM state to secondary side when do checkpoint >>> COLO: Load PVM's dirty pages into SVM's RAM cache temporarily >>> ram/COLO: Record the dirty pages that SVM received >>> COLO: Load VMState into qsb before restore it >>> COLO: Flush PVM's cached RAM into SVM's memory >>> COLO: Add checkpoint-delay parameter for migrate-set-parameters >>> COLO: synchronize PVM's state to SVM periodically >>> COLO failover: Introduce a new command to trigger a failover >>> COLO failover: Introduce state to record failover process >>> COLO: Implement failover work for Primary VM >>> COLO: Implement failover work for Secondary VM >>> qmp event: Add COLO_EXIT event to notify users while exited from COLO >>> COLO failover: Shutdown related socket fd when do failover >>> COLO failover: Don't do failover during loading VM's state >>> COLO: Process shutdown command for VM in COLO state >>> COLO: Update the global runstate after going into colo state >>> savevm: Introduce two helper functions for save/find loadvm_handlers >>> entry >>> migration/savevm: Add new helpers to process the different stages of >>> loadvm >>> migration/savevm: Export two helper functions for savevm process >>> COLO: Separate the process of saving/loading ram and device state >>> COLO: Split qemu_savevm_state_begin out of checkpoint process >>> net/filter: Add a 'status' property for filter object >>> filter-buffer: Accept zero interval >>> net: Add notifier/callback for netdev init >>> COLO/filter: add each netdev a buffer filter >>> COLO: manage the status of buffer filters for PVM >>> filter-buffer: make filter_buffer_flush() public >>> COLO: flush buffered packets in checkpoint process or exit COLO >>> COLO: Add block replication into colo process >>> >>> configure | 11 + >>> docs/qmp-events.txt | 16 + >>> hmp-commands.hx | 15 + >>> hmp.c | 15 + >>> hmp.h | 1 + >>> include/exec/ram_addr.h | 1 + >>> include/migration/colo.h | 42 ++ >>> include/migration/failover.h | 33 ++ >>> include/migration/migration.h | 16 + >>> include/migration/qemu-file.h | 3 +- >>> include/net/filter.h | 5 + >>> include/net/net.h | 4 + >>> include/sysemu/sysemu.h | 9 + >>> migration/Makefile.objs | 2 + >>> migration/colo-comm.c | 76 ++++ >>> migration/colo-failover.c | 83 ++++ >>> migration/colo.c | 866 ++++++++++++++++++++++++++++++++++++++++++ >>> migration/migration.c | 109 +++++- >>> migration/qemu-file-buf.c | 61 +++ >>> migration/ram.c | 175 ++++++++- >>> migration/savevm.c | 114 ++++-- >>> net/filter-buffer.c | 14 +- >>> net/filter.c | 40 ++ >>> net/net.c | 33 ++ >>> qapi-schema.json | 104 ++++- >>> qapi/event.json | 15 + >>> qemu-options.hx | 4 +- >>> qmp-commands.hx | 23 +- >>> stubs/Makefile.objs | 1 + >>> stubs/migration-colo.c | 54 +++ >>> trace-events | 8 + >>> vl.c | 31 +- >>> 32 files changed, 1908 insertions(+), 76 deletions(-) >>> create mode 100644 include/migration/colo.h >>> create mode 100644 include/migration/failover.h >>> create mode 100644 migration/colo-comm.c >>> create mode 100644 migration/colo-failover.c >>> create mode 100644 migration/colo.c >>> create mode 100644 stubs/migration-colo.c >>> >>> -- >>> 1.8.3.1 >>> >>> >> -- >> Dr. David Alan Gilbert / dgilbert@redhat.com / Manchester, UK > -- > Dr. David Alan Gilbert / dgilbert@redhat.com / Manchester, UK > > . > diff --git a/migration/vmstate.c b/migration/vmstate.c index e5388f0..7d15bba 100644 --- a/migration/vmstate.c +++ b/migration/vmstate.c @@ -409,7 +409,7 @@ static void vmstate_subsection_save(QEMUFile *f, const VMStateDescription *vmsd, bool subsection_found = false; while (sub && sub->needed) { - if (sub->needed(opaque)) { + if (sub->needed(opaque) || migrate_in_colo_state()) { const VMStateDescription *vmsd = sub->vmsd; uint8_t len;