Message ID | B2D15215269B544CADD246097EACE747395C283F@dggeml511-mbx.china.huawei.com (mailing list archive) |
---|---|
State | New, archived |
Hi Zhoujian, 2017-05-17 10:20 GMT+08:00 Zhoujian (jay) <jianjay.zhou@huawei.com>: > Hi Wanpeng, > >> > On 11/05/2017 14:07, Zhoujian (jay) wrote: >> >> - * Scan sptes if dirty logging has been stopped, dropping those >> >> - * which can be collapsed into a single large-page spte. Later >> >> - * page faults will create the large-page sptes. >> >> + * Reset each vcpu's mmu, then page faults will create the >> large-page >> >> + * sptes later. >> >> */ >> >> if ((change != KVM_MR_DELETE) && >> >> (old->flags & KVM_MEM_LOG_DIRTY_PAGES) && >> >> - !(new->flags & KVM_MEM_LOG_DIRTY_PAGES)) >> >> - kvm_mmu_zap_collapsible_sptes(kvm, new); >> >> This is an unlikely branch(unless guest live migration fails and continue >> to run on the source machine) instead of hot path, do you have any >> performance number for your real workloads? >> > > Sorry to bother you again. > > Recently, I have tested the performance before migration and after migration failure > using spec cpu2006 https://www.spec.org/cpu2006/, which is a standard performance > evaluation tool. 
>
> These are the results:
> ******
> Before migration the score is 153, and the TLB miss statistics of the qemu process is:
> linux-sjrfac:/mnt/zhoujian # perf stat -e dTLB-load-misses,dTLB-loads,dTLB-store-misses, \
> dTLB-stores,iTLB-load-misses,iTLB-loads -p 26463 sleep 10
>
> Performance counter stats for process id '26463':
>
>         698,938   dTLB-load-misses    #  0.13% of all dTLB cache hits   (50.46%)
>     543,303,875   dTLB-loads                                            (50.43%)
>         199,597   dTLB-store-misses                                     (16.51%)
>      60,128,561   dTLB-stores                                           (16.67%)
>          69,986   iTLB-load-misses    #  6.17% of all iTLB cache hits   (16.67%)
>       1,134,097   iTLB-loads                                            (33.33%)
>
>    10.000684064 seconds time elapsed
>
> After migration failure the score is 149, and the TLB miss statistics of the qemu process is:
> linux-sjrfac:/mnt/zhoujian # perf stat -e dTLB-load-misses,dTLB-loads,dTLB-store-misses, \
> dTLB-stores,iTLB-load-misses,iTLB-loads -p 26463 sleep 10
>
> Performance counter stats for process id '26463':
>
>         765,400   dTLB-load-misses    #  0.14% of all dTLB cache hits   (50.50%)
>     540,972,144   dTLB-loads                                            (50.47%)
>         207,670   dTLB-store-misses                                     (16.50%)
>      58,363,787   dTLB-stores                                           (16.67%)
>         109,772   iTLB-load-misses    #  9.52% of all iTLB cache hits   (16.67%)
>       1,152,784   iTLB-loads                                            (33.32%)
>
>    10.000703078 seconds time elapsed
> ******

Could you comment out the original "lazy collapse small sptes into large sptes" codes in the function kvm_arch_commit_memory_region() and post the results here?

Regards,
Wanpeng Li

>
> These are the steps:
> ======
> (1) the version of kmod is 4.4.11(with slightly modified) and the version of qemu is 2.6.0
> (with slightly modified), the kmod is applied with the following patch according to
> Paolo's advice:
>
> diff --git a/source/x86/x86.c b/source/x86/x86.c
> index 054a7d3..75a4bb3 100644
> --- a/source/x86/x86.c
> +++ b/source/x86/x86.c
> @@ -8550,8 +8550,10 @@ void kvm_arch_commit_memory_region(struct kvm *kvm,
>          */
>         if ((change != KVM_MR_DELETE) &&
>                 (old->flags & KVM_MEM_LOG_DIRTY_PAGES) &&
> -               !(new->flags & KVM_MEM_LOG_DIRTY_PAGES))
> -               kvm_mmu_zap_collapsible_sptes(kvm, new);
> +               !(new->flags & KVM_MEM_LOG_DIRTY_PAGES)) {
> +               printk(KERN_ERR "zj make KVM_REQ_MMU_RELOAD request\n");
> +               kvm_make_all_cpus_request(kvm, KVM_REQ_MMU_RELOAD);
> +       }
>
>         /*
>          * Set up write protection and/or dirty logging for the new slot.
>
> (2) I started up a memory preoccupied 10G VM(suse11sp3), which means its "RES column" in top is 10G,
> in order to set up the EPT table in advance.
> (3) And then, I run the test case 429.mcf of spec cpu2006 before migration and after migration failure.
> The 429.mcf is a memory intensive workload, and the migration failure is constructed deliberately
> with the following patch of qemu:
>
> diff --git a/migration/migration.c b/migration/migration.c
> index 5d725d0..88dfc59 100644
> --- a/migration/migration.c
> +++ b/migration/migration.c
> @@ -625,6 +625,9 @@ static void process_incoming_migration_co(void *opaque)
>                        MIGRATION_STATUS_ACTIVE);
>      ret = qemu_loadvm_state(f);
>
> +    // deliberately construct the migration failure
> +    exit(EXIT_FAILURE);
> +
>      ps = postcopy_state_get();
>      trace_process_incoming_migration_co_end(ret, ps);
>      if (ps != POSTCOPY_INCOMING_NONE) {
> ======
>
>
> Results of the score and TLB miss rate are almost the same, and I am confused.
> May I ask which tool do you use to evaluate the performance?
> And if my test steps are wrong, please let me know, thank you.
>
> Regards,
> Jay Zhou
> Recently, I have tested the performance before migration and after migration failure > using spec cpu2006 https://www.spec.org/cpu2006/, which is a standard performance > evaluation tool. > > These are the steps: > ====== > (1) the version of kmod is 4.4.11(with slightly modified) and the version of > qemu is 2.6.0 > (with slightly modified), the kmod is applied with the following patch > > diff --git a/source/x86/x86.c b/source/x86/x86.c > index 054a7d3..75a4bb3 100644 > --- a/source/x86/x86.c > +++ b/source/x86/x86.c > @@ -8550,8 +8550,10 @@ void kvm_arch_commit_memory_region(struct kvm *kvm, > */ > if ((change != KVM_MR_DELETE) && > (old->flags & KVM_MEM_LOG_DIRTY_PAGES) && > - !(new->flags & KVM_MEM_LOG_DIRTY_PAGES)) > - kvm_mmu_zap_collapsible_sptes(kvm, new); > + !(new->flags & KVM_MEM_LOG_DIRTY_PAGES)) { > + printk(KERN_ERR "zj make KVM_REQ_MMU_RELOAD request\n"); > + kvm_make_all_cpus_request(kvm, KVM_REQ_MMU_RELOAD); > + } > > /* > * Set up write protection and/or dirty logging for the new slot. Try these modifications to the setup: 1) set up 1G hugetlbfs hugepages and use those for the guest's memory 2) test both without and with the above patch. Thanks, Paolo
2017-05-17 15:43 GMT+08:00 Paolo Bonzini <pbonzini@redhat.com>: >> Recently, I have tested the performance before migration and after migration failure >> using spec cpu2006 https://www.spec.org/cpu2006/, which is a standard performance >> evaluation tool. >> >> These are the steps: >> ====== >> (1) the version of kmod is 4.4.11(with slightly modified) and the version of >> qemu is 2.6.0 >> (with slightly modified), the kmod is applied with the following patch >> >> diff --git a/source/x86/x86.c b/source/x86/x86.c >> index 054a7d3..75a4bb3 100644 >> --- a/source/x86/x86.c >> +++ b/source/x86/x86.c >> @@ -8550,8 +8550,10 @@ void kvm_arch_commit_memory_region(struct kvm *kvm, >> */ >> if ((change != KVM_MR_DELETE) && >> (old->flags & KVM_MEM_LOG_DIRTY_PAGES) && >> - !(new->flags & KVM_MEM_LOG_DIRTY_PAGES)) >> - kvm_mmu_zap_collapsible_sptes(kvm, new); >> + !(new->flags & KVM_MEM_LOG_DIRTY_PAGES)) { >> + printk(KERN_ERR "zj make KVM_REQ_MMU_RELOAD request\n"); >> + kvm_make_all_cpus_request(kvm, KVM_REQ_MMU_RELOAD); >> + } >> >> /* >> * Set up write protection and/or dirty logging for the new slot. > > Try these modifications to the setup: > > 1) set up 1G hugetlbfs hugepages and use those for the guest's memory > > 2) test both without and with the above patch. > In addition, we can compare /sys/kernel/debug/kvm/largepages w/ and w/o the patch. IIRC, /sys/kernel/debug/kvm/largepages will drop during live migration, it will keep a small value if live migration fails and w/o "lazy collapse small sptes into large sptes" codes, however, it will increase gradually if w/ the "lazy collapse small sptes into large sptes" codes. Regards, Wanpeng Li
Hi Paolo and Wanpeng, On 2017/5/17 16:38, Wanpeng Li wrote: > 2017-05-17 15:43 GMT+08:00 Paolo Bonzini <pbonzini@redhat.com>: >>> Recently, I have tested the performance before migration and after migration failure >>> using spec cpu2006 https://www.spec.org/cpu2006/, which is a standard performance >>> evaluation tool. >>> >>> These are the steps: >>> ====== >>> (1) the version of kmod is 4.4.11(with slightly modified) and the version of >>> qemu is 2.6.0 >>> (with slightly modified), the kmod is applied with the following patch >>> >>> diff --git a/source/x86/x86.c b/source/x86/x86.c >>> index 054a7d3..75a4bb3 100644 >>> --- a/source/x86/x86.c >>> +++ b/source/x86/x86.c >>> @@ -8550,8 +8550,10 @@ void kvm_arch_commit_memory_region(struct kvm *kvm, >>> */ >>> if ((change != KVM_MR_DELETE) && >>> (old->flags & KVM_MEM_LOG_DIRTY_PAGES) && >>> - !(new->flags & KVM_MEM_LOG_DIRTY_PAGES)) >>> - kvm_mmu_zap_collapsible_sptes(kvm, new); >>> + !(new->flags & KVM_MEM_LOG_DIRTY_PAGES)) { >>> + printk(KERN_ERR "zj make KVM_REQ_MMU_RELOAD request\n"); >>> + kvm_make_all_cpus_request(kvm, KVM_REQ_MMU_RELOAD); >>> + } >>> >>> /* >>> * Set up write protection and/or dirty logging for the new slot. >> >> Try these modifications to the setup: >> >> 1) set up 1G hugetlbfs hugepages and use those for the guest's memory >> >> 2) test both without and with the above patch. >> In order to avoid random memory allocation issues, I reran the test cases: (1) setup: start a 4U10G VM with memory preoccupied, each vcpu is pinned to a pcpu respectively, these resources(memory and pcpu) allocated to VM are all from NUMA node 0 (2) sequence: firstly, I run the 429.mcf of spec cpu2006 before migration, and get a result. And then, migration failure is constructed. At last, I run the test case again, and get an another result. 
(3) results:

Host hugepages           THP on(2M)     THP on(2M)     THP on(2M)     THP on(2M)
Patch                    patch1         patch2         patch3         -
Before migration         No             No             No             Yes
After migration failed   Yes            Yes            Yes            No
Largepages               67->1862       62->1890       95->1865       1926
score of 429.mcf         189            188            188            189

Host hugepages           1G hugepages   1G hugepages   1G hugepages   1G hugepages
Patch                    patch1         patch2         patch3         -
Before migration         No             No             No             Yes
After migration failed   Yes            Yes            Yes            No
Largepages               21             21             26             39
score of 429.mcf         188            188            186            188

Notes:
patch1 means with "lazy collapse small sptes into large sptes" codes
patch2 means comment out "lazy collapse small sptes into large sptes" codes
patch3 means using kvm_make_all_cpus_request(kvm, KVM_REQ_MMU_RELOAD)
       instead of kvm_mmu_zap_collapsible_sptes(kvm, new)

"Largepages" means the value of /sys/kernel/debug/kvm/largepages

> In addition, we can compare /sys/kernel/debug/kvm/largepages w/ and
> w/o the patch. IIRC, /sys/kernel/debug/kvm/largepages will drop during
> live migration, it will keep a small value if live migration fails and
> w/o "lazy collapse small sptes into large sptes" codes, however, it
> will increase gradually if w/ the "lazy collapse small sptes into
> large sptes" codes.
>

No, without the "lazy collapse small sptes into large sptes" codes,
/sys/kernel/debug/kvm/largepages does drop during live migration,
but it still will increase gradually if live migration fails, see the result
above.
I printed out the back trace when it increases after migration failure,

[139574.369098]  [<ffffffff81644a7f>] dump_stack+0x19/0x1b
[139574.369111]  [<ffffffffa02c3af6>] mmu_set_spte+0x2f6/0x310 [kvm]
[139574.369122]  [<ffffffffa02c4f7e>] __direct_map.isra.109+0x1de/0x250 [kvm]
[139574.369133]  [<ffffffffa02c8a76>] tdp_page_fault+0x246/0x280 [kvm]
[139574.369144]  [<ffffffffa02bf4e4>] kvm_mmu_page_fault+0x24/0x130 [kvm]
[139574.369148]  [<ffffffffa07c8116>] handle_ept_violation+0x96/0x170 [kvm_intel]
[139574.369153]  [<ffffffffa07cf949>] vmx_handle_exit+0x299/0xbf0 [kvm_intel]
[139574.369157]  [<ffffffff816559f0>] ? uv_bau_message_intr1+0x80/0x80
[139574.369161]  [<ffffffffa07cd5e0>] ? vmx_inject_irq+0xf0/0xf0 [kvm_intel]
[139574.369172]  [<ffffffffa02b35cd>] vcpu_enter_guest+0x76d/0x1160 [kvm]
[139574.369184]  [<ffffffffa02d9285>] ? kvm_apic_local_deliver+0x65/0x70 [kvm]
[139574.369196]  [<ffffffffa02bb125>] kvm_arch_vcpu_ioctl_run+0xd5/0x440 [kvm]
[139574.369205]  [<ffffffffa02a2b11>] kvm_vcpu_ioctl+0x2b1/0x640 [kvm]
[139574.369209]  [<ffffffff810e7852>] ? do_futex+0x122/0x5b0
[139574.369212]  [<ffffffff811fd9d5>] do_vfs_ioctl+0x2e5/0x4c0
[139574.369223]  [<ffffffffa02b0cf5>] ? kvm_on_user_return+0x75/0xb0 [kvm]
[139574.369225]  [<ffffffff811fdc51>] SyS_ioctl+0xa1/0xc0
[139574.369229]  [<ffffffff81654e09>] system_call_fastpath+0x16/0x1b

Any suggestion will be appreciated, Thanks!

Regards,
Jay Zhou
I do not know why i was removed from the list. On 05/19/2017 04:09 PM, Jay Zhou wrote: > Hi Paolo and Wanpeng, > > On 2017/5/17 16:38, Wanpeng Li wrote: >> 2017-05-17 15:43 GMT+08:00 Paolo Bonzini <pbonzini@redhat.com>: >>>> Recently, I have tested the performance before migration and after migration failure >>>> using spec cpu2006 https://www.spec.org/cpu2006/, which is a standard performance >>>> evaluation tool. >>>> >>>> These are the steps: >>>> ====== >>>> (1) the version of kmod is 4.4.11(with slightly modified) and the version of >>>> qemu is 2.6.0 >>>> (with slightly modified), the kmod is applied with the following patch >>>> >>>> diff --git a/source/x86/x86.c b/source/x86/x86.c >>>> index 054a7d3..75a4bb3 100644 >>>> --- a/source/x86/x86.c >>>> +++ b/source/x86/x86.c >>>> @@ -8550,8 +8550,10 @@ void kvm_arch_commit_memory_region(struct kvm *kvm, >>>> */ >>>> if ((change != KVM_MR_DELETE) && >>>> (old->flags & KVM_MEM_LOG_DIRTY_PAGES) && >>>> - !(new->flags & KVM_MEM_LOG_DIRTY_PAGES)) >>>> - kvm_mmu_zap_collapsible_sptes(kvm, new); >>>> + !(new->flags & KVM_MEM_LOG_DIRTY_PAGES)) { >>>> + printk(KERN_ERR "zj make KVM_REQ_MMU_RELOAD request\n"); >>>> + kvm_make_all_cpus_request(kvm, KVM_REQ_MMU_RELOAD); >>>> + } >>>> >>>> /* >>>> * Set up write protection and/or dirty logging for the new slot. >>> >>> Try these modifications to the setup: >>> >>> 1) set up 1G hugetlbfs hugepages and use those for the guest's memory >>> >>> 2) test both without and with the above patch. >>> > > In order to avoid random memory allocation issues, I reran the test cases: > (1) setup: start a 4U10G VM with memory preoccupied, each vcpu is pinned to a pcpu respectively, these resources(memory and pcpu) allocated to VM are all from NUMA node 0 > (2) sequence: firstly, I run the 429.mcf of spec cpu2006 before migration, and get a result. And then, migration failure is constructed. At last, I run the test case again, and get an another result. 
I guess this test case purely writes memory, which means the read-only mappings will always be dropped on #PF and huge mappings are then established. If you benchmark memory reads, you should observe the difference. Thanks!
Hi Xiao, On 2017/5/19 16:32, Xiao Guangrong wrote: > > I do not know why i was removed from the list. I was CCed to you... Your comments are very valuable to us, and thank for your quick response. > > On 05/19/2017 04:09 PM, Jay Zhou wrote: >> Hi Paolo and Wanpeng, >> >> On 2017/5/17 16:38, Wanpeng Li wrote: >>> 2017-05-17 15:43 GMT+08:00 Paolo Bonzini <pbonzini@redhat.com>: >>>>> Recently, I have tested the performance before migration and after >>>>> migration failure >>>>> using spec cpu2006 https://www.spec.org/cpu2006/, which is a standard >>>>> performance >>>>> evaluation tool. >>>>> >>>>> These are the steps: >>>>> ====== >>>>> (1) the version of kmod is 4.4.11(with slightly modified) and the >>>>> version of >>>>> qemu is 2.6.0 >>>>> (with slightly modified), the kmod is applied with the following patch >>>>> >>>>> diff --git a/source/x86/x86.c b/source/x86/x86.c >>>>> index 054a7d3..75a4bb3 100644 >>>>> --- a/source/x86/x86.c >>>>> +++ b/source/x86/x86.c >>>>> @@ -8550,8 +8550,10 @@ void kvm_arch_commit_memory_region(struct kvm *kvm, >>>>> */ >>>>> if ((change != KVM_MR_DELETE) && >>>>> (old->flags & KVM_MEM_LOG_DIRTY_PAGES) && >>>>> - !(new->flags & KVM_MEM_LOG_DIRTY_PAGES)) >>>>> - kvm_mmu_zap_collapsible_sptes(kvm, new); >>>>> + !(new->flags & KVM_MEM_LOG_DIRTY_PAGES)) { >>>>> + printk(KERN_ERR "zj make KVM_REQ_MMU_RELOAD request\n"); >>>>> + kvm_make_all_cpus_request(kvm, KVM_REQ_MMU_RELOAD); >>>>> + } >>>>> >>>>> /* >>>>> * Set up write protection and/or dirty logging for the new slot. >>>> >>>> Try these modifications to the setup: >>>> >>>> 1) set up 1G hugetlbfs hugepages and use those for the guest's memory >>>> >>>> 2) test both without and with the above patch. 
>>>>
>>
>> In order to avoid random memory allocation issues, I reran the test cases:
>> (1) setup: start a 4U10G VM with memory preoccupied, each vcpu is pinned to a
>> pcpu respectively, these resources(memory and pcpu) allocated to VM are all
>> from NUMA node 0
>> (2) sequence: firstly, I run the 429.mcf of spec cpu2006 before migration,
>> and get a result. And then, migration failure is constructed. At last, I run
>> the test case again, and get an another result.
>
> I guess this case purely writes the memory, that means the readonly mappings will

Yes, I printed out the dirty page rate; it is about 1GB per second.

> always be dropped by #PF, then huge mappings are established.
>
> If benchmark memory read, you show observe its difference.
>

OK, thanks for your suggestion!

Regards,
Jay Zhou
On Fri, 19 May 2017 at 16:10, Jay Zhou <jianjay.zhou@huawei.com> wrote: > > Hi Paolo and Wanpeng, > > On 2017/5/17 16:38, Wanpeng Li wrote: > > 2017-05-17 15:43 GMT+08:00 Paolo Bonzini <pbonzini@redhat.com>: > >>> Recently, I have tested the performance before migration and after migration failure > >>> using spec cpu2006 https://www.spec.org/cpu2006/, which is a standard performance > >>> evaluation tool. > >>> > >>> These are the steps: > >>> ====== > >>> (1) the version of kmod is 4.4.11(with slightly modified) and the version of > >>> qemu is 2.6.0 > >>> (with slightly modified), the kmod is applied with the following patch > >>> > >>> diff --git a/source/x86/x86.c b/source/x86/x86.c > >>> index 054a7d3..75a4bb3 100644 > >>> --- a/source/x86/x86.c > >>> +++ b/source/x86/x86.c > >>> @@ -8550,8 +8550,10 @@ void kvm_arch_commit_memory_region(struct kvm *kvm, > >>> */ > >>> if ((change != KVM_MR_DELETE) && > >>> (old->flags & KVM_MEM_LOG_DIRTY_PAGES) && > >>> - !(new->flags & KVM_MEM_LOG_DIRTY_PAGES)) > >>> - kvm_mmu_zap_collapsible_sptes(kvm, new); > >>> + !(new->flags & KVM_MEM_LOG_DIRTY_PAGES)) { > >>> + printk(KERN_ERR "zj make KVM_REQ_MMU_RELOAD request\n"); > >>> + kvm_make_all_cpus_request(kvm, KVM_REQ_MMU_RELOAD); > >>> + } > >>> > >>> /* > >>> * Set up write protection and/or dirty logging for the new slot. > >> > >> Try these modifications to the setup: > >> > >> 1) set up 1G hugetlbfs hugepages and use those for the guest's memory > >> > >> 2) test both without and with the above patch. > >> > > In order to avoid random memory allocation issues, I reran the test cases: > (1) setup: start a 4U10G VM with memory preoccupied, each vcpu is pinned to a > pcpu respectively, these resources(memory and pcpu) allocated to VM are all > from NUMA node 0 > (2) sequence: firstly, I run the 429.mcf of spec cpu2006 before migration, and > get a result. And then, migration failure is constructed. At last, I run the > test case again, and get an another result. 
> (3) results: > Host hugepages THP on(2M) THP on(2M) THP on(2M) THP on(2M) > Patch patch1 patch2 patch3 - > Before migration No No No Yes > After migration failed Yes Yes Yes No > Largepages 67->1862 62->1890 95->1865 1926 > score of 429.mcf 189 188 188 189 > > Host hugepages 1G hugepages 1G hugepages 1G hugepages 1G hugepages > Patch patch1 patch2 patch3 - > Before migration No No No Yes > After migration failed Yes Yes Yes No > Largepages 21 21 26 39 > score of 429.mcf 188 188 186 188 > > Notes: > patch1 means with "lazy collapse small sptes into large sptes" codes > patch2 means comment out "lazy collapse small sptes into large sptes" codes > patch3 means using kvm_make_all_cpus_request(kvm, KVM_REQ_MMU_RELOAD) > instead of kvm_mmu_zap_collapsible_sptes(kvm, new) > > "Largepages" means the value of /sys/kernel/debug/kvm/largepages > > > In addition, we can compare /sys/kernel/debug/kvm/largepages w/ and > > w/o the patch. IIRC, /sys/kernel/debug/kvm/largepages will drop during > > live migration, it will keep a small value if live migration fails and > > w/o "lazy collapse small sptes into large sptes" codes, however, it > > will increase gradually if w/ the "lazy collapse small sptes into > > large sptes" codes. > > > > No, without the "lazy collapse small sptes into large sptes" codes, > /sys/kernel/debug/kvm/largepages does drop during live migration, > but it still will increase gradually if live migration fails, see the result > above. 
I printed out the back trace when it increases after migration failure, > > [139574.369098] [<ffffffff81644a7f>] dump_stack+0x19/0x1b > [139574.369111] [<ffffffffa02c3af6>] mmu_set_spte+0x2f6/0x310 [kvm] > [139574.369122] [<ffffffffa02c4f7e>] __direct_map.isra.109+0x1de/0x250 [kvm] > [139574.369133] [<ffffffffa02c8a76>] tdp_page_fault+0x246/0x280 [kvm] > [139574.369144] [<ffffffffa02bf4e4>] kvm_mmu_page_fault+0x24/0x130 [kvm] > [139574.369148] [<ffffffffa07c8116>] handle_ept_violation+0x96/0x170 [kvm_intel] > [139574.369153] [<ffffffffa07cf949>] vmx_handle_exit+0x299/0xbf0 [kvm_intel] > [139574.369157] [<ffffffff816559f0>] ? uv_bau_message_intr1+0x80/0x80 > [139574.369161] [<ffffffffa07cd5e0>] ? vmx_inject_irq+0xf0/0xf0 [kvm_intel] > [139574.369172] [<ffffffffa02b35cd>] vcpu_enter_guest+0x76d/0x1160 [kvm] > [139574.369184] [<ffffffffa02d9285>] ? kvm_apic_local_deliver+0x65/0x70 [kvm] > [139574.369196] [<ffffffffa02bb125>] kvm_arch_vcpu_ioctl_run+0xd5/0x440 [kvm] > [139574.369205] [<ffffffffa02a2b11>] kvm_vcpu_ioctl+0x2b1/0x640 [kvm] > [139574.369209] [<ffffffff810e7852>] ? do_futex+0x122/0x5b0 > [139574.369212] [<ffffffff811fd9d5>] do_vfs_ioctl+0x2e5/0x4c0 > [139574.369223] [<ffffffffa02b0cf5>] ? kvm_on_user_return+0x75/0xb0 [kvm] > [139574.369225] [<ffffffff811fdc51>] SyS_ioctl+0xa1/0xc0 > [139574.369229] [<ffffffff81654e09>] system_call_fastpath+0x16/0x1b > > Any suggestion will be appreciated, Thanks! 
I found some time to figure it out. Here is a simple program to reproduce it in the guest:

#include <stdio.h>
#include <stdlib.h>
#include <string.h>
#include <stdint.h>
#include <pthread.h>
#include <unistd.h>

#define BUFSIZE (1024 * 1024)

int useconds = 0;
int mbytes = 0;

void *memory_write(void *arg)
{
    int i = (int)(intptr_t)arg;    /* thread index, passed by value */
    int j = 0;
    size_t len = (size_t)mbytes * BUFSIZE;
    char *p_buf = malloc(len);

    if (p_buf == NULL) {
        printf("thread %d: malloc failed\n", i);
        return NULL;
    }

    /* use the memory once so that it is really backed */
    memset(p_buf, 0, len);

    printf("thread: %d\n", i);

    while (1) {
        /* dirty only the first 100 bytes of each 1MB chunk */
        for (j = 0; j < mbytes; j++)
            memset(&p_buf[(size_t)j * BUFSIZE], 0, 100);
        usleep(useconds);
    }
}

int main(int argc, const char *argv[])
{
    int i = 0;
    int ret = 0;
    int threads = 0;
    pthread_t tid;

    if (argc < 4) {
        printf("usage: %s <mbytes> <threads> <useconds>\n", argv[0]);
        return 1;
    }

    mbytes = atoi(argv[1]);
    threads = atoi(argv[2]);
    useconds = atoi(argv[3]);

    if (mbytes == 0 || threads == 0 || useconds == 0) {
        printf("get mbytes or threads or useconds error\n");
        return 1;
    }

    printf("mbytes:%dm, thread:%d, useconds:%d\n", mbytes, threads, useconds);

    for (i = 0; i < threads; i++) {
        ret = pthread_create(&tid, NULL, memory_write, (void *)(intptr_t)i);
        if (ret) {
            printf("Create pthread error!\n");
            return 1;
        }
    }

    sleep(1000000);
    return 0;
}

I ran ./a.out 100 50 2, which spawns 50 threads; each allocates 100MB and sleeps 2us after each round of writing. In addition, it dirties only 100 bytes (which occupy a single 4KB page) of each 1MB of memory. The large sptes are dropped in the EPT violation path, since the large sptes are write-protected during live migration, and small sptes are populated in the process. However, in the above setup only 2 small sptes are populated for each 2MB memory range, so after migration fails there are no further EPT violations and the small sptes are not replaced by large sptes, since the 2 small sptes remain populated.
If I stop a.out and run it a second time, its memory is reallocated and probably lands on other gfns. The small sptes are replaced by large sptes during this process, since the sptes of the new gfns (the remaining sptes in each 2MB range, apart from the 2 populated before) are empty and the EPT violation path figures out that the range is backed by a huge page.

I did another test, replacing the 100 bytes with BUFSIZE so that the whole 1MB is dirtied. This results in all the small sptes being populated, and they are no longer replaced by large sptes after migration fails.

For the 429.mcf testcase of spec cpu2006, the RES is 10GB. I guess the whole of each 2MB range is not accessed simultaneously: during the EPT violations, most large sptes are dropped, part of each 2MB range is accessed, and small sptes are populated. The small sptes will be dropped and replaced by large sptes in the EPT violation path if another part of each 2MB range is accessed after migration fails.

Regards,
Wanpeng Li