Message ID: 20200621021004.5559-1-dereksu@qnap.com (mailing list archive)
Series: COLO: migrate dirty ram pages before colo checkpoint
On Sun, 21 Jun 2020 10:10:03 +0800
Derek Su <dereksu@qnap.com> wrote:

> This series is to reduce the guest's downtime during colo checkpoint
> by migrating as many dirty ram pages as possible before the colo checkpoint.
>
> If the iteration count reaches COLO_RAM_MIGRATE_ITERATION_MAX or the
> ram pending size drops below 'x-colo-migrate-ram-threshold',
> stop the ram migration and do the colo checkpoint.
>
> Test environment:
> Both the primary VM and the secondary VM have 1 GiB of ram and a 10 GbE NIC
> for FT traffic.
> One fio buffered write job runs on the guest.
> The result shows the total primary VM downtime is decreased by ~40%.
>
> Please help to review it; suggestions are welcome.
> Thanks.

Hello Derek,
Sorry for the late reply.

I think this is not a good idea, because it unnecessarily introduces a delay
between the checkpoint request and the checkpoint itself and thus impairs
network-bound workloads through increased network latency. Workloads that are
independent of the network don't cause many checkpoints anyway, so it doesn't
help there either.

Hailiang did have a patch to migrate ram between checkpoints, which should
help all workloads, but it wasn't merged back then. I think you can pick it
up again, rebase it and address David's and Eric's comments:
https://lore.kernel.org/qemu-devel/20200217012049.22988-3-zhang.zhanghailiang@huawei.com/T/#u

Hailiang, are you ok with that?

Regards,
Lukas Straub
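For readers skimming the archive, here is a minimal, self-contained C sketch of the loop the cover letter describes: migrate dirty RAM in bounded iterations before the checkpoint, stopping when either an iteration cap or a pending-size threshold is reached. Only COLO_RAM_MIGRATE_ITERATION_MAX and 'x-colo-migrate-ram-threshold' come from the series itself; the struct and helper names below are hypothetical stand-ins, not QEMU APIs.

```c
#include <stdint.h>

#define COLO_RAM_MIGRATE_ITERATION_MAX 10   /* illustrative value, not taken from the patch */

struct colo_state {
    uint64_t ram_pending;            /* bytes of dirty RAM still to send */
    uint64_t migrate_ram_threshold;  /* 'x-colo-migrate-ram-threshold' */
};

/* Hypothetical helper: send one round of dirty pages while the guest keeps
 * running, and return how many dirty bytes remain afterwards. */
static uint64_t migrate_dirty_ram_iteration(struct colo_state *s)
{
    /* ...would call into the live-migration RAM code here... */
    return s->ram_pending;
}

/* Keep pushing dirty RAM to the secondary before the checkpoint, stopping
 * on either of the two conditions named in the cover letter. */
static void migrate_ram_before_checkpoint(struct colo_state *s)
{
    for (int i = 0; i < COLO_RAM_MIGRATE_ITERATION_MAX; i++) {
        if (s->ram_pending < s->migrate_ram_threshold) {
            break;  /* little enough left; take the checkpoint now */
        }
        s->ram_pending = migrate_dirty_ram_iteration(s);
    }
    /* the caller proceeds with the normal COLO checkpoint from here */
}
```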
Hi Lukas Straub & Derek,

Sorry for the late reply, too busy these days ;)

> -----Original Message-----
> From: Lukas Straub [mailto:lukasstraub2@web.de]
> Sent: Friday, July 31, 2020 3:52 PM
> To: Derek Su <dereksu@qnap.com>
> Cc: qemu-devel@nongnu.org; Zhanghailiang <zhang.zhanghailiang@huawei.com>; chyang@qnap.com; quintela@redhat.com; dgilbert@redhat.com; ctcheng@qnap.com; jwsu1986@gmail.com
> Subject: Re: [PATCH v1 0/1] COLO: migrate dirty ram pages before colo checkpoint
>
> On Sun, 21 Jun 2020 10:10:03 +0800
> Derek Su <dereksu@qnap.com> wrote:
>
> > This series is to reduce the guest's downtime during colo checkpoint
> > by migrating as many dirty ram pages as possible before the colo checkpoint.
> >
> > If the iteration count reaches COLO_RAM_MIGRATE_ITERATION_MAX or the
> > ram pending size drops below 'x-colo-migrate-ram-threshold',
> > stop the ram migration and do the colo checkpoint.
> >
> > Test environment:
> > Both the primary VM and the secondary VM have 1 GiB of ram and a 10 GbE NIC
> > for FT traffic.
> > One fio buffered write job runs on the guest.
> > The result shows the total primary VM downtime is decreased by ~40%.
> >
> > Please help to review it; suggestions are welcome.
> > Thanks.
>
> Hello Derek,
> Sorry for the late reply.
> I think this is not a good idea, because it unnecessarily introduces a delay
> between the checkpoint request and the checkpoint itself and thus impairs
> network-bound workloads through increased network latency. Workloads that are
> independent of the network don't cause many checkpoints anyway, so it doesn't
> help there either.
>

Agreed. Although this seems to reduce the VM's downtime during the checkpoint,
it doesn't help reduce network latency: the network packets that differ between
the SVM and PVM are what triggered this checkpoint request, and they will be
blocked until the checkpoint process finishes.

> Hailiang did have a patch to migrate ram between checkpoints, which should
> help all workloads, but it wasn't merged back then. I think you can pick it
> up again, rebase it and address David's and Eric's comments:
> https://lore.kernel.org/qemu-devel/20200217012049.22988-3-zhang.zhanghailiang@huawei.com/T/#u
>

The second patch was not merged; it can help reduce the downtime.

> Hailiang, are you ok with that?
>

Yes. @Derek, please feel free to pick it up if you would like to ;)

Thanks,
Hailiang

> Regards,
> Lukas Straub
On Fri, Jul 31, 2020 at 3:52 PM Lukas Straub <lukasstraub2@web.de> wrote:
>
> On Sun, 21 Jun 2020 10:10:03 +0800
> Derek Su <dereksu@qnap.com> wrote:
>
> > This series is to reduce the guest's downtime during colo checkpoint
> > by migrating as many dirty ram pages as possible before the colo checkpoint.
> >
> > If the iteration count reaches COLO_RAM_MIGRATE_ITERATION_MAX or the
> > ram pending size drops below 'x-colo-migrate-ram-threshold',
> > stop the ram migration and do the colo checkpoint.
> >
> > Test environment:
> > Both the primary VM and the secondary VM have 1 GiB of ram and a 10 GbE NIC
> > for FT traffic.
> > One fio buffered write job runs on the guest.
> > The result shows the total primary VM downtime is decreased by ~40%.
> >
> > Please help to review it; suggestions are welcome.
> > Thanks.
>
> Hello Derek,
> Sorry for the late reply.
> I think this is not a good idea, because it unnecessarily introduces a delay
> between the checkpoint request and the checkpoint itself and thus impairs
> network-bound workloads through increased network latency. Workloads that are
> independent of the network don't cause many checkpoints anyway, so it doesn't
> help there either.
>

Hello, Lukas & Zhanghailiang

Thanks for your opinions.
I went through my patch, and I am a little confused and would like to dig into
it more.

In this patch, colo_migrate_ram_before_checkpoint() runs before
COLO_MESSAGE_CHECKPOINT_REQUEST, so the SVM and PVM should not enter the
paused state.

In the meantime, the packets to the PVM/SVM can still be compared, and an
inconsistency is reported if they mismatch, right?
Is it possible for this to introduce extra network latency?

In my test (randwrite to disk by fio with direct=0), the ping times from
another client to the PVM, with generic colo and with colo using this patch,
are shown below. The network latency does not increase, as I expected.

generic colo
```
64 bytes from 192.168.80.18: icmp_seq=87 ttl=64 time=28.109 ms
64 bytes from 192.168.80.18: icmp_seq=88 ttl=64 time=16.747 ms
64 bytes from 192.168.80.18: icmp_seq=89 ttl=64 time=2388.779 ms <----checkpoint start
64 bytes from 192.168.80.18: icmp_seq=90 ttl=64 time=1385.792 ms
64 bytes from 192.168.80.18: icmp_seq=91 ttl=64 time=384.896 ms <----checkpoint end
64 bytes from 192.168.80.18: icmp_seq=92 ttl=64 time=3.895 ms
64 bytes from 192.168.80.18: icmp_seq=93 ttl=64 time=1.020 ms
64 bytes from 192.168.80.18: icmp_seq=94 ttl=64 time=0.865 ms
64 bytes from 192.168.80.18: icmp_seq=95 ttl=64 time=0.854 ms
64 bytes from 192.168.80.18: icmp_seq=96 ttl=64 time=28.359 ms
64 bytes from 192.168.80.18: icmp_seq=97 ttl=64 time=12.309 ms
64 bytes from 192.168.80.18: icmp_seq=98 ttl=64 time=0.870 ms
64 bytes from 192.168.80.18: icmp_seq=99 ttl=64 time=2371.733 ms
64 bytes from 192.168.80.18: icmp_seq=100 ttl=64 time=1371.440 ms
64 bytes from 192.168.80.18: icmp_seq=101 ttl=64 time=366.414 ms
64 bytes from 192.168.80.18: icmp_seq=102 ttl=64 time=0.818 ms
64 bytes from 192.168.80.18: icmp_seq=103 ttl=64 time=0.997 ms
```

colo with this patch
```
64 bytes from 192.168.80.18: icmp_seq=72 ttl=64 time=1.417 ms
64 bytes from 192.168.80.18: icmp_seq=73 ttl=64 time=0.931 ms
64 bytes from 192.168.80.18: icmp_seq=74 ttl=64 time=0.876 ms
64 bytes from 192.168.80.18: icmp_seq=75 ttl=64 time=1184.034 ms <----checkpoint start
64 bytes from 192.168.80.18: icmp_seq=76 ttl=64 time=181.297 ms <----checkpoint end
64 bytes from 192.168.80.18: icmp_seq=77 ttl=64 time=0.865 ms
64 bytes from 192.168.80.18: icmp_seq=78 ttl=64 time=0.858 ms
64 bytes from 192.168.80.18: icmp_seq=79 ttl=64 time=1.247 ms
64 bytes from 192.168.80.18: icmp_seq=80 ttl=64 time=0.946 ms
64 bytes from 192.168.80.18: icmp_seq=81 ttl=64 time=0.855 ms
64 bytes from 192.168.80.18: icmp_seq=82 ttl=64 time=0.868 ms
64 bytes from 192.168.80.18: icmp_seq=83 ttl=64 time=0.749 ms
64 bytes from 192.168.80.18: icmp_seq=84 ttl=64 time=2.154 ms
64 bytes from 192.168.80.18: icmp_seq=85 ttl=64 time=1499.186 ms
64 bytes from 192.168.80.18: icmp_seq=86 ttl=64 time=496.173 ms
64 bytes from 192.168.80.18: icmp_seq=87 ttl=64 time=0.854 ms
64 bytes from 192.168.80.18: icmp_seq=88 ttl=64 time=0.774 ms
```

Thank you.

Regards,
Derek

> Hailiang did have a patch to migrate ram between checkpoints, which should
> help all workloads, but it wasn't merged back then. I think you can pick it
> up again, rebase it and address David's and Eric's comments:
> https://lore.kernel.org/qemu-devel/20200217012049.22988-3-zhang.zhanghailiang@huawei.com/T/#u
>
> Hailiang, are you ok with that?
>
> Regards,
> Lukas Straub
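To make the ordering point above concrete, here is a heavily abridged sketch (not the real QEMU code) of where the proposed call sits in the primary-side checkpoint path: the extra RAM migration happens before COLO_MESSAGE_CHECKPOINT_REQUEST is sent, i.e. while both VMs still run and colo-compare keeps comparing packets. All function names below are hypothetical stubs standing in for the real COLO primitives.

```c
/* Hypothetical stubs standing in for the real COLO primitives. */
static void colo_migrate_ram_before_checkpoint(void) { /* trickle dirty RAM to the SVM */ }
static void send_checkpoint_request(void)            { /* COLO_MESSAGE_CHECKPOINT_REQUEST */ }
static void pause_and_sync_vms(void)                 { /* stop VMs, send state, resume */ }

/* Abridged ordering of the primary-side checkpoint transaction with the
 * proposed patch applied. */
static void checkpoint_transaction_sketch(void)
{
    /* Patch: both VMs are still running here, so colo-compare keeps
     * comparing packets and can still flag a mismatch. */
    colo_migrate_ram_before_checkpoint();

    /* Unchanged flow: only from this point on are the VMs paused. */
    send_checkpoint_request();
    pause_and_sync_vms();
}
```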
> -----Original Message-----
> From: Derek Su [mailto:jwsu1986@gmail.com]
> Sent: Thursday, August 13, 2020 6:28 PM
> To: Lukas Straub <lukasstraub2@web.de>
> Cc: Derek Su <dereksu@qnap.com>; qemu-devel@nongnu.org; Zhanghailiang <zhang.zhanghailiang@huawei.com>; chyang@qnap.com; quintela@redhat.com; dgilbert@redhat.com; ctcheng@qnap.com
> Subject: Re: [PATCH v1 0/1] COLO: migrate dirty ram pages before colo checkpoint
>
> On Fri, Jul 31, 2020 at 3:52 PM Lukas Straub <lukasstraub2@web.de> wrote:
> >
> > On Sun, 21 Jun 2020 10:10:03 +0800
> > Derek Su <dereksu@qnap.com> wrote:
> >
> > > This series is to reduce the guest's downtime during colo checkpoint
> > > by migrating as many dirty ram pages as possible before the colo checkpoint.
> > >
> > > If the iteration count reaches COLO_RAM_MIGRATE_ITERATION_MAX or the
> > > ram pending size drops below 'x-colo-migrate-ram-threshold',
> > > stop the ram migration and do the colo checkpoint.
> > >
> > > Test environment:
> > > Both the primary VM and the secondary VM have 1 GiB of ram and a 10 GbE NIC
> > > for FT traffic.
> > > One fio buffered write job runs on the guest.
> > > The result shows the total primary VM downtime is decreased by ~40%.
> > >
> > > Please help to review it; suggestions are welcome.
> > > Thanks.
> >
> > Hello Derek,
> > Sorry for the late reply.
> > I think this is not a good idea, because it unnecessarily introduces a delay
> > between the checkpoint request and the checkpoint itself and thus impairs
> > network-bound workloads through increased network latency. Workloads that are
> > independent of the network don't cause many checkpoints anyway, so it doesn't
> > help there either.
> >

Hi Derek,

Actually, there is quite an interesting question we should think about:
what will happen if the VMs continue to run after a mismatched state is
detected between the PVM and SVM?

According to the rules of COLO, we should stop the VMs immediately to sync
the state between the PVM and SVM. But here, you let them continue to run
for a while, so more client network packets may arrive and dirty more memory
pages. Another side effect is that the new network packets will most likely
not be sent out, because their replies should differ once the PVM and SVM
are in different states.

So, IMHO, it makes no sense to let the VMs continue to run after they have
been detected to be in different states. Besides, I don't think it is easy
to construct this case in tests.

Thanks,
Hailiang

> Hello, Lukas & Zhanghailiang
>
> Thanks for your opinions.
> I went through my patch, and I am a little confused and would like to dig into
> it more.
>
> In this patch, colo_migrate_ram_before_checkpoint() runs before
> COLO_MESSAGE_CHECKPOINT_REQUEST, so the SVM and PVM should not enter the
> paused state.
>
> In the meantime, the packets to the PVM/SVM can still be compared, and an
> inconsistency is reported if they mismatch, right?
> Is it possible for this to introduce extra network latency?
>
> In my test (randwrite to disk by fio with direct=0), the ping times from
> another client to the PVM, with generic colo and with colo using this patch,
> are shown below. The network latency does not increase, as I expected.
>
> generic colo
> ```
> 64 bytes from 192.168.80.18: icmp_seq=87 ttl=64 time=28.109 ms
> 64 bytes from 192.168.80.18: icmp_seq=88 ttl=64 time=16.747 ms
> 64 bytes from 192.168.80.18: icmp_seq=89 ttl=64 time=2388.779 ms <----checkpoint start
> 64 bytes from 192.168.80.18: icmp_seq=90 ttl=64 time=1385.792 ms
> 64 bytes from 192.168.80.18: icmp_seq=91 ttl=64 time=384.896 ms <----checkpoint end
> 64 bytes from 192.168.80.18: icmp_seq=92 ttl=64 time=3.895 ms
> 64 bytes from 192.168.80.18: icmp_seq=93 ttl=64 time=1.020 ms
> 64 bytes from 192.168.80.18: icmp_seq=94 ttl=64 time=0.865 ms
> 64 bytes from 192.168.80.18: icmp_seq=95 ttl=64 time=0.854 ms
> 64 bytes from 192.168.80.18: icmp_seq=96 ttl=64 time=28.359 ms
> 64 bytes from 192.168.80.18: icmp_seq=97 ttl=64 time=12.309 ms
> 64 bytes from 192.168.80.18: icmp_seq=98 ttl=64 time=0.870 ms
> 64 bytes from 192.168.80.18: icmp_seq=99 ttl=64 time=2371.733 ms
> 64 bytes from 192.168.80.18: icmp_seq=100 ttl=64 time=1371.440 ms
> 64 bytes from 192.168.80.18: icmp_seq=101 ttl=64 time=366.414 ms
> 64 bytes from 192.168.80.18: icmp_seq=102 ttl=64 time=0.818 ms
> 64 bytes from 192.168.80.18: icmp_seq=103 ttl=64 time=0.997 ms
> ```
>
> colo with this patch
> ```
> 64 bytes from 192.168.80.18: icmp_seq=72 ttl=64 time=1.417 ms
> 64 bytes from 192.168.80.18: icmp_seq=73 ttl=64 time=0.931 ms
> 64 bytes from 192.168.80.18: icmp_seq=74 ttl=64 time=0.876 ms
> 64 bytes from 192.168.80.18: icmp_seq=75 ttl=64 time=1184.034 ms <----checkpoint start
> 64 bytes from 192.168.80.18: icmp_seq=76 ttl=64 time=181.297 ms <----checkpoint end
> 64 bytes from 192.168.80.18: icmp_seq=77 ttl=64 time=0.865 ms
> 64 bytes from 192.168.80.18: icmp_seq=78 ttl=64 time=0.858 ms
> 64 bytes from 192.168.80.18: icmp_seq=79 ttl=64 time=1.247 ms
> 64 bytes from 192.168.80.18: icmp_seq=80 ttl=64 time=0.946 ms
> 64 bytes from 192.168.80.18: icmp_seq=81 ttl=64 time=0.855 ms
> 64 bytes from 192.168.80.18: icmp_seq=82 ttl=64 time=0.868 ms
> 64 bytes from 192.168.80.18: icmp_seq=83 ttl=64 time=0.749 ms
> 64 bytes from 192.168.80.18: icmp_seq=84 ttl=64 time=2.154 ms
> 64 bytes from 192.168.80.18: icmp_seq=85 ttl=64 time=1499.186 ms
> 64 bytes from 192.168.80.18: icmp_seq=86 ttl=64 time=496.173 ms
> 64 bytes from 192.168.80.18: icmp_seq=87 ttl=64 time=0.854 ms
> 64 bytes from 192.168.80.18: icmp_seq=88 ttl=64 time=0.774 ms
> ```
>
> Thank you.
>
> Regards,
> Derek
>
> > Hailiang did have a patch to migrate ram between checkpoints, which should
> > help all workloads, but it wasn't merged back then. I think you can pick it
> > up again, rebase it and address David's and Eric's comments:
> > https://lore.kernel.org/qemu-devel/20200217012049.22988-3-zhang.zhanghailiang@huawei.com/T/#u
> >
> > Hailiang, are you ok with that?
> >
> > Regards,
> > Lukas Straub
On Sat, Aug 15, 2020 at 9:42 AM Zhanghailiang
<zhang.zhanghailiang@huawei.com> wrote:
>
> > -----Original Message-----
> > From: Derek Su [mailto:jwsu1986@gmail.com]
> > Sent: Thursday, August 13, 2020 6:28 PM
> > To: Lukas Straub <lukasstraub2@web.de>
> > Cc: Derek Su <dereksu@qnap.com>; qemu-devel@nongnu.org; Zhanghailiang <zhang.zhanghailiang@huawei.com>; chyang@qnap.com; quintela@redhat.com; dgilbert@redhat.com; ctcheng@qnap.com
> > Subject: Re: [PATCH v1 0/1] COLO: migrate dirty ram pages before colo checkpoint
> >
> > On Fri, Jul 31, 2020 at 3:52 PM Lukas Straub <lukasstraub2@web.de> wrote:
> > >
> > > On Sun, 21 Jun 2020 10:10:03 +0800
> > > Derek Su <dereksu@qnap.com> wrote:
> > >
> > > > This series is to reduce the guest's downtime during colo checkpoint
> > > > by migrating as many dirty ram pages as possible before the colo checkpoint.
> > > >
> > > > If the iteration count reaches COLO_RAM_MIGRATE_ITERATION_MAX or the
> > > > ram pending size drops below 'x-colo-migrate-ram-threshold',
> > > > stop the ram migration and do the colo checkpoint.
> > > >
> > > > Test environment:
> > > > Both the primary VM and the secondary VM have 1 GiB of ram and a 10 GbE NIC
> > > > for FT traffic.
> > > > One fio buffered write job runs on the guest.
> > > > The result shows the total primary VM downtime is decreased by ~40%.
> > > >
> > > > Please help to review it; suggestions are welcome.
> > > > Thanks.
> > >
> > > Hello Derek,
> > > Sorry for the late reply.
> > > I think this is not a good idea, because it unnecessarily introduces a delay
> > > between the checkpoint request and the checkpoint itself and thus impairs
> > > network-bound workloads through increased network latency. Workloads that are
> > > independent of the network don't cause many checkpoints anyway, so it doesn't
> > > help there either.
> > >
>
> Hi Derek,
>
> Actually, there is quite an interesting question we should think about:
> what will happen if the VMs continue to run after a mismatched state is
> detected between the PVM and SVM?
>
> According to the rules of COLO, we should stop the VMs immediately to sync
> the state between the PVM and SVM. But here, you let them continue to run
> for a while, so more client network packets may arrive and dirty more memory
> pages. Another side effect is that the new network packets will most likely
> not be sent out, because their replies should differ once the PVM and SVM
> are in different states.
>
> So, IMHO, it makes no sense to let the VMs continue to run after they have
> been detected to be in different states. Besides, I don't think it is easy
> to construct this case in tests.
>
> Thanks,
> Hailiang
>

Hello, Hailiang

Thanks, I got your point. In my tests, a mismatch between packets does not
happen, so the network latency does not increase.

By the way, I've tried your commit addressing this issue.
It is useful for workloads with little dirty memory and a low dirty rate.
But under a heavy "buffered IO read/write" workload, it makes the PVM resend
a massive number of the same dirty ram pages every cycle triggered by the
DEFAULT_RAM_PENDING_CHECK timer (default 1 second), which hurts IO
performance without improving the downtime.
Do you have any thoughts about this?

Is it possible to separate the checkpoints invoked by the periodic timer
from those invoked by a packet mismatch, and to use a different strategy to
cope with the long-downtime issue?

Thanks.

Regards,
Derek

> > Hello, Lukas & Zhanghailiang
> >
> > Thanks for your opinions.
> > I went through my patch, and I am a little confused and would like to dig into
> > it more.
> >
> > In this patch, colo_migrate_ram_before_checkpoint() runs before
> > COLO_MESSAGE_CHECKPOINT_REQUEST, so the SVM and PVM should not enter the
> > paused state.
> >
> > In the meantime, the packets to the PVM/SVM can still be compared, and an
> > inconsistency is reported if they mismatch, right?
> > Is it possible for this to introduce extra network latency?
> >
> > In my test (randwrite to disk by fio with direct=0), the ping times from
> > another client to the PVM, with generic colo and with colo using this patch,
> > are shown below. The network latency does not increase, as I expected.
> >
> > generic colo
> > ```
> > 64 bytes from 192.168.80.18: icmp_seq=87 ttl=64 time=28.109 ms
> > 64 bytes from 192.168.80.18: icmp_seq=88 ttl=64 time=16.747 ms
> > 64 bytes from 192.168.80.18: icmp_seq=89 ttl=64 time=2388.779 ms <----checkpoint start
> > 64 bytes from 192.168.80.18: icmp_seq=90 ttl=64 time=1385.792 ms
> > 64 bytes from 192.168.80.18: icmp_seq=91 ttl=64 time=384.896 ms <----checkpoint end
> > 64 bytes from 192.168.80.18: icmp_seq=92 ttl=64 time=3.895 ms
> > 64 bytes from 192.168.80.18: icmp_seq=93 ttl=64 time=1.020 ms
> > 64 bytes from 192.168.80.18: icmp_seq=94 ttl=64 time=0.865 ms
> > 64 bytes from 192.168.80.18: icmp_seq=95 ttl=64 time=0.854 ms
> > 64 bytes from 192.168.80.18: icmp_seq=96 ttl=64 time=28.359 ms
> > 64 bytes from 192.168.80.18: icmp_seq=97 ttl=64 time=12.309 ms
> > 64 bytes from 192.168.80.18: icmp_seq=98 ttl=64 time=0.870 ms
> > 64 bytes from 192.168.80.18: icmp_seq=99 ttl=64 time=2371.733 ms
> > 64 bytes from 192.168.80.18: icmp_seq=100 ttl=64 time=1371.440 ms
> > 64 bytes from 192.168.80.18: icmp_seq=101 ttl=64 time=366.414 ms
> > 64 bytes from 192.168.80.18: icmp_seq=102 ttl=64 time=0.818 ms
> > 64 bytes from 192.168.80.18: icmp_seq=103 ttl=64 time=0.997 ms
> > ```
> >
> > colo with this patch
> > ```
> > 64 bytes from 192.168.80.18: icmp_seq=72 ttl=64 time=1.417 ms
> > 64 bytes from 192.168.80.18: icmp_seq=73 ttl=64 time=0.931 ms
> > 64 bytes from 192.168.80.18: icmp_seq=74 ttl=64 time=0.876 ms
> > 64 bytes from 192.168.80.18: icmp_seq=75 ttl=64 time=1184.034 ms <----checkpoint start
> > 64 bytes from 192.168.80.18: icmp_seq=76 ttl=64 time=181.297 ms <----checkpoint end
> > 64 bytes from 192.168.80.18: icmp_seq=77 ttl=64 time=0.865 ms
> > 64 bytes from 192.168.80.18: icmp_seq=78 ttl=64 time=0.858 ms
> > 64 bytes from 192.168.80.18: icmp_seq=79 ttl=64 time=1.247 ms
> > 64 bytes from 192.168.80.18: icmp_seq=80 ttl=64 time=0.946 ms
> > 64 bytes from 192.168.80.18: icmp_seq=81 ttl=64 time=0.855 ms
> > 64 bytes from 192.168.80.18: icmp_seq=82 ttl=64 time=0.868 ms
> > 64 bytes from 192.168.80.18: icmp_seq=83 ttl=64 time=0.749 ms
> > 64 bytes from 192.168.80.18: icmp_seq=84 ttl=64 time=2.154 ms
> > 64 bytes from 192.168.80.18: icmp_seq=85 ttl=64 time=1499.186 ms
> > 64 bytes from 192.168.80.18: icmp_seq=86 ttl=64 time=496.173 ms
> > 64 bytes from 192.168.80.18: icmp_seq=87 ttl=64 time=0.854 ms
> > 64 bytes from 192.168.80.18: icmp_seq=88 ttl=64 time=0.774 ms
> > ```
> >
> > Thank you.
> >
> > Regards,
> > Derek
> >
> > > Hailiang did have a patch to migrate ram between checkpoints, which should
> > > help all workloads, but it wasn't merged back then. I think you can pick it
> > > up again, rebase it and address David's and Eric's comments:
> > > https://lore.kernel.org/qemu-devel/20200217012049.22988-3-zhang.zhanghailiang@huawei.com/T/#u
> > >
> > > Hailiang, are you ok with that?
> > >
> > > Regards,
> > > Lukas Straub
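Derek's last question, treating timer-triggered and mismatch-triggered checkpoints differently, could look roughly like the sketch below. None of these identifiers exist in QEMU; DEFAULT_RAM_PENDING_CHECK and colo-compare are only referenced in the comments, and the policy shown is just one possibility consistent with the concerns raised in this thread, not anything proposed as a patch.

```c
#include <stdbool.h>

/* Hypothetical checkpoint triggers; these enumerators do not exist in QEMU. */
enum colo_checkpoint_reason {
    COLO_CHECKPOINT_PERIODIC,        /* periodic timer (cf. DEFAULT_RAM_PENDING_CHECK) */
    COLO_CHECKPOINT_PACKET_MISMATCH, /* colo-compare saw diverging replies */
};

/* One possible per-trigger policy: pre-migrate dirty RAM only when no
 * client is waiting on a held-back reply. */
static bool should_premigrate_ram(enum colo_checkpoint_reason why)
{
    switch (why) {
    case COLO_CHECKPOINT_PERIODIC:
        /* Nothing is blocked on this checkpoint, so trading a longer
         * cycle for a shorter VM pause is acceptable. */
        return true;
    case COLO_CHECKPOINT_PACKET_MISMATCH:
        /* Replies are already diverging and being held back; checkpoint
         * immediately to keep network latency bounded. */
        return false;
    }
    return false;
}
```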