[v6,0/4] migration: UFFD write-tracking migration/snapshots

Message ID 20201209100811.190316-1-andrey.gruzdev@virtuozzo.com (mailing list archive)

Andrey Gruzdev Dec. 9, 2020, 10:08 a.m. UTC
This patch series is a rethinking of the ideas Denis Plotnikov implemented in
his series '[PATCH v0 0/4] migration: add background snapshot'.

Currently the only way to make an (external) live VM snapshot is to use the
existing dirty page logging migration mechanism. The main problem is that it
tends to produce a lot of page duplicates, because the running VM keeps updating
pages that have already been saved. As a result, the vmstate image is commonly
several times bigger than the non-zero part of the virtual machine's RSS. The
time required to converge RAM migration and the size of the snapshot image
depend heavily on the guest memory write rate, sometimes resulting in
unacceptably long snapshot creation time and huge image size.

This series proposes a way to solve these problems by using a different RAM
migration mechanism, based on the UFFD write protection management introduced
in the v5.7 kernel. The migration strategy is to 'freeze' guest RAM content
using write protection and to iteratively release protection for memory ranges
that have already been saved to the migration stream. At the same time we read
pending UFFD write fault events and save those pages out of order, with higher
priority.
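
To make the strategy above concrete, here is a toy Python model of the save
order. It is only a sketch under simplifying assumptions, not the code from this
series: pages are plain integers, exactly one page is saved per step, and the
names snapshot_order and write_faults are made up for illustration.

```python
from collections import deque


def snapshot_order(num_pages, write_faults):
    """Return the order in which pages reach the (simulated) migration stream.

    write_faults maps a step number to the pages the guest tries to write
    at that step (hypothetical input format, for illustration only).
    """
    protected = set(range(num_pages))  # all guest RAM starts write-protected
    pending = deque()                  # faulting pages waiting to be saved
    saved = []
    cursor = 0                         # position of the linear scan
    step = 0
    while protected:
        # Only writes to still-protected pages raise a fault.
        for page in write_faults.get(step, []):
            if page in protected and page not in pending:
                pending.append(page)
        if pending:
            # Faulting pages are saved first, out of order.
            page = pending.popleft()
        else:
            # Otherwise continue the linear scan over unsaved pages.
            while cursor not in protected:
                cursor += 1
            page = cursor
        saved.append(page)
        protected.discard(page)        # release protection once saved
        step += 1
    return saved


if __name__ == "__main__":
    print(snapshot_order(6, {1: [4], 2: [4, 5]}))
```

In this model every page is written to the stream exactly once, whatever the
guest write pattern is. A write to an already saved page raises no fault because
its protection has been released.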

How to use:
1. Enable the write-tracking migration capability
   virsh qemu-monitor-command <domain> --hmp migrate_set_capability background-snapshot on

2. Start the external migration to a file
   virsh qemu-monitor-command <domain> --hmp migrate exec:'cat > ./vm_state'

3. Wait for the migration to finish and check that it has reached the
   'completed' state.

Changes v4->v5:

* 1. Refactored util/userfaultfd.c code to support features required by postcopy.
* 2. Introduced checks for host kernel and guest memory backend compatibility
*    to the 'background-snapshot' branch in migrate_caps_check().
* 3. Switched to using trace_xxx instead of info_report()/error_report() for
*    cases where the error message must be hidden (probing UFFD-IO) or where the
*    info would litter the output if it went to stderr.
* 4. Added RCU_READ_LOCK_GUARDs to the code dealing with the RAM block list.
* 5. Added memory_region_ref() for each RAM block being write-protected.
* 6. Reused qemu_ram_block_from_host() instead of a custom RAM block lookup routine.
* 7. Stopped using the specific hwaddr/ram_addr_t types in favour of void */uint64_t.
* 8. Dropped the 'linear-scan-rate-limiting' patch for now. The reason is that
*    the chosen criterion for high-latency fault detection (i.e. the timestamp
*    of UFFD event fetch) is not representative enough for this task.
*    At the moment it looks like a premature optimization effort.
* 9. Dropped some unnecessary/unused code.

Changes v5->v6:

* 1. Consider possible hot-plugging/unplugging of memory devices - don't cache the
*    write-tracking support level in a static variable in
*    migrate_query_write_tracking(); check it each time one tries to enable the
*    'background-snapshot' capability.

Andrey Gruzdev (4):
  migration: introduce 'background-snapshot' migration capability
  migration: introduce UFFD-WP low-level interface helpers
  migration: support UFFD write fault processing in ram_save_iterate()
  migration: implementation of background snapshot thread

 include/exec/memory.h      |   8 +
 include/qemu/userfaultfd.h |  35 ++++
 migration/migration.c      | 357 ++++++++++++++++++++++++++++++++++++-
 migration/migration.h      |   4 +
 migration/ram.c            | 270 ++++++++++++++++++++++++++++
 migration/ram.h            |   6 +
 migration/savevm.c         |   1 -
 migration/savevm.h         |   2 +
 migration/trace-events     |   2 +
 qapi/migration.json        |   7 +-
 util/meson.build           |   1 +
 util/trace-events          |   9 +
 util/userfaultfd.c         | 345 +++++++++++++++++++++++++++++++++++
 13 files changed, 1043 insertions(+), 4 deletions(-)
 create mode 100644 include/qemu/userfaultfd.h
 create mode 100644 util/userfaultfd.c

Comments

Andrey Gruzdev Dec. 11, 2020, 1:13 p.m. UTC | #1
I've also measured wr-fault resolution latency for the case when the migration
stream is dumped to a file in cached mode. The result should approximately match
saving to the file fd directly, though I used 'migrate exec:<>' with a
hand-written tool.

VM config is 6 vCPUs + 16GB RAM, qcow2 image on a Seagate 7200.11 series 1.5TB HDD;
the snapshot goes to the same disk. Guest is Windows 10.

The test scenario is playing a full-HD YouTube video in Firefox while saving the snapshot.

Latency measurement begin/end points are fs/userfaultfd.c:handle_userfault() and
mm/userfaultfd.c:mwriteprotect_range(), respectively. For any faulting page, the
oldest wr-fault timestamp is accounted.
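
The accounting rule above (measure from the oldest fault on a page until its
protection is released) could be sketched in Python as follows. This is an
illustration only: the event tuples and the name fault_latencies_ms are
assumptions, and the real measurement is taken at the two kernel functions
named above, where protection is released for a range rather than a single page.

```python
def fault_latencies_ms(events):
    """Compute per-page write-fault resolution latencies.

    events is an iterable of (timestamp_ms, kind, page) tuples, where kind
    is "fault" or "resolve" (hypothetical format, for illustration only).
    """
    oldest = {}      # page -> timestamp of its oldest unresolved fault
    latencies = []
    for ts, kind, page in events:
        if kind == "fault":
            # Later faults on the same page do not reset the clock.
            oldest.setdefault(page, ts)
        elif kind == "resolve" and page in oldest:
            latencies.append(ts - oldest.pop(page))
    return latencies


if __name__ == "__main__":
    demo = [(0, "fault", 7), (3, "fault", 7), (10, "resolve", 7),
            (12, "fault", 9), (13, "resolve", 9)]
    print(fault_latencies_ms(demo))
```

With this rule, a page hit by several vCPUs before it is released contributes a
single sample, measured from the first of those faults.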

The whole snapshot took ~30 seconds, and the file size is around 3GB.
So far the picture doesn't look too bad. However, the 16-255 msec range worries
me a bit; it seems to cause audio backend buffer underflows sometimes.


      msecs               : count     distribution
          0 -> 1          : 111755   |****************************************|
          2 -> 3          : 52       |                                        |
          4 -> 7          : 105      |                                        |
          8 -> 15         : 428      |                                        |
         16 -> 31         : 335      |                                        |
         32 -> 63         : 4        |                                        |
         64 -> 127        : 8        |                                        |
        128 -> 255        : 5        |                                        |
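
For a rough sense of how large the worrying tail is, the bucket counts of the
histogram above can be summed up. The short Python sketch below does only that
arithmetic; the name tail_share is made up for illustration.

```python
# Bucket lower bounds (msecs) and counts copied from the histogram above.
HIST = [(0, 111755), (2, 52), (4, 105), (8, 428),
        (16, 335), (32, 4), (64, 8), (128, 5)]


def tail_share(hist, threshold_ms):
    """Return (tail_count, total_count, tail_fraction) for buckets whose
    lower bound is at or above threshold_ms."""
    total = sum(count for _, count in hist)
    tail = sum(count for low, count in hist if low >= threshold_ms)
    return tail, total, tail / total


if __name__ == "__main__":
    tail, total, share = tail_share(HIST, 16)
    print(f"{tail} of {total} faults took 16 msecs or more ({share:.2%})")
```

For the run above this gives 352 of 112692 faults in the 16-255 msec range,
i.e. about 0.31%.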
Peter Xu Dec. 11, 2020, 3:09 p.m. UTC | #2
Great test!  Thanks for sharing this information.

Yes, it's good enough for a 1st version, so it's already better than just
functionally working. :)

So did you try the last patch from your previous version to see whether it could
improve things in some way?  Again, we can gradually optimize on top of your
current work.

Btw, you reminded me: why don't we track all of this from the kernel? :) That's a
good idea.  So, how did you trace it yourself?  Something like below should
work with bpftrace, but I feel like you did it some other way, so just
fyi:

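        # cat latency.bpf
        // Record the entry time of handle_userfault() per thread.
        kprobe:handle_userfault
        {
                @start[tid] = nsecs;
        }

        // On return, add the elapsed time (in microseconds) to a histogram.
        kretprobe:handle_userfault
        {
                if (@start[tid]) {
                        $delay = nsecs - @start[tid];
                        delete(@start[tid]);
                        @delay_us = hist($delay / 1000);
                }
        }
        # bpftrace latency.bpf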

Tracing the return of handle_userfault() could be more accurate in that it also
covers the latency between UFFDIO_WRITEPROTECT and the vcpu being woken up again.
However it's not exact, because after a recent change to this code path in
commit f9bf352224d7 ("userfaultfd: simplify fault handling", 2020-08-03)
handle_userfault() could return even before the page fault is resolved.  It
should still be good enough in most cases, because even if that happens, the
vcpu will fault into handle_userfault() again, and we just get one more count.
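
The effect of such an early return on the histogram can be illustrated with a
small Python model of the entry/return pairing. It only mimics the logic of the
probes over a hand-made event list; the tuple format and the name
pair_entry_return are assumptions made for this sketch.

```python
def pair_entry_return(events):
    """Pair entry/return events per thread and return delays in microseconds.

    events is a list of (nsecs, tid, kind) tuples with kind "entry" or
    "return" (hypothetical format, for illustration only).
    """
    start = {}       # tid -> entry timestamp in nanoseconds
    delays_us = []
    for nsecs, tid, kind in events:
        if kind == "entry":
            start[tid] = nsecs
        elif kind == "return" and tid in start:
            delays_us.append((nsecs - start.pop(tid)) // 1000)
    return delays_us


if __name__ == "__main__":
    # One logical fault on thread 1: the handler returns early after 400 us,
    # the thread faults again and is finally woken up 2000 us later.
    one_fault = [(0, 1, "entry"), (400_000, 1, "return"),
                 (450_000, 1, "entry"), (2_450_000, 1, "return")]
    print(pair_entry_return(one_fault))
```

In this model the single fault shows up as two shorter samples (400 and 2000
microseconds) instead of one sample covering the whole 2450 microseconds.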

Thanks!
Andrey Gruzdev Dec. 15, 2020, 7:52 p.m. UTC | #3
Peter, thanks for the idea. I've now also tried with kretprobe, for Windows 10
and Ubuntu 20.04 guests, two runs for each. Windows looks ugly here :(

First is a series of runs without scan-rate-limiting.patch.
Windows 10:

      msecs               : count     distribution
          0 -> 1          : 131913   |****************************************|
          2 -> 3          : 106      |                                        |
          4 -> 7          : 362      |                                        |
          8 -> 15         : 619      |                                        |
         16 -> 31         : 28       |                                        |
         32 -> 63         : 1        |                                        |
         64 -> 127        : 2        |                                        |


      msecs               : count     distribution
          0 -> 1          : 199273   |****************************************|
          2 -> 3          : 190      |                                        |
          4 -> 7          : 425      |                                        |
          8 -> 15         : 927      |                                        |
         16 -> 31         : 69       |                                        |
         32 -> 63         : 3        |                                        |
         64 -> 127        : 16       |                                        |
        128 -> 255        : 2        |                                        |

Ubuntu 20.04:

      msecs               : count     distribution
          0 -> 1          : 104954   |****************************************|
          2 -> 3          : 9        |                                        |

      msecs               : count     distribution
          0 -> 1          : 147159   |****************************************|
          2 -> 3          : 13       |                                        |
          4 -> 7          : 0        |                                        |
          8 -> 15         : 0        |                                        |
         16 -> 31         : 0        |                                        |
         32 -> 63         : 0        |                                        |
         64 -> 127        : 1        |                                        |


Here are runs with scan-rate-limiting.patch.
Windows 10:

      msecs               : count     distribution
          0 -> 1          : 234492   |****************************************|
          2 -> 3          : 66       |                                        |
          4 -> 7          : 219      |                                        |
          8 -> 15         : 109      |                                        |
         16 -> 31         : 0        |                                        |
         32 -> 63         : 0        |                                        |
         64 -> 127        : 1        |                                        |

      msecs               : count     distribution
          0 -> 1          : 183171   |****************************************|
          2 -> 3          : 109      |                                        |
          4 -> 7          : 281      |                                        |
          8 -> 15         : 444      |                                        |
         16 -> 31         : 3        |                                        |
         32 -> 63         : 1        |                                        |

Ubuntu 20.04:

      msecs               : count     distribution
          0 -> 1          : 92224    |****************************************|
          2 -> 3          : 9        |                                        |
          4 -> 7          : 0        |                                        |
          8 -> 15         : 0        |                                        |
         16 -> 31         : 1        |                                        |
         32 -> 63         : 0        |                                        |
         64 -> 127        : 1        |                                        |

      msecs               : count     distribution
          0 -> 1          : 97021    |****************************************|
          2 -> 3          : 7        |                                        |
          4 -> 7          : 0        |                                        |
          8 -> 15         : 0        |                                        |
         16 -> 31         : 0        |                                        |
         32 -> 63         : 0        |                                        |
         64 -> 127        : 0        |                                        |
        128 -> 255        : 1        |                                        |

So, the initial variant of rate limiting has some positive effect, but it is not
very noticeable. The interesting case is the Windows guest: why is the difference
so large compared to Linux? The reason (theoretically) might be some of the
virtio or QXL drivers, it's hard to say. The Windows VM has at least been
configured with a set of Hyper-V enlightenments, so there's nothing to improve in
the domain config.
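
One way to put numbers on the Windows 10 comparison is to pool the two runs of
each configuration and count the samples at or above a given bucket. The Python
sketch below does only that; the counts are copied from the histograms above,
while the name pooled_tail and the choice of threshold are assumptions made for
illustration.

```python
# Windows 10 runs: bucket lower bound (msecs) -> count, copied from above.
WITHOUT_PATCH = [
    {0: 131913, 2: 106, 4: 362, 8: 619, 16: 28, 32: 1, 64: 2},
    {0: 199273, 2: 190, 4: 425, 8: 927, 16: 69, 32: 3, 64: 16, 128: 2},
]
WITH_PATCH = [
    {0: 234492, 2: 66, 4: 219, 8: 109, 16: 0, 32: 0, 64: 1},
    {0: 183171, 2: 109, 4: 281, 8: 444, 16: 3, 32: 1},
]


def pooled_tail(runs, threshold_ms):
    """Return (tail_count, total_count) summed over all runs, where the tail
    is every bucket whose lower bound is at or above threshold_ms."""
    total = sum(sum(run.values()) for run in runs)
    tail = sum(count for run in runs
               for low, count in run.items() if low >= threshold_ms)
    return tail, total


if __name__ == "__main__":
    for name, runs in (("without", WITHOUT_PATCH), ("with", WITH_PATCH)):
        tail, total = pooled_tail(runs, 16)
        print(f"{name} patch: {tail} of {total} samples at 16 msecs or more")
```

At the 16 msec threshold this gives 121 of 333936 samples without the patch and
5 of 418896 with it. With only two runs per configuration, such numbers are
indicative at best.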

For Linux guests latencies are good enough without any additional effort.

Also, I've missed some code to deal with snapshotting of a suspended guest, so I'll
make a v7 series with that fix and also try to add a more effective solution to reduce
millisecond-grade latencies.

And yes, I've used a bpftrace-like tool - BCC from iovisor with the Python frontend. It seems
a bit more friendly than bpftrace.
Andrey Gruzdev Dec. 15, 2020, 7:53 p.m. UTC | #4
Peter, thanks for the idea. I've now also tried with kretprobe, for Windows 10
and Ubuntu 20.04 guests, two runs for each. Windows looks ugly here :(

First is a series of runs without scan-rate-limiting.patch.
Windows 10:

      msecs               : count     distribution
          0 -> 1          : 131913   |****************************************|
          2 -> 3          : 106      |                                        |
          4 -> 7          : 362      |                                        |
          8 -> 15         : 619      |                                        |
         16 -> 31         : 28       |                                        |
         32 -> 63         : 1        |                                        |
         64 -> 127        : 2        |                                        |


      msecs               : count     distribution
          0 -> 1          : 199273   |****************************************|
          2 -> 3          : 190      |                                        |
          4 -> 7          : 425      |                                        |
          8 -> 15         : 927      |                                        |
         16 -> 31         : 69       |                                        |
         32 -> 63         : 3        |                                        |
         64 -> 127        : 16       |                                        |
        128 -> 255        : 2        |                                        |

Ubuntu 20.04:

      msecs               : count     distribution
          0 -> 1          : 104954   |****************************************|
          2 -> 3          : 9        |                                        |

      msecs               : count     distribution
          0 -> 1          : 147159   |****************************************|
          2 -> 3          : 13       |                                        |
          4 -> 7          : 0        |                                        |
          8 -> 15         : 0        |                                        |
         16 -> 31         : 0        |                                        |
         32 -> 63         : 0        |                                        |
         64 -> 127        : 1        |                                        |


Here are runs with scan-rate-limiting.patch.
Windows 10:

      msecs               : count     distribution
          0 -> 1          : 234492   |****************************************|
          2 -> 3          : 66       |                                        |
          4 -> 7          : 219      |                                        |
          8 -> 15         : 109      |                                        |
         16 -> 31         : 0        |                                        |
         32 -> 63         : 0        |                                        |
         64 -> 127        : 1        |                                        |

      msecs               : count     distribution
          0 -> 1          : 183171   |****************************************|
          2 -> 3          : 109      |                                        |
          4 -> 7          : 281      |                                        |
          8 -> 15         : 444      |                                        |
         16 -> 31         : 3        |                                        |
         32 -> 63         : 1        |                                        |

Ubuntu 20.04:

      msecs               : count     distribution
          0 -> 1          : 92224    |****************************************|
          2 -> 3          : 9        |                                        |
          4 -> 7          : 0        |                                        |
          8 -> 15         : 0        |                                        |
         16 -> 31         : 1        |                                        |
         32 -> 63         : 0        |                                        |
         64 -> 127        : 1        |                                        |

      msecs               : count     distribution
          0 -> 1          : 97021    |****************************************|
          2 -> 3          : 7        |                                        |
          4 -> 7          : 0        |                                        |
          8 -> 15         : 0        |                                        |
         16 -> 31         : 0        |                                        |
         32 -> 63         : 0        |                                        |
         64 -> 127        : 0        |                                        |
        128 -> 255        : 1        |                                        |

So, the initial variant of rate limiting has some positive effect, but it is not
very noticeable. The interesting case is the Windows guest: why is the difference
so large compared to Linux? The reason (theoretically) might be some of the
virtio or QXL drivers, it's hard to say. The Windows VM has at least been
configured with a set of Hyper-V enlightenments, so there's nothing to improve in
the domain config.

For Linux guests latencies are good enough without any additional effort.

Also, I've missed some code to deal with snapshotting of a suspended guest, so I'll make
a v7 series with the fix and also try to add a more effective solution to reduce millisecond-grade
latencies.

And yes, I've used a bpftrace-like tool - BCC from iovisor with the Python frontend. It seems
a bit more friendly than bpftrace.
Peter Xu Dec. 16, 2020, 9:02 p.m. UTC | #5
On Tue, Dec 15, 2020 at 10:53:13PM +0300, Andrey Gruzdev wrote:
> First are series of runs without scan-rate-limiting.patch.
> Windows 10:
> 
>      msecs               : count     distribution
>          0 -> 1          : 131913   |****************************************|
>          2 -> 3          : 106      |                                        |
>          4 -> 7          : 362      |                                        |
>          8 -> 15         : 619      |                                        |
>         16 -> 31         : 28       |                                        |
>         32 -> 63         : 1        |                                        |
>         64 -> 127        : 2        |                                        |
> 
> 
>      msecs               : count     distribution
>          0 -> 1          : 199273   |****************************************|
>          2 -> 3          : 190      |                                        |
>          4 -> 7          : 425      |                                        |
>          8 -> 15         : 927      |                                        |
>         16 -> 31         : 69       |                                        |
>         32 -> 63         : 3        |                                        |
>         64 -> 127        : 16       |                                        |
>        128 -> 255        : 2        |                                        |
> 
> Ubuntu 20.04:
> 
>      msecs               : count     distribution
>          0 -> 1          : 104954   |****************************************|
>          2 -> 3          : 9        |                                        |
> 
>      msecs               : count     distribution
>          0 -> 1          : 147159   |****************************************|
>          2 -> 3          : 13       |                                        |
>          4 -> 7          : 0        |                                        |
>          8 -> 15         : 0        |                                        |
>         16 -> 31         : 0        |                                        |
>         32 -> 63         : 0        |                                        |
>         64 -> 127        : 1        |                                        |
> 
> 
> Here are runs with scan-rate-limiting.patch.
> Windows 10:
> 
>      msecs               : count     distribution
>          0 -> 1          : 234492   |****************************************|
>          2 -> 3          : 66       |                                        |
>          4 -> 7          : 219      |                                        |
>          8 -> 15         : 109      |                                        |
>         16 -> 31         : 0        |                                        |
>         32 -> 63         : 0        |                                        |
>         64 -> 127        : 1        |                                        |
> 
>      msecs               : count     distribution
>          0 -> 1          : 183171   |****************************************|
>          2 -> 3          : 109      |                                        |
>          4 -> 7          : 281      |                                        |
>          8 -> 15         : 444      |                                        |
>         16 -> 31         : 3        |                                        |
>         32 -> 63         : 1        |                                        |
> 
> Ubuntu 20.04:
> 
>      msecs               : count     distribution
>          0 -> 1          : 92224    |****************************************|
>          2 -> 3          : 9        |                                        |
>          4 -> 7          : 0        |                                        |
>          8 -> 15         : 0        |                                        |
>         16 -> 31         : 1        |                                        |
>         32 -> 63         : 0        |                                        |
>         64 -> 127        : 1        |                                        |
> 
>      msecs               : count     distribution
>          0 -> 1          : 97021    |****************************************|
>          2 -> 3          : 7        |                                        |
>          4 -> 7          : 0        |                                        |
>          8 -> 15         : 0        |                                        |
>         16 -> 31         : 0        |                                        |
>         32 -> 63         : 0        |                                        |
>         64 -> 127        : 0        |                                        |
>        128 -> 255        : 1        |                                        |
> 
> So, the initial variant of rate-limiting has some positive effect, but it is not very
> noticeable. The Windows guest case is interesting: why is the difference so large
> compared to Linux? The reason might (theoretically) be some of the virtio or QXL drivers;
> it's hard to say. The Windows VM has at least been configured with a set of Hyper-V
> enlightenments, so there's nothing left to improve in the domain config.
> 
> For Linux guests, latencies are good enough without any additional effort.

Interesting...

> 
> Also, I've missed some code to deal with snapshotting of a suspended guest, so I'll make
> a v7 series with the fix and also try to add a more effective solution to reduce
> millisecond-grade latencies.
> 
> And yes, I've used a bpftrace-like tool: BCC from iovisor with the Python frontend. It seems
> a bit more friendly than bpftrace.

Do you think it's a good idea to also include your measurement script when
posting v7?  It could be a good fit for scripts/, I think.

Seems 6.0 dev window is open; hopefully Dave or Juan would have time to look at
this series soon.

Thanks,
Andrey Gruzdev Dec. 17, 2020, 7:50 a.m. UTC | #6
On 17.12.2020 00:02, Peter Xu wrote:
> On Tue, Dec 15, 2020 at 10:53:13PM +0300, Andrey Gruzdev wrote:
>> First are series of runs without scan-rate-limiting.patch.
>> Windows 10:
>>
>>       msecs               : count     distribution
>>           0 -> 1          : 131913   |****************************************|
>>           2 -> 3          : 106      |                                        |
>>           4 -> 7          : 362      |                                        |
>>           8 -> 15         : 619      |                                        |
>>          16 -> 31         : 28       |                                        |
>>          32 -> 63         : 1        |                                        |
>>          64 -> 127        : 2        |                                        |
>>
>>
>>       msecs               : count     distribution
>>           0 -> 1          : 199273   |****************************************|
>>           2 -> 3          : 190      |                                        |
>>           4 -> 7          : 425      |                                        |
>>           8 -> 15         : 927      |                                        |
>>          16 -> 31         : 69       |                                        |
>>          32 -> 63         : 3        |                                        |
>>          64 -> 127        : 16       |                                        |
>>         128 -> 255        : 2        |                                        |
>>
>> Ubuntu 20.04:
>>
>>       msecs               : count     distribution
>>           0 -> 1          : 104954   |****************************************|
>>           2 -> 3          : 9        |                                        |
>>
>>       msecs               : count     distribution
>>           0 -> 1          : 147159   |****************************************|
>>           2 -> 3          : 13       |                                        |
>>           4 -> 7          : 0        |                                        |
>>           8 -> 15         : 0        |                                        |
>>          16 -> 31         : 0        |                                        |
>>          32 -> 63         : 0        |                                        |
>>          64 -> 127        : 1        |                                        |
>>
>>
>> Here are runs with scan-rate-limiting.patch.
>> Windows 10:
>>
>>       msecs               : count     distribution
>>           0 -> 1          : 234492   |****************************************|
>>           2 -> 3          : 66       |                                        |
>>           4 -> 7          : 219      |                                        |
>>           8 -> 15         : 109      |                                        |
>>          16 -> 31         : 0        |                                        |
>>          32 -> 63         : 0        |                                        |
>>          64 -> 127        : 1        |                                        |
>>
>>       msecs               : count     distribution
>>           0 -> 1          : 183171   |****************************************|
>>           2 -> 3          : 109      |                                        |
>>           4 -> 7          : 281      |                                        |
>>           8 -> 15         : 444      |                                        |
>>          16 -> 31         : 3        |                                        |
>>          32 -> 63         : 1        |                                        |
>>
>> Ubuntu 20.04:
>>
>>       msecs               : count     distribution
>>           0 -> 1          : 92224    |****************************************|
>>           2 -> 3          : 9        |                                        |
>>           4 -> 7          : 0        |                                        |
>>           8 -> 15         : 0        |                                        |
>>          16 -> 31         : 1        |                                        |
>>          32 -> 63         : 0        |                                        |
>>          64 -> 127        : 1        |                                        |
>>
>>       msecs               : count     distribution
>>           0 -> 1          : 97021    |****************************************|
>>           2 -> 3          : 7        |                                        |
>>           4 -> 7          : 0        |                                        |
>>           8 -> 15         : 0        |                                        |
>>          16 -> 31         : 0        |                                        |
>>          32 -> 63         : 0        |                                        |
>>          64 -> 127        : 0        |                                        |
>>         128 -> 255        : 1        |                                        |
>>
>> So, the initial variant of rate-limiting has some positive effect, but it is not very
>> noticeable. The Windows guest case is interesting: why is the difference so large
>> compared to Linux? The reason might (theoretically) be some of the virtio or QXL drivers;
>> it's hard to say. The Windows VM has at least been configured with a set of Hyper-V
>> enlightenments, so there's nothing left to improve in the domain config.
>>
>> For Linux guests, latencies are good enough without any additional effort.
> Interesting...
>
>> Also, I've missed some code to deal with snapshotting of a suspended guest, so I'll make
>> a v7 series with the fix and also try to add a more effective solution to reduce
>> millisecond-grade latencies.
>>
>> And yes, I've used a bpftrace-like tool: BCC from iovisor with the Python frontend. It seems
>> a bit more friendly than bpftrace.
> Do you think it's a good idea to also include your measurement script when
> posting v7?  It could be a good fit for scripts/, I think.
>
> Seems 6.0 dev window is open; hopefully Dave or Juan would have time to look at
> this series soon.
>
> Thanks,
>
Yes, I think it's good to have this script in the same tree.

Nice news about 6.0! For v7, I've added a fix for a suspended guest, since previously we'd hit
an assert() on an invalid runstate transition when doing the initial
vm_stop_force_state(RUN_STATE_PAUSED). I've also disabled dirty page logging and log syncing
for background snapshots.

Thanks,