
[v4,0/6] UFFD write-tracking migration/snapshots

Message ID: 20201126151734.743849-1-andrey.gruzdev@virtuozzo.com

Message

Andrey Gruzdev Nov. 26, 2020, 3:17 p.m. UTC
This patch series is a rethinking of the ideas Denis Plotnikov implemented in
his series '[PATCH v0 0/4] migration: add background snapshot'.

Currently the only way to make an (external) live VM snapshot is to use the
existing dirty page logging migration mechanism. The main problem is that it
tends to produce a lot of duplicate pages while the running VM keeps updating
pages that have already been saved. As a result, the vmstate image is commonly
several times bigger than the non-zero part of the virtual machine's RSS. The
time required for RAM migration to converge and the size of the snapshot image
depend heavily on the guest memory write rate, sometimes resulting in
unacceptably long snapshot creation times and huge image sizes.

This series proposes a way to solve the aforementioned problems. It uses a
different RAM migration mechanism based on the UFFD write-protection management
introduced in the v5.7 kernel. The migration strategy is to 'freeze' guest RAM
content using write protection and iteratively release the protection for
memory ranges that have already been saved to the migration stream. At the same
time, pending UFFD write-fault events are read and the faulting pages are saved
out of order with higher priority.
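
For reference, here is a minimal user-space sketch of the underlying kernel
interface this strategy relies on (an illustration only, not QEMU code: error
handling is omitted, and a page-aligned 'area'/'len', 4 KiB pages and a v5.7+
kernel are assumed):

    /* Sketch: register a RAM range for uffd-wp, freeze it, handle one fault. */
    #include <fcntl.h>
    #include <linux/userfaultfd.h>
    #include <stdint.h>
    #include <sys/ioctl.h>
    #include <sys/syscall.h>
    #include <unistd.h>

    static int uffd_wp_sketch(void *area, uint64_t len)
    {
        int uffd = syscall(__NR_userfaultfd, O_CLOEXEC);
        struct uffdio_api api = { .api = UFFD_API,
                                  .features = UFFD_FEATURE_PAGEFAULT_FLAG_WP };
        ioctl(uffd, UFFDIO_API, &api);

        /* Register the range so write faults are reported on 'uffd'. */
        struct uffdio_register reg = {
            .range = { .start = (uintptr_t) area, .len = len },
            .mode  = UFFDIO_REGISTER_MODE_WP,
        };
        ioctl(uffd, UFFDIO_REGISTER, &reg);

        /* 'Freeze' the range: any write now raises a UFFD event. */
        struct uffdio_writeprotect wp = {
            .range = { .start = (uintptr_t) area, .len = len },
            .mode  = UFFDIO_WRITEPROTECT_MODE_WP,
        };
        ioctl(uffd, UFFDIO_WRITEPROTECT, &wp);

        /* Read one write-fault event; the faulting page would be saved
         * out of order and then un-protected so the writer can continue. */
        struct uffd_msg msg;
        if (read(uffd, &msg, sizeof(msg)) == sizeof(msg) &&
            msg.event == UFFD_EVENT_PAGEFAULT &&
            (msg.arg.pagefault.flags & UFFD_PAGEFAULT_FLAG_WP)) {
            struct uffdio_writeprotect unwp = {
                .range = { .start = msg.arg.pagefault.address & ~0xfffULL,
                           .len   = 4096 },
                .mode  = 0, /* clear write protection for this page */
            };
            ioctl(uffd, UFFDIO_WRITEPROTECT, &unwp);
        }
        return uffd;
    }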

How to use:
1. Enable the background snapshot migration capability
   virsh qemu-monitor-command <domain> --hmp migrate_set_capability background-snapshot on

2. Start the external migration to a file
   virsh qemu-monitor-command <domain> --hmp migrate exec:'cat > ./vm_state'

3. Wait for the migration to finish and check that it has reached the
   'completed' state.

Changes v3->v4:

* 1. Renamed the migrate capability 'track-writes-ram'->'background-snapshot'.
* 2. Use an array of incompatible caps to replace bulky 'if' constructs.
* 3. Moved UFFD low-level code to a separate module ('util/userfaultfd.c').
* 4. Always do UFFD write-unprotect on cleanup; just closing the file descriptor
*    won't clean up PTEs anyhow; it releases registration ranges, wait
*    queues etc. but won't clean up the process MM context at the MMU level.
* 5. Allow enabling the 'background-snapshot' capability on Linux hosts only.
* 6. Put UFFD code usage under an '#ifdef CONFIG_LINUX' prerequisite.
* 7. Removed the 'wt_' prefix from the RAMState struct.
* 8. Refactored ram_find_and_save_block() to make it cleaner - poll UFFD
*    write-fault events in get_queued_page(), use ram_save_host_page_pre()/
*    ram_save_host_page_post() notifiers around ram_save_host_page()
*    instead of bulky inline write-unprotect code.

Andrey Gruzdev (6):
  introduce 'background-snapshot' migration capability
  introduce UFFD-WP low-level interface helpers
  support UFFD write fault processing in ram_save_iterate()
  implementation of background snapshot thread
  the rest of write tracking migration code
  introduce simple linear scan rate limiting mechanism

 include/exec/memory.h      |   7 +
 include/qemu/userfaultfd.h |  29 ++++
 migration/migration.c      | 314 +++++++++++++++++++++++++++++++++-
 migration/migration.h      |   4 +
 migration/ram.c            | 334 ++++++++++++++++++++++++++++++++++++-
 migration/ram.h            |   4 +
 migration/savevm.c         |   1 -
 migration/savevm.h         |   2 +
 qapi/migration.json        |   7 +-
 util/meson.build           |   1 +
 util/userfaultfd.c         | 215 ++++++++++++++++++++++++
 11 files changed, 908 insertions(+), 10 deletions(-)
 create mode 100644 include/qemu/userfaultfd.h
 create mode 100644 util/userfaultfd.c

Comments

Peter Krempa Nov. 26, 2020, 3:47 p.m. UTC | #1
On Thu, Nov 26, 2020 at 18:17:28 +0300, Andrey Gruzdev via wrote:
> This patch series is a kind of 'rethinking' of Denis Plotnikov's ideas he's
> implemented in his series '[PATCH v0 0/4] migration: add background snapshot'.
> 
> Currently the only way to make (external) live VM snapshot is using existing
> dirty page logging migration mechanism. The main problem is that it tends to
> produce a lot of page duplicates while running VM goes on updating already
> saved pages. That leads to the fact that vmstate image size is commonly several
> times bigger then non-zero part of virtual machine's RSS. Time required to
> converge RAM migration and the size of snapshot image severely depend on the
> guest memory write rate, sometimes resulting in unacceptably long snapshot
> creation time and huge image size.
> 
> This series propose a way to solve the aforementioned problems. This is done
> by using different RAM migration mechanism based on UFFD write protection
> management introduced in v5.7 kernel. The migration strategy is to 'freeze'
> guest RAM content using write-protection and iteratively release protection
> for memory ranges that have already been saved to the migration stream.
> At the same time we read in pending UFFD write fault events and save those
> pages out-of-order with higher priority.

This sounds amazing! Based on your description I assume that the final
memory image contains the state from the beginning of the migration.

This would make it desirable for the 'migrate' qmp command to be used as
part of a 'transaction' qmp command so that we can do an instant disk
and memory snapshot without any extraneous pausing of the VM.

I'll have a look at using this mechanism in libvirt natively.
Andrey Gruzdev Nov. 27, 2020, 8:21 a.m. UTC | #2
On 26.11.2020 18:47, Peter Krempa wrote:
> On Thu, Nov 26, 2020 at 18:17:28 +0300, Andrey Gruzdev via wrote:
>> This patch series is a kind of 'rethinking' of Denis Plotnikov's ideas he's
>> implemented in his series '[PATCH v0 0/4] migration: add background snapshot'.
>>
>> Currently the only way to make (external) live VM snapshot is using existing
>> dirty page logging migration mechanism. The main problem is that it tends to
>> produce a lot of page duplicates while running VM goes on updating already
>> saved pages. That leads to the fact that vmstate image size is commonly several
>> times bigger then non-zero part of virtual machine's RSS. Time required to
>> converge RAM migration and the size of snapshot image severely depend on the
>> guest memory write rate, sometimes resulting in unacceptably long snapshot
>> creation time and huge image size.
>>
>> This series propose a way to solve the aforementioned problems. This is done
>> by using different RAM migration mechanism based on UFFD write protection
>> management introduced in v5.7 kernel. The migration strategy is to 'freeze'
>> guest RAM content using write-protection and iteratively release protection
>> for memory ranges that have already been saved to the migration stream.
>> At the same time we read in pending UFFD write fault events and save those
>> pages out-of-order with higher priority.
> 
> This sounds amazing! Based on your description I assume that the final
> memory image contains state image from the beginning of the migration.
> 
> This would make it desirable for the 'migrate' qmp command to be used as
> part of a 'transaction' qmp command so that we can do an instant disk
> and memory snapshot without any extraneous pausing of the VM.
> 
> I'll have a look at using this mechanism in libvirt natively.
> 

Correct, the final image contains the state at the beginning of the migration.

So far, if I'm not missing something about libvirt, for external
snapshot creation it performs a sequence like this:
migrate(fd)->transaction(blockdev-snapshot-all)->cont.

So, when the 'background-snapshot' capability is enabled, the sequence
would change to:
stop->transaction(blockdev-snapshot-all)->migrate(fd).
With background snapshot migration it will finish with the VM running, so
there's no need for 'cont' here.

Thanks, I'd really appreciate it if you looked into integrating the new
mechanism into libvirt!
Peter Krempa Nov. 27, 2020, 9:49 a.m. UTC | #3
On Fri, Nov 27, 2020 at 11:21:39 +0300, Andrey Gruzdev wrote:
> On 26.11.2020 18:47, Peter Krempa wrote:
> > On Thu, Nov 26, 2020 at 18:17:28 +0300, Andrey Gruzdev via wrote:
> > > This patch series is a kind of 'rethinking' of Denis Plotnikov's ideas he's
> > > implemented in his series '[PATCH v0 0/4] migration: add background snapshot'.
> > > 
> > > Currently the only way to make (external) live VM snapshot is using existing
> > > dirty page logging migration mechanism. The main problem is that it tends to
> > > produce a lot of page duplicates while running VM goes on updating already
> > > saved pages. That leads to the fact that vmstate image size is commonly several
> > > times bigger then non-zero part of virtual machine's RSS. Time required to
> > > converge RAM migration and the size of snapshot image severely depend on the
> > > guest memory write rate, sometimes resulting in unacceptably long snapshot
> > > creation time and huge image size.
> > > 
> > > This series propose a way to solve the aforementioned problems. This is done
> > > by using different RAM migration mechanism based on UFFD write protection
> > > management introduced in v5.7 kernel. The migration strategy is to 'freeze'
> > > guest RAM content using write-protection and iteratively release protection
> > > for memory ranges that have already been saved to the migration stream.
> > > At the same time we read in pending UFFD write fault events and save those
> > > pages out-of-order with higher priority.
> > 
> > This sounds amazing! Based on your description I assume that the final
> > memory image contains state image from the beginning of the migration.
> > 
> > This would make it desirable for the 'migrate' qmp command to be used as
> > part of a 'transaction' qmp command so that we can do an instant disk
> > and memory snapshot without any extraneous pausing of the VM.
> > 
> > I'll have a look at using this mechanism in libvirt natively.
> > 
> 
> Correct, the final image contains state at the beginning of migration.
> 
> So far, if I'm not missing something about libvirt, for external snapshot
> creation it performs a sequence like that:
> migrate(fd)->transaction(blockdev-snapshot-all)->cont.
> 
> So, in case 'background-snapshot' capability is enabled, the sequence would
> change to:
> stop->transaction(blockdev-snapshot-all)->migrate(fd).
> With background snapshot migration it will finish with VM running so there's
> not need to 'cont' here.

Yes, that's correct.

The reason I've suggested making 'migrate' part of a 'transaction'
is that it would remove the need to stop the VM for the disk snapshot part.
Andrey Gruzdev Nov. 27, 2020, 10 a.m. UTC | #4
On 27.11.2020 12:49, Peter Krempa wrote:
> On Fri, Nov 27, 2020 at 11:21:39 +0300, Andrey Gruzdev wrote:
>> On 26.11.2020 18:47, Peter Krempa wrote:
>>> On Thu, Nov 26, 2020 at 18:17:28 +0300, Andrey Gruzdev via wrote:
>>>> This patch series is a kind of 'rethinking' of Denis Plotnikov's ideas he's
>>>> implemented in his series '[PATCH v0 0/4] migration: add background snapshot'.
>>>>
>>>> Currently the only way to make (external) live VM snapshot is using existing
>>>> dirty page logging migration mechanism. The main problem is that it tends to
>>>> produce a lot of page duplicates while running VM goes on updating already
>>>> saved pages. That leads to the fact that vmstate image size is commonly several
>>>> times bigger then non-zero part of virtual machine's RSS. Time required to
>>>> converge RAM migration and the size of snapshot image severely depend on the
>>>> guest memory write rate, sometimes resulting in unacceptably long snapshot
>>>> creation time and huge image size.
>>>>
>>>> This series propose a way to solve the aforementioned problems. This is done
>>>> by using different RAM migration mechanism based on UFFD write protection
>>>> management introduced in v5.7 kernel. The migration strategy is to 'freeze'
>>>> guest RAM content using write-protection and iteratively release protection
>>>> for memory ranges that have already been saved to the migration stream.
>>>> At the same time we read in pending UFFD write fault events and save those
>>>> pages out-of-order with higher priority.
>>>
>>> This sounds amazing! Based on your description I assume that the final
>>> memory image contains state image from the beginning of the migration.
>>>
>>> This would make it desirable for the 'migrate' qmp command to be used as
>>> part of a 'transaction' qmp command so that we can do an instant disk
>>> and memory snapshot without any extraneous pausing of the VM.
>>>
>>> I'll have a look at using this mechanism in libvirt natively.
>>>
>>
>> Correct, the final image contains state at the beginning of migration.
>>
>> So far, if I'm not missing something about libvirt, for external snapshot
>> creation it performs a sequence like that:
>> migrate(fd)->transaction(blockdev-snapshot-all)->cont.
>>
>> So, in case 'background-snapshot' capability is enabled, the sequence would
>> change to:
>> stop->transaction(blockdev-snapshot-all)->migrate(fd).
>> With background snapshot migration it will finish with VM running so there's
>> not need to 'cont' here.
> 
> Yes, that's correct.
> 
> The reason I've suggested that 'migrate' being part of a 'transaction'
> is that it would remove the need to stop it for the disk snapshot part.
> 

Hmm, I believe stopping the VM for a short time is unavoidable to keep the
saved device state consistent with the blockdev snapshot base. Maybe I've
missed something, but it seems logical.
Peter Xu Nov. 27, 2020, 3:45 p.m. UTC | #5
On Fri, Nov 27, 2020 at 01:00:48PM +0300, Andrey Gruzdev wrote:
> On 27.11.2020 12:49, Peter Krempa wrote:
> > On Fri, Nov 27, 2020 at 11:21:39 +0300, Andrey Gruzdev wrote:
> > > On 26.11.2020 18:47, Peter Krempa wrote:
> > > > On Thu, Nov 26, 2020 at 18:17:28 +0300, Andrey Gruzdev via wrote:
> > > > > This patch series is a kind of 'rethinking' of Denis Plotnikov's ideas he's
> > > > > implemented in his series '[PATCH v0 0/4] migration: add background snapshot'.
> > > > > 
> > > > > Currently the only way to make (external) live VM snapshot is using existing
> > > > > dirty page logging migration mechanism. The main problem is that it tends to
> > > > > produce a lot of page duplicates while running VM goes on updating already
> > > > > saved pages. That leads to the fact that vmstate image size is commonly several
> > > > > times bigger then non-zero part of virtual machine's RSS. Time required to
> > > > > converge RAM migration and the size of snapshot image severely depend on the
> > > > > guest memory write rate, sometimes resulting in unacceptably long snapshot
> > > > > creation time and huge image size.
> > > > > 
> > > > > This series propose a way to solve the aforementioned problems. This is done
> > > > > by using different RAM migration mechanism based on UFFD write protection
> > > > > management introduced in v5.7 kernel. The migration strategy is to 'freeze'
> > > > > guest RAM content using write-protection and iteratively release protection
> > > > > for memory ranges that have already been saved to the migration stream.
> > > > > At the same time we read in pending UFFD write fault events and save those
> > > > > pages out-of-order with higher priority.
> > > > 
> > > > This sounds amazing! Based on your description I assume that the final
> > > > memory image contains state image from the beginning of the migration.
> > > > 
> > > > This would make it desirable for the 'migrate' qmp command to be used as
> > > > part of a 'transaction' qmp command so that we can do an instant disk
> > > > and memory snapshot without any extraneous pausing of the VM.
> > > > 
> > > > I'll have a look at using this mechanism in libvirt natively.
> > > > 
> > > 
> > > Correct, the final image contains state at the beginning of migration.
> > > 
> > > So far, if I'm not missing something about libvirt, for external snapshot
> > > creation it performs a sequence like that:
> > > migrate(fd)->transaction(blockdev-snapshot-all)->cont.
> > > 
> > > So, in case 'background-snapshot' capability is enabled, the sequence would
> > > change to:
> > > stop->transaction(blockdev-snapshot-all)->migrate(fd).
> > > With background snapshot migration it will finish with VM running so there's
> > > not need to 'cont' here.
> > 
> > Yes, that's correct.
> > 
> > The reason I've suggested that 'migrate' being part of a 'transaction'
> > is that it would remove the need to stop it for the disk snapshot part.
> > 
> 
> Hmm, I believe stopping VM for a short time is unavoidable to keep saved
> device state consistent with blockdev snapshot base.. May be I've missed
> something but it seems logical.

I guess PeterK meant an explicit 'stop' VM command, rather than the stop in
this series, which should be undetectable from the guest.  It would definitely
be great if PeterK could quickly follow up with this on libvirt. :)

One overall comment (before going into details): please remember to start
using "migration: " prefixes in patch subjects from the next version.

Thanks,
Andrey Gruzdev Nov. 27, 2020, 5:19 p.m. UTC | #6
On 27.11.2020 18:45, Peter Xu wrote:
> On Fri, Nov 27, 2020 at 01:00:48PM +0300, Andrey Gruzdev wrote:
>> On 27.11.2020 12:49, Peter Krempa wrote:
>>> On Fri, Nov 27, 2020 at 11:21:39 +0300, Andrey Gruzdev wrote:
>>>> On 26.11.2020 18:47, Peter Krempa wrote:
>>>>> On Thu, Nov 26, 2020 at 18:17:28 +0300, Andrey Gruzdev via wrote:
>>>>>> This patch series is a kind of 'rethinking' of Denis Plotnikov's ideas he's
>>>>>> implemented in his series '[PATCH v0 0/4] migration: add background snapshot'.
>>>>>>
>>>>>> Currently the only way to make (external) live VM snapshot is using existing
>>>>>> dirty page logging migration mechanism. The main problem is that it tends to
>>>>>> produce a lot of page duplicates while running VM goes on updating already
>>>>>> saved pages. That leads to the fact that vmstate image size is commonly several
>>>>>> times bigger then non-zero part of virtual machine's RSS. Time required to
>>>>>> converge RAM migration and the size of snapshot image severely depend on the
>>>>>> guest memory write rate, sometimes resulting in unacceptably long snapshot
>>>>>> creation time and huge image size.
>>>>>>
>>>>>> This series propose a way to solve the aforementioned problems. This is done
>>>>>> by using different RAM migration mechanism based on UFFD write protection
>>>>>> management introduced in v5.7 kernel. The migration strategy is to 'freeze'
>>>>>> guest RAM content using write-protection and iteratively release protection
>>>>>> for memory ranges that have already been saved to the migration stream.
>>>>>> At the same time we read in pending UFFD write fault events and save those
>>>>>> pages out-of-order with higher priority.
>>>>>
>>>>> This sounds amazing! Based on your description I assume that the final
>>>>> memory image contains state image from the beginning of the migration.
>>>>>
>>>>> This would make it desirable for the 'migrate' qmp command to be used as
>>>>> part of a 'transaction' qmp command so that we can do an instant disk
>>>>> and memory snapshot without any extraneous pausing of the VM.
>>>>>
>>>>> I'll have a look at using this mechanism in libvirt natively.
>>>>>
>>>>
>>>> Correct, the final image contains state at the beginning of migration.
>>>>
>>>> So far, if I'm not missing something about libvirt, for external snapshot
>>>> creation it performs a sequence like that:
>>>> migrate(fd)->transaction(blockdev-snapshot-all)->cont.
>>>>
>>>> So, in case 'background-snapshot' capability is enabled, the sequence would
>>>> change to:
>>>> stop->transaction(blockdev-snapshot-all)->migrate(fd).
>>>> With background snapshot migration it will finish with VM running so there's
>>>> not need to 'cont' here.
>>>
>>> Yes, that's correct.
>>>
>>> The reason I've suggested that 'migrate' being part of a 'transaction'
>>> is that it would remove the need to stop it for the disk snapshot part.
>>>
>>
>> Hmm, I believe stopping VM for a short time is unavoidable to keep saved
>> device state consistent with blockdev snapshot base.. May be I've missed
>> something but it seems logical.
> 
> I guess PeterK meant an explicit stop vm command, rather than the stop in this
> series that should be undetectable from the guest.  It would be definitely
> great if PeterK could quickly follow up with this on libvirt. :)
> 

Sorry if I misunderstood; sure, it would be great if PeterK is really
interested in following up on this feature!

> One overall comment (before go into details) is that please remember to start
> using "migration: " prefixes in patch subjects from the next version.
> 

Yes, I really missed it everywhere :)

> Thanks,
>
Peter Xu Nov. 27, 2020, 10:04 p.m. UTC | #7
On Thu, Nov 26, 2020 at 06:17:28PM +0300, Andrey Gruzdev wrote:
> Changes v3->v4:
> 
> * 1. Renamed migrate capability 'track-writes-ram'->'background-snapshot'.
> * 2. Use array of incompatible caps to replace bulky 'if' constructs.
> * 3. Moved UFFD low-level code to the separate module ('util/userfaultfd.c').
> * 4. Always do UFFD wr-unprotect on cleanup; just closing file descriptor
> *    won't cleanup PTEs anyhow, it will release registration ranges, wait 
> *    queues etc. but won't cleanup process MM context on MMU level.
> * 5. Allow to enable 'background-snapshot' capability on Linux-only hosts.
> * 6. Put UFFD code usage under '#ifdef CONFIG_LINUX' prerequisite.
> * 7. Removed 'wt_' from RAMState struct.
> * 8. Refactored ram_find_and_save_block() to make more clean - poll UFFD
> *    wr-fault events in get_queued_page(), use ram_save_host_page_pre(),
> *    ram_save_host_page_post() notifiers around ram_save_host_page()
> *    instead of bulky inline write-unprotect code.

One thing I mentioned previously, but which seems to have been lost, is that
we don't need dirty tracking for the live snapshot.

A few pointers for reference:

  memory_global_dirty_log_start()
  migration_bitmap_sync_precopy()
  memory_region_clear_dirty_bitmap()
  ...

These should not be needed.  But this can also be done on top.

Thanks,
Andrey Gruzdev Nov. 30, 2020, 8:07 a.m. UTC | #8
On 28.11.2020 01:04, Peter Xu wrote:
> On Thu, Nov 26, 2020 at 06:17:28PM +0300, Andrey Gruzdev wrote:
>> Changes v3->v4:
>>
>> * 1. Renamed migrate capability 'track-writes-ram'->'background-snapshot'.
>> * 2. Use array of incompatible caps to replace bulky 'if' constructs.
>> * 3. Moved UFFD low-level code to the separate module ('util/userfaultfd.c').
>> * 4. Always do UFFD wr-unprotect on cleanup; just closing file descriptor
>> *    won't cleanup PTEs anyhow, it will release registration ranges, wait
>> *    queues etc. but won't cleanup process MM context on MMU level.
>> * 5. Allow to enable 'background-snapshot' capability on Linux-only hosts.
>> * 6. Put UFFD code usage under '#ifdef CONFIG_LINUX' prerequisite.
>> * 7. Removed 'wt_' from RAMState struct.
>> * 8. Refactored ram_find_and_save_block() to make more clean - poll UFFD
>> *    wr-fault events in get_queued_page(), use ram_save_host_page_pre(),
>> *    ram_save_host_page_post() notifiers around ram_save_host_page()
>> *    instead of bulky inline write-unprotect code.
> 
> One thing I mentioned previously but it seems still got lost is that we don't
> need dirty tracking for live snapshot.
> 
> A few pointers for reference:
> 
>    memory_global_dirty_log_start()
>    migration_bitmap_sync_precopy()
>    memory_region_clear_dirty_bitmap()
>    ...
> 
> These should not be needed.  But this can also be done on top.
> 
> Thanks,
> 

Yes, all of this is really not needed. At least it seems that enabling dirty
page logging in KVM doesn't add too much overhead, and
migration_bitmap_sync_precopy() is called only once. Let's do it on top.
Peter Krempa Dec. 1, 2020, 7:08 a.m. UTC | #9
On Thu, Nov 26, 2020 at 18:17:28 +0300, Andrey Gruzdev via wrote:
> This patch series is a kind of 'rethinking' of Denis Plotnikov's ideas he's
> implemented in his series '[PATCH v0 0/4] migration: add background snapshot'.

Hi,

I gave this a try when attempting to implement the libvirt code for this
feature. I ran into a problem of the migration failing right away. The
VM's CPUs were running at that point.

QEMU logged the following to stdout/err:

2020-12-01T06:50:42.334062Z qemu-system-x86_64: uffd_register_memory() failed: start=7f2007fff000 len=33554432000 mode=2 errno=22
2020-12-01T06:50:42.334072Z qemu-system-x86_64: ram_write_tracking_start() failed: restoring initial memory state
2020-12-01T06:50:42.334074Z qemu-system-x86_64: uffd_protect_memory() failed: start=7f2007fff000 len=33554432000 mode=0 errno=2
2020-12-01T06:50:42.334076Z qemu-system-x86_64: uffd_unregister_memory() failed: start=7f2007fff000 len=33554432000 errno=22


The migration was started by the following QMP conversation:


QEMU_MONITOR_IO_WRITE: mon=0x7fff9c20c610 buf={"execute":"migrate-set-capabilities","arguments":{"capabilities":[{"capability":"xbzrle","state":false},{"capability":"auto-converge","state":false},{"capability":"rdma-pin-all","state":false},{"capability":"postcopy-ram","state":false},{"capability":"compress","state":false},{"capability":"pause-before-switchover","state":false},{"capability":"late-block-activate","state":false},{"capability":"multifd","state":false},{"capability":"background-snapshot","state":true}]},"id":"libvirt-14"}

QEMU_MONITOR_RECV_REPLY: mon=0x7fff9c20c610 reply={"return": {}, "id": "libvirt-14"}

QEMU_MONITOR_IO_WRITE: mon=0x7fff9c20c610 buf={"execute":"migrate-set-parameters","arguments":{"max-bandwidth":9223372036853727232},"id":"libvirt-15"}

QEMU_MONITOR_RECV_REPLY: mon=0x7fff9c20c610 reply={"return": {}, "id": "libvirt-15"}

QEMU_MONITOR_IO_WRITE: mon=0x7fff9c20c610 buf={"execute":"getfd","arguments":{"fdname":"migrate"},"id":"libvirt-16"}

QEMU_MONITOR_IO_SEND_FD: mon=0x7fff9c20c610 fd=44 ret=72 errno=0
QEMU_MONITOR_RECV_REPLY: mon=0x7fff9c20c610 reply={"return": {}, "id": "libvirt-16"}

QEMU_MONITOR_IO_WRITE: mon=0x7fff9c20c610 buf={"execute":"migrate","arguments":{"detach":true,"blk":false,"inc":false,"uri":"fd:migrate"},"id":"libvirt-17"}

QEMU_MONITOR_RECV_EVENT: mon=0x7fff9c20c610 event={"timestamp": {"seconds": 1606805733, "microseconds": 962424}, "event": "MIGRATION", "data": {"status": "setup"}}
QEMU_MONITOR_RECV_REPLY: mon=0x7fff9c20c610 reply={"return": {}, "id": "libvirt-17"}
QEMU_MONITOR_RECV_EVENT: mon=0x7fff9c20c610 event={"timestamp": {"seconds": 1606805733, "microseconds": 966306}, "event": "MIGRATION_PASS", "data": {"pass": 1}}
QEMU_MONITOR_RECV_EVENT: mon=0x7fff9c20c610 event={"timestamp": {"seconds": 1606805733, "microseconds": 966355}, "event": "MIGRATION", "data": {"status": "active"}}
QEMU_MONITOR_RECV_EVENT: mon=0x7fff9c20c610 event={"timestamp": {"seconds": 1606805733, "microseconds": 966488}, "event": "STOP"}
QEMU_MONITOR_RECV_EVENT: mon=0x7fff9c20c610 event={"timestamp": {"seconds": 1606805733, "microseconds": 970326}, "event": "MIGRATION", "data": {"status": "failed"}}

QEMU_MONITOR_IO_WRITE: mon=0x7fff9c20c610 buf={"execute":"query-migrate","id":"libvirt-18"}

QEMU_MONITOR_RECV_REPLY: mon=0x7fff9c20c610 reply={"return": {"status": "failed"}, "id": "libvirt-18"}
qemuMigrationJobCheckStatus:1685 : operation failed: snapshot job: unexpectedly failed

 $ uname -r
5.8.18-300.fc33.x86_64


created by libvirt with the following patchset applied:

https://gitlab.com/pipo.sk/libvirt/-/commits/background-snapshot

git fetch https://gitlab.com/pipo.sk/libvirt.git background-snapshot

Start the snapshot via:

virsh snapshot-create-as --memspec /tmp/snap.mem --diskspec sdb,snapshot=no --diskspec sda,snapshot=no --no-metadata upstream

Note you can omit --diskspec if you have a diskless VM.

The patches are VERY much a work in progress, as I need to figure out the
proper sequencing to ensure a consistent snapshot.

Note that in cases when qemu can't guarantee that the
background_snapshot feature will work, it should not advertise it. We
need a way to check whether it's possible to use it, so that we can replace
the existing --live flag with it rather than adding a new one and
shifting the problem of checking whether the feature works onto the user.
Andrey Gruzdev Dec. 1, 2020, 8:42 a.m. UTC | #10
On 01.12.2020 10:08, Peter Krempa wrote:
> On Thu, Nov 26, 2020 at 18:17:28 +0300, Andrey Gruzdev via wrote:
>> This patch series is a kind of 'rethinking' of Denis Plotnikov's ideas he's
>> implemented in his series '[PATCH v0 0/4] migration: add background snapshot'.
> 
> Hi,
> 
> I gave this a try when attempting to implement the libvirt code for this
> feature. I've ran into a problem of migration failing right away. The
> VM's cpus were running at that point.
> 
> QEMU logged the following to stdout/err:
> 
> 2020-12-01T06:50:42.334062Z qemu-system-x86_64: uffd_register_memory() failed: start=7f2007fff000 len=33554432000 mode=2 errno=22
> 2020-12-01T06:50:42.334072Z qemu-system-x86_64: ram_write_tracking_start() failed: restoring initial memory state
> 2020-12-01T06:50:42.334074Z qemu-system-x86_64: uffd_protect_memory() failed: start=7f2007fff000 len=33554432000 mode=0 errno=2
> 2020-12-01T06:50:42.334076Z qemu-system-x86_64: uffd_unregister_memory() failed: start=7f2007fff000 len=33554432000 errno=22
> 
> 
> The migration was started by the following QMP conversation:
> 
> 
> QEMU_MONITOR_IO_WRITE: mon=0x7fff9c20c610 buf={"execute":"migrate-set-capabilities","arguments":{"capabilities":[{"capability":"xbzrle","state":false},{"capability":"auto-converge","state":false},{"capability":"rdma-pin-all","state":false},{"capability":"postcopy-ram","state":false},{"capability":"compress","state":false},{"capability":"pause-before-switchover","state":false},{"capability":"late-block-activate","state":false},{"capability":"multifd","state":false},{"capability":"background-snapshot","state":true}]},"id":"libvirt-14"}
> 
> QEMU_MONITOR_RECV_REPLY: mon=0x7fff9c20c610 reply={"return": {}, "id": "libvirt-14"}
> 
> QEMU_MONITOR_IO_WRITE: mon=0x7fff9c20c610 buf={"execute":"migrate-set-parameters","arguments":{"max-bandwidth":9223372036853727232},"id":"libvirt-15"}
> 
> QEMU_MONITOR_RECV_REPLY: mon=0x7fff9c20c610 reply={"return": {}, "id": "libvirt-15"}
> 
> QEMU_MONITOR_IO_WRITE: mon=0x7fff9c20c610 buf={"execute":"getfd","arguments":{"fdname":"migrate"},"id":"libvirt-16"}
> 
> QEMU_MONITOR_IO_SEND_FD: mon=0x7fff9c20c610 fd=44 ret=72 errno=0
> QEMU_MONITOR_RECV_REPLY: mon=0x7fff9c20c610 reply={"return": {}, "id": "libvirt-16"}
> 
> QEMU_MONITOR_IO_WRITE: mon=0x7fff9c20c610 buf={"execute":"migrate","arguments":{"detach":true,"blk":false,"inc":false,"uri":"fd:migrate"},"id":"libvirt-17"}
> 
> QEMU_MONITOR_RECV_EVENT: mon=0x7fff9c20c610 event={"timestamp": {"seconds": 1606805733, "microseconds": 962424}, "event": "MIGRATION", "data": {"status": "setup"}}
> QEMU_MONITOR_RECV_REPLY: mon=0x7fff9c20c610 reply={"return": {}, "id": "libvirt-17"}
> QEMU_MONITOR_RECV_EVENT: mon=0x7fff9c20c610 event={"timestamp": {"seconds": 1606805733, "microseconds": 966306}, "event": "MIGRATION_PASS", "data": {"pass": 1}}
> QEMU_MONITOR_RECV_EVENT: mon=0x7fff9c20c610 event={"timestamp": {"seconds": 1606805733, "microseconds": 966355}, "event": "MIGRATION", "data": {"status": "active"}}
> QEMU_MONITOR_RECV_EVENT: mon=0x7fff9c20c610 event={"timestamp": {"seconds": 1606805733, "microseconds": 966488}, "event": "STOP"}
> QEMU_MONITOR_RECV_EVENT: mon=0x7fff9c20c610 event={"timestamp": {"seconds": 1606805733, "microseconds": 970326}, "event": "MIGRATION", "data": {"status": "failed"}}
> 
> QEMU_MONITOR_IO_WRITE: mon=0x7fff9c20c610 buf={"execute":"query-migrate","id":"libvirt-18"}
> 
> QEMU_MONITOR_RECV_REPLY: mon=0x7fff9c20c610 reply={"return": {"status": "failed"}, "id": "libvirt-18"}
> qemuMigrationJobCheckStatus:1685 : operation failed: snapshot job: unexpectedly failed
> 
>   $ uname -r
> 5.8.18-300.fc33.x86_64
> 
> 
> created by libvirt with the following patchset applied:
> 
> https://gitlab.com/pipo.sk/libvirt/-/commits/background-snapshot
> 
> git fetch https://gitlab.com/pipo.sk/libvirt.git background-snapshot
> 
> Start the snapshot via:
> 
> virsh snapshot-create-as --memspec /tmp/snap.mem --diskspec sdb,snapshot=no --diskspec sda,snapshot=no --no-metadata upstream
> 
> Note you can omit --diskspec if you have a diskless VM.
> 
> The patches are VERY work in progress as I need to figure out the proper
> sequencing to ensure a consistent snapshot.
> 
> Note that in cases when qemu can't guarantee that the
> background_snapshot feature will work it should not advertise it. We
> need a way to check whether it's possible to use it, so we can replace
> the existing --live flag with it rather than adding a new one and
> shifting the problem of checking whether the feature works to the user.
> 

Hi,

Maybe you are using hugetlbfs as the memory backend?

I totally agree that we need to somehow check that the kernel and the VM memory
backend support the feature before one can enable the capability.
Need to think about that.

Thanks,
Peter Krempa Dec. 1, 2020, 10:53 a.m. UTC | #11
On Tue, Dec 01, 2020 at 11:42:18 +0300, Andrey Gruzdev wrote:
> On 01.12.2020 10:08, Peter Krempa wrote:
> > On Thu, Nov 26, 2020 at 18:17:28 +0300, Andrey Gruzdev via wrote:
> > > This patch series is a kind of 'rethinking' of Denis Plotnikov's ideas he's

[...]

> > Note that in cases when qemu can't guarantee that the
> > background_snapshot feature will work it should not advertise it. We
> > need a way to check whether it's possible to use it, so we can replace
> > the existing --live flag with it rather than adding a new one and
> > shifting the problem of checking whether the feature works to the user.
> > 
> 
> Hi,
> 
> May be you are using hugetlbfs as memory backend?

Not exactly hugepages, but I had:

  <memoryBacking>
    <access mode='shared'/>
  </memoryBacking>

which resulted in the following command line to instantiate memory:

-object memory-backend-file,id=pc.ram,mem-path=/var/lib/libvirt/qemu/ram/6-upstream-bj/pc.ram,share=yes,size=33554432000,host-nodes=0,policy=bind \

When I've removed it I got:

-object memory-backend-ram,id=pc.ram,size=33554432000,host-nodes=0,policy=bind \

And the migration didn't fail in my quick test. I'll have a more
detailed look later, thanks for the pointer.

> I totally agree that we need somehow check that kernel and VM memory backend
> support the feature before one can enable the capability.
> Need to think about that..

Definitely. Also note that memory backed by memory-backend-file will be
more and more common, for cases such as virtiofs DAX sharing and
similar.
Andrey Gruzdev Dec. 1, 2020, 11:24 a.m. UTC | #12
On 01.12.2020 13:53, Peter Krempa wrote:
> On Tue, Dec 01, 2020 at 11:42:18 +0300, Andrey Gruzdev wrote:
>> On 01.12.2020 10:08, Peter Krempa wrote:
>>> On Thu, Nov 26, 2020 at 18:17:28 +0300, Andrey Gruzdev via wrote:
>>>> This patch series is a kind of 'rethinking' of Denis Plotnikov's ideas he's
> 
> [...]
> 
>>> Note that in cases when qemu can't guarantee that the
>>> background_snapshot feature will work it should not advertise it. We
>>> need a way to check whether it's possible to use it, so we can replace
>>> the existing --live flag with it rather than adding a new one and
>>> shifting the problem of checking whether the feature works to the user.
>>>
>>
>> Hi,
>>
>> May be you are using hugetlbfs as memory backend?
> 
> Not exactly hugepages, but I had:
> 
>    <memoryBacking>
>      <access mode='shared'/>
>    </memoryBacking>
> 
> which resulted into the following commandline to instantiate memory:
> 
> -object memory-backend-file,id=pc.ram,mem-path=/var/lib/libvirt/qemu/ram/6-upstream-bj/pc.ram,share=yes,size=33554432000,host-nodes=0,policy=bind \
> 
> When I've removed it I got:
> 
> -object memory-backend-ram,id=pc.ram,size=33554432000,host-nodes=0,policy=bind \
> 
> And the migration didn't fail in my quick test. I'll have a more
> detailed look later, thanks for the pointer.
> 

Yep, it seems that current userfaultfd supports hugetlbfs and shared memory
for missing pages, but not for write-protection.

>> I totally agree that we need somehow check that kernel and VM memory backend
>> support the feature before one can enable the capability.
>> Need to think about that..
> 
> Definitely. Also note that memory backed by memory-backend-file will be
> more and more common, for cases such as virtiofs DAX sharing and
> similar.
> 

I see... That needs support from the kernel side; so far
'background-snapshot' is incompatible with memory-backend-file sharing.
Dr. David Alan Gilbert Dec. 1, 2020, 6:40 p.m. UTC | #13
* Andrey Gruzdev (andrey.gruzdev@virtuozzo.com) wrote:
> On 01.12.2020 13:53, Peter Krempa wrote:
> > On Tue, Dec 01, 2020 at 11:42:18 +0300, Andrey Gruzdev wrote:
> > > On 01.12.2020 10:08, Peter Krempa wrote:
> > > > On Thu, Nov 26, 2020 at 18:17:28 +0300, Andrey Gruzdev via wrote:
> > > > > This patch series is a kind of 'rethinking' of Denis Plotnikov's ideas he's
> > 
> > [...]
> > 
> > > > Note that in cases when qemu can't guarantee that the
> > > > background_snapshot feature will work it should not advertise it. We
> > > > need a way to check whether it's possible to use it, so we can replace
> > > > the existing --live flag with it rather than adding a new one and
> > > > shifting the problem of checking whether the feature works to the user.
> > > > 
> > > 
> > > Hi,
> > > 
> > > May be you are using hugetlbfs as memory backend?
> > 
> > Not exactly hugepages, but I had:
> > 
> >    <memoryBacking>
> >      <access mode='shared'/>
> >    </memoryBacking>
> > 
> > which resulted into the following commandline to instantiate memory:
> > 
> > -object memory-backend-file,id=pc.ram,mem-path=/var/lib/libvirt/qemu/ram/6-upstream-bj/pc.ram,share=yes,size=33554432000,host-nodes=0,policy=bind \
> > 
> > When I've removed it I got:
> > 
> > -object memory-backend-ram,id=pc.ram,size=33554432000,host-nodes=0,policy=bind \
> > 
> > And the migration didn't fail in my quick test. I'll have a more
> > detailed look later, thanks for the pointer.
> > 
> 
> Yep, seems that current userfaultfd supports hugetlbfs and shared memory for
> missing pages but not for wr-protected..

For hugepages, you'd need kernel support - but you'd also want to make
sure you write the whole hugepage at once.

For shared memory, there's a harder question to ask: what happens if RAM is
written by the other process? For postcopy, we get the other process
to send us a userfaultfd that they have registered with their VM.

Dave

> > > I totally agree that we need somehow check that kernel and VM memory backend
> > > support the feature before one can enable the capability.
> > > Need to think about that..
> > 
> > Definitely. Also note that memory backed by memory-backend-file will be
> > more and more common, for cases such as virtiofs DAX sharing and
> > similar.
> > 
> 
> I see.. That needs support from kernel side, so far 'background-snapshots'
> are incompatible with memory-backend-file sharing.
> 
> -- 
> Andrey Gruzdev, Principal Engineer
> Virtuozzo GmbH  +7-903-247-6397
>                 virtuzzo.com
>
Peter Xu Dec. 1, 2020, 6:54 p.m. UTC | #14
On Tue, Dec 01, 2020 at 02:24:12PM +0300, Andrey Gruzdev wrote:
> On 01.12.2020 13:53, Peter Krempa wrote:
> > On Tue, Dec 01, 2020 at 11:42:18 +0300, Andrey Gruzdev wrote:
> > > On 01.12.2020 10:08, Peter Krempa wrote:
> > > > On Thu, Nov 26, 2020 at 18:17:28 +0300, Andrey Gruzdev via wrote:
> > > > > This patch series is a kind of 'rethinking' of Denis Plotnikov's ideas he's
> > 
> > [...]
> > 
> > > > Note that in cases when qemu can't guarantee that the
> > > > background_snapshot feature will work it should not advertise it. We
> > > > need a way to check whether it's possible to use it, so we can replace
> > > > the existing --live flag with it rather than adding a new one and
> > > > shifting the problem of checking whether the feature works to the user.

Would it be fine if libvirt just tried the new way first anyway?  If it
fails, it'll fail right away on any unsupported memory types, so
logically the libvirt user may not even notice we've retried.

Previously I thought that was enough, because so far the kernel does not have a
specific flag showing whether such a type of memory is supported.  But I don't
know whether it would be non-trivial for libvirt to retry like that.

Another solution is to let qemu test the uffd ioctls right after QEMU memory
setup, so we know whether the background/live snapshot is supported or not with
the current memory backends.  We would need to try this for every ramblock
because I think we can have different types across all the qemu ramblocks.

> > > > 
> > > 
> > > Hi,
> > > 
> > > May be you are using hugetlbfs as memory backend?
> > 
> > Not exactly hugepages, but I had:
> > 
> >    <memoryBacking>
> >      <access mode='shared'/>
> >    </memoryBacking>
> > 
> > which resulted into the following commandline to instantiate memory:
> > 
> > -object memory-backend-file,id=pc.ram,mem-path=/var/lib/libvirt/qemu/ram/6-upstream-bj/pc.ram,share=yes,size=33554432000,host-nodes=0,policy=bind \
> > 
> > When I've removed it I got:
> > 
> > -object memory-backend-ram,id=pc.ram,size=33554432000,host-nodes=0,policy=bind \
> > 
> > And the migration didn't fail in my quick test. I'll have a more
> > detailed look later, thanks for the pointer.
> > 
> 
> Yep, seems that current userfaultfd supports hugetlbfs and shared memory for
> missing pages but not for wr-protected..

Correct.  Btw, I've been working on both of them recently.  I have a testing
kernel branch, but I don't think it should affect our qemu work, though, since
qemu should do the same regardless of the memory type.  We can just test with
anonymous memory, and as long as that works, it should work perfectly on all
the rest of the backends (maybe even for other file-backed memory, more below).

> 
> > > I totally agree that we need somehow check that kernel and VM memory backend
> > > support the feature before one can enable the capability.
> > > Need to think about that..
> > 
> > Definitely. Also note that memory backed by memory-backend-file will be
> > more and more common, for cases such as virtiofs DAX sharing and
> > similar.
> > 
> 
> I see.. That needs support from kernel side, so far 'background-snapshots'
> are incompatible with memory-backend-file sharing.

Yes.  So as mentioned, shmem/hugetlbfs support should be WIP, but I haven't
thought about the rest yet.  Maybe... it's not hard to add uffd-wp for most of
the file-backed memory?  Since afaict the kernel handles wr-protect in a quite
straightforward way (do_wp_page() for whatever backend), and uffd-wp can be the
first to trap all of them.  I'm not sure whether Andrea has thought about that,
or about how to spread the missing usage to more types of backends (maybe
missing is more special in that it needs to consider page caches).  So I'm
copying Andrea too just in case there's further input.

Thanks,
Peter Xu Dec. 1, 2020, 7:22 p.m. UTC | #15
On Tue, Dec 01, 2020 at 06:40:55PM +0000, Dr. David Alan Gilbert wrote:
> > Yep, seems that current userfaultfd supports hugetlbfs and shared memory for
> > missing pages but not for wr-protected..
> 
> For hugepages, you'd need kernel support - but also you'd want to make
> sure you write the whole hugepage at once.

Or we can do similar things by splitting the huge pages, just like when we
migrate.

I must have overlooked these facts when I replied previously - we do need the
same logic, but also special care for these special memory types.

> 
> For shared, there's a harder problem to ask; what happens if RAM is
> written by the other process - for postcopy, we get the other process
> to send us a userfaultfd that they have registered with their VM.

Good point... so we would need similar things too.

Looks like we'd better explicitly disable shmem/hugetlbfs for now in qemu
background snapshots before we have prepared these utilities, just in case this
gets run on some "future" kernel and accidentally gets enabled, which could
leave the snapshot files corrupted.

Is shmem used a lot in libvirt, or is it even a default configuration?
Dr. David Alan Gilbert Dec. 1, 2020, 8 p.m. UTC | #16
* Peter Xu (peterx@redhat.com) wrote:
> On Tue, Dec 01, 2020 at 02:24:12PM +0300, Andrey Gruzdev wrote:
> > On 01.12.2020 13:53, Peter Krempa wrote:
> > > On Tue, Dec 01, 2020 at 11:42:18 +0300, Andrey Gruzdev wrote:
> > > > On 01.12.2020 10:08, Peter Krempa wrote:
> > > > > On Thu, Nov 26, 2020 at 18:17:28 +0300, Andrey Gruzdev via wrote:
> > > > > > This patch series is a kind of 'rethinking' of Denis Plotnikov's ideas he's
> > > 
> > > [...]
> > > 
> > > > > Note that in cases when qemu can't guarantee that the
> > > > > background_snapshot feature will work it should not advertise it. We
> > > > > need a way to check whether it's possible to use it, so we can replace
> > > > > the existing --live flag with it rather than adding a new one and
> > > > > shifting the problem of checking whether the feature works to the user.
> 
> Would it be fine if libvirt just try the new way first anyways?  Since if it
> will fail, it'll fail right away on any unsupported memory types, then
> logically the libvirt user may not even notice we've retried.
> 
> Previously I thought it was enough, because so far the kernel does not have a
> specific flag showing whether such type of memory is supported.  But I don't
> know whether it would be non-trivial for libvirt to retry like that.
> 
> Another solution is to let qemu test the uffd ioctls right after QEMU memory
> setup, so we know whether background/live snapshot is supported or not with
> current memory backends.  We should need to try this for every ramblock because
> I think we can have different types across all the qemu ramblocks.

I don't think we actually do that for postcopy; we do some checks like
checking if we have any hugepages, and if so, checking for their flags.
But note that we do tie it into migrate_caps_check to fail if you try
to set the capability.

> 
> > > > > 
> > > > 
> > > > Hi,
> > > > 
> > > > May be you are using hugetlbfs as memory backend?
> > > 
> > > Not exactly hugepages, but I had:
> > > 
> > >    <memoryBacking>
> > >      <access mode='shared'/>
> > >    </memoryBacking>
> > > 
> > > which resulted into the following commandline to instantiate memory:
> > > 
> > > -object memory-backend-file,id=pc.ram,mem-path=/var/lib/libvirt/qemu/ram/6-upstream-bj/pc.ram,share=yes,size=33554432000,host-nodes=0,policy=bind \
> > > 
> > > When I've removed it I got:
> > > 
> > > -object memory-backend-ram,id=pc.ram,size=33554432000,host-nodes=0,policy=bind \
> > > 
> > > And the migration didn't fail in my quick test. I'll have a more
> > > detailed look later, thanks for the pointer.
> > > 
> > 
> > Yep, seems that current userfaultfd supports hugetlbfs and shared memory for
> > missing pages but not for wr-protected..
> 
> Correct.  Btw, I'm working on both of them recently.  I have a testing kernel
> branch, but I don't think it should affect our qemu work, though, since qemu
> should do the same irrelevant of the memory type.  We can just test with
> anonymous memories, and as long as it works, it should work perfectly on all
> the rest of backends (maybe even for other file-backed memory, more below).
> 
> > 
> > > > I totally agree that we need somehow check that kernel and VM memory backend
> > > > support the feature before one can enable the capability.
> > > > Need to think about that..
> > > 
> > > Definitely. Also note that memory backed by memory-backend-file will be
> > > more and more common, for cases such as virtiofs DAX sharing and
> > > similar.
> > > 
> > 
> > I see.. That needs support from kernel side, so far 'background-snapshots'
> > are incompatible with memory-backend-file sharing.
> 
> Yes.  So as mentioned, shmem/hugetlbfs should be WIP, but I haven't thought
> about the rest yet.  Maybe... it's not hard to add uffd-wp for most of the
> file-backed memory?  Since afaict the kernel handles wr-protect in a quite
> straightforward way (do_wp_page() for whatever backend), and uffd-wp can be the
> first to trap all of them.  I'm not sure whether Andrea has thought about that
> or even on how to spread the missing usage to more types of backends (maybe
> missing is more special in that it needs to consider page caches).  So I'm
> copying Andrea too just in case there's further input.

Some would be good; we've got requests for it to work on mmapped pmem
devices.   You do have to be a little careful about semantics though;
I'm not sure it's that big of a problem for the wp case, but for the
absent case you need to worry about finding an equivalent of madvise or
fallocate that can punch a hole.

Dave

> Thanks,
> 
> -- 
> Peter Xu
>
Dr. David Alan Gilbert Dec. 1, 2020, 8:01 p.m. UTC | #17
* Peter Xu (peterx@redhat.com) wrote:
> On Tue, Dec 01, 2020 at 06:40:55PM +0000, Dr. David Alan Gilbert wrote:
> > > Yep, seems that current userfaultfd supports hugetlbfs and shared memory for
> > > missing pages but not for wr-protected..
> > 
> > For hugepages, you'd need kernel support - but also you'd want to make
> > sure you write the whole hugepage at once.
> 
> Or we can do similar things by splitting the huge pages just like when we
> migrate.
> 
> I should have overlooked these facts when I replied previusly - we do need the
> same logic, but also special care on these special memory types.
> 
> > 
> > For shared, there's a harder problem to ask; what happens if RAM is
> > written by the other process - for postcopy, we get the other process
> > to send us a userfaultfd that they have registered with their VM.
> 
> Good point... so we should need similar things too.
> 
> Looks like we'd better explicitly disable shmem/hugetlbfs for now from qemu
> background snapshots before we have prepared these utilities, just in case it
> got run on some "future" kernels and accidentally got enabled, so the snapshot
> files could be corrupted ones.
> 
> Is shmem used a lot in libvirt, or is it even a default configuration?

No, but it's used with vhost-user applications, like DPDK.

Dave

> -- 
> Peter Xu
>
Andrey Gruzdev Dec. 1, 2020, 8:11 p.m. UTC | #18
On 01.12.2020 21:40, Dr. David Alan Gilbert wrote:
> * Andrey Gruzdev (andrey.gruzdev@virtuozzo.com) wrote:
>> On 01.12.2020 13:53, Peter Krempa wrote:
>>> On Tue, Dec 01, 2020 at 11:42:18 +0300, Andrey Gruzdev wrote:
>>>> On 01.12.2020 10:08, Peter Krempa wrote:
>>>>> On Thu, Nov 26, 2020 at 18:17:28 +0300, Andrey Gruzdev via wrote:
>>>>>> This patch series is a kind of 'rethinking' of Denis Plotnikov's ideas he's
>>>
>>> [...]
>>>
>>>>> Note that in cases when qemu can't guarantee that the
>>>>> background_snapshot feature will work it should not advertise it. We
>>>>> need a way to check whether it's possible to use it, so we can replace
>>>>> the existing --live flag with it rather than adding a new one and
>>>>> shifting the problem of checking whether the feature works to the user.
>>>>>
>>>>
>>>> Hi,
>>>>
>>>> May be you are using hugetlbfs as memory backend?
>>>
>>> Not exactly hugepages, but I had:
>>>
>>>     <memoryBacking>
>>>       <access mode='shared'/>
>>>     </memoryBacking>
>>>
>>> which resulted into the following commandline to instantiate memory:
>>>
>>> -object memory-backend-file,id=pc.ram,mem-path=/var/lib/libvirt/qemu/ram/6-upstream-bj/pc.ram,share=yes,size=33554432000,host-nodes=0,policy=bind \
>>>
>>> When I've removed it I got:
>>>
>>> -object memory-backend-ram,id=pc.ram,size=33554432000,host-nodes=0,policy=bind \
>>>
>>> And the migration didn't fail in my quick test. I'll have a more
>>> detailed look later, thanks for the pointer.
>>>
>>
>> Yep, seems that current userfaultfd supports hugetlbfs and shared memory for
>> missing pages but not for wr-protected..
> 
> For hugepages, you'd need kernel support - but also you'd want to make
> sure you write the whole hugepage at once.
> 
> For shared, there's a harder problem to ask; what happens if RAM is
> written by the other process - for postcopy, we get the other process
> to send us a userfaultfd that they have registered with their VM.
> 
> Dave
> 

Yep, may be problematic. But if used for a vhost-user external backend,
it seems it should work.

>>>> I totally agree that we need somehow check that kernel and VM memory backend
>>>> support the feature before one can enable the capability.
>>>> Need to think about that..
>>>
>>> Definitely. Also note that memory backed by memory-backend-file will be
>>> more and more common, for cases such as virtiofs DAX sharing and
>>> similar.
>>>
>>
>> I see.. That needs support from kernel side, so far 'background-snapshots'
>> are incompatible with memory-backend-file sharing.
>>
>> -- 
>> Andrey Gruzdev, Principal Engineer
>> Virtuozzo GmbH  +7-903-247-6397
>>                  virtuzzo.com
>>
Andrey Gruzdev Dec. 1, 2020, 8:26 p.m. UTC | #19
On 01.12.2020 21:54, Peter Xu wrote:
> On Tue, Dec 01, 2020 at 02:24:12PM +0300, Andrey Gruzdev wrote:
>> On 01.12.2020 13:53, Peter Krempa wrote:
>>> On Tue, Dec 01, 2020 at 11:42:18 +0300, Andrey Gruzdev wrote:
>>>> On 01.12.2020 10:08, Peter Krempa wrote:
>>>>> On Thu, Nov 26, 2020 at 18:17:28 +0300, Andrey Gruzdev via wrote:
>>>>>> This patch series is a kind of 'rethinking' of Denis Plotnikov's ideas he's
>>>
>>> [...]
>>>
>>>>> Note that in cases when qemu can't guarantee that the
>>>>> background_snapshot feature will work it should not advertise it. We
>>>>> need a way to check whether it's possible to use it, so we can replace
>>>>> the existing --live flag with it rather than adding a new one and
>>>>> shifting the problem of checking whether the feature works to the user.
> 
> Would it be fine if libvirt just try the new way first anyways?  Since if it
> will fail, it'll fail right away on any unsupported memory types, then
> logically the libvirt user may not even notice we've retried.
> 
> Previously I thought it was enough, because so far the kernel does not have a
> specific flag showing whether such type of memory is supported.  But I don't
> know whether it would be non-trivial for libvirt to retry like that.
> 
> Another solution is to let qemu test the uffd ioctls right after QEMU memory
> setup, so we know whether background/live snapshot is supported or not with
> current memory backends.  We should need to try this for every ramblock because
> I think we can have different types across all the qemu ramblocks.
> 

I've already added some code to perform a raw check of UFFD feature
support on the existing QEMU ram_list. That's fast: just try the REGISTER
ioctl with the required features for each migratable RAMBlock, then clean up.
So we can do it on migrate_set_capability. If some RAMBlock is not
compatible (several memory slots with different backends), we just give
up enabling the capability.
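
For illustration, a minimal sketch of what such a per-RAMBlock probe could
look like (an assumption about the approach described above, not the actual
patch code): it tries UFFDIO_REGISTER in write-protect mode on the block,
checks the ioctls the kernel reports back for the range, and then unregisters.

    /* Hypothetical probe: does this host memory range support uffd-wp? */
    #include <fcntl.h>
    #include <linux/userfaultfd.h>
    #include <stdbool.h>
    #include <stdint.h>
    #include <sys/ioctl.h>
    #include <sys/syscall.h>
    #include <unistd.h>

    static bool uffd_wp_supported(void *block_host, uint64_t block_len)
    {
        bool ok = false;
        int uffd = syscall(__NR_userfaultfd, O_CLOEXEC);
        if (uffd < 0) {
            return false;
        }

        struct uffdio_api api = { .api = UFFD_API,
                                  .features = UFFD_FEATURE_PAGEFAULT_FLAG_WP };
        if (ioctl(uffd, UFFDIO_API, &api)) {
            goto out; /* kernel too old or WP feature not supported */
        }

        /* Registration in WP mode fails (EINVAL) on backends without
         * uffd-wp support, e.g. shmem/hugetlbfs on current kernels. */
        struct uffdio_register reg = {
            .range = { .start = (uintptr_t) block_host, .len = block_len },
            .mode  = UFFDIO_REGISTER_MODE_WP,
        };
        if (ioctl(uffd, UFFDIO_REGISTER, &reg)) {
            goto out;
        }

        /* The kernel reports which ioctls are usable for this range. */
        ok = reg.ioctls & (1ULL << _UFFDIO_WRITEPROTECT);

        struct uffdio_range unreg = { .start = (uintptr_t) block_host,
                                      .len = block_len };
        ioctl(uffd, UFFDIO_UNREGISTER, &unreg);
    out:
        close(uffd);
        return ok;
    }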

>>>>>
>>>>
>>>> Hi,
>>>>
>>>> May be you are using hugetlbfs as memory backend?
>>>
>>> Not exactly hugepages, but I had:
>>>
>>>     <memoryBacking>
>>>       <access mode='shared'/>
>>>     </memoryBacking>
>>>
>>> which resulted into the following commandline to instantiate memory:
>>>
>>> -object memory-backend-file,id=pc.ram,mem-path=/var/lib/libvirt/qemu/ram/6-upstream-bj/pc.ram,share=yes,size=33554432000,host-nodes=0,policy=bind \
>>>
>>> When I've removed it I got:
>>>
>>> -object memory-backend-ram,id=pc.ram,size=33554432000,host-nodes=0,policy=bind \
>>>
>>> And the migration didn't fail in my quick test. I'll have a more
>>> detailed look later, thanks for the pointer.
>>>
>>
>> Yep, seems that current userfaultfd supports hugetlbfs and shared memory for
>> missing pages but not for wr-protected..
> 
> Correct.  Btw, I'm working on both of them recently.  I have a testing kernel
> branch, but I don't think it should affect our qemu work, though, since qemu
> should do the same irrelevant of the memory type.  We can just test with
> anonymous memories, and as long as it works, it should work perfectly on all
> the rest of backends (maybe even for other file-backed memory, more below).
> 

Yes, I know. A nice improvement to the userfaultfd feature, IMO.

>>
>>>> I totally agree that we need somehow check that kernel and VM memory backend
>>>> support the feature before one can enable the capability.
>>>> Need to think about that..
>>>
>>> Definitely. Also note that memory backed by memory-backend-file will be
>>> more and more common, for cases such as virtiofs DAX sharing and
>>> similar.
>>>
>>
>> I see.. That needs support from kernel side, so far 'background-snapshots'
>> are incompatible with memory-backend-file sharing.
> 
> Yes.  So as mentioned, shmem/hugetlbfs should be WIP, but I haven't thought
> about the rest yet.  Maybe... it's not hard to add uffd-wp for most of the
> file-backed memory?  Since afaict the kernel handles wr-protect in a quite
> straightforward way (do_wp_page() for whatever backend), and uffd-wp can be the
> first to trap all of them.  I'm not sure whether Andrea has thought about that
> or even on how to spread the missing usage to more types of backends (maybe
> missing is more special in that it needs to consider page caches).  So I'm
> copying Andrea too just in case there's further input.
> 
> Thanks,
> 

Maybe for file-backed memory it works from the start, but many configurations
come with shared hugetlbfs, which is currently not compatible with background
snapshotting. We can't even support anonymous hugetlbfs at the moment.
Andrey Gruzdev Dec. 1, 2020, 8:29 p.m. UTC | #20
On 01.12.2020 23:01, Dr. David Alan Gilbert wrote:
> * Peter Xu (peterx@redhat.com) wrote:
>> On Tue, Dec 01, 2020 at 06:40:55PM +0000, Dr. David Alan Gilbert wrote:
>>>> Yep, seems that current userfaultfd supports hugetlbfs and shared memory for
>>>> missing pages but not for wr-protected..
>>>
>>> For hugepages, you'd need kernel support - but also you'd want to make
>>> sure you write the whole hugepage at once.
>>
>> Or we can do similar things by splitting the huge pages just like when we
>> migrate.
>>
>> I should have overlooked these facts when I replied previusly - we do need the
>> same logic, but also special care on these special memory types.
>>
>>>
>>> For shared, there's a harder problem to ask; what happens if RAM is
>>> written by the other process - for postcopy, we get the other process
>>> to send us a userfaultfd that they have registered with their VM.
>>
>> Good point... so we should need similar things too.
>>
>> Looks like we'd better explicitly disable shmem/hugetlbfs for now from qemu
>> background snapshots before we have prepared these utilities, just in case it
>> got run on some "future" kernels and accidentally got enabled, so the snapshot
>> files could be corrupted ones.
>>
>> Is shmem used a lot in libvirt, or is it even a default configuration?
> 
> No, but it's used with vhost-user applications; like dpdk.
> 
> Dave
> 
>> -- 
>> Peter Xu
>>

Yep.