Message ID: 20201126151734.743849-1-andrey.gruzdev@virtuozzo.com
Series: UFFD write-tracking migration/snapshots
On Thu, Nov 26, 2020 at 18:17:28 +0300, Andrey Gruzdev via wrote: > This patch series is a kind of 'rethinking' of Denis Plotnikov's ideas he's > implemented in his series '[PATCH v0 0/4] migration: add background snapshot'. > > Currently the only way to make (external) live VM snapshot is using existing > dirty page logging migration mechanism. The main problem is that it tends to > produce a lot of page duplicates while running VM goes on updating already > saved pages. That leads to the fact that vmstate image size is commonly several > times bigger then non-zero part of virtual machine's RSS. Time required to > converge RAM migration and the size of snapshot image severely depend on the > guest memory write rate, sometimes resulting in unacceptably long snapshot > creation time and huge image size. > > This series propose a way to solve the aforementioned problems. This is done > by using different RAM migration mechanism based on UFFD write protection > management introduced in v5.7 kernel. The migration strategy is to 'freeze' > guest RAM content using write-protection and iteratively release protection > for memory ranges that have already been saved to the migration stream. > At the same time we read in pending UFFD write fault events and save those > pages out-of-order with higher priority. This sounds amazing! Based on your description I assume that the final memory image contains state image from the beginning of the migration. This would make it desirable for the 'migrate' qmp command to be used as part of a 'transaction' qmp command so that we can do an instant disk and memory snapshot without any extraneous pausing of the VM. I'll have a look at using this mechanism in libvirt natively.
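For reference, the write-protect cycle described in the cover letter maps onto the userfaultfd API roughly as in the minimal sketch below (Linux 5.7+). This is illustrative only: error handling is omitted and save_page_out_of_order() is a hypothetical placeholder, not an actual function from the series.

/*
 * Illustrative sketch of the UFFD write-protect cycle described above
 * (Linux >= 5.7). Error handling is omitted; save_page_out_of_order()
 * is a hypothetical placeholder, not an actual QEMU function.
 */
#include <fcntl.h>
#include <linux/userfaultfd.h>
#include <stdint.h>
#include <sys/ioctl.h>
#include <sys/syscall.h>
#include <unistd.h>

static void track_and_save(void *ram, uint64_t len, uint64_t page_size)
{
    int uffd = syscall(__NR_userfaultfd, O_CLOEXEC);

    /* Negotiate the API and ask for write-protect fault reporting. */
    struct uffdio_api api = { .api = UFFD_API,
                              .features = UFFD_FEATURE_PAGEFAULT_FLAG_WP };
    ioctl(uffd, UFFDIO_API, &api);

    /* Register the RAM range in write-protect mode. */
    struct uffdio_register reg = {
        .range = { .start = (uintptr_t)ram, .len = len },
        .mode = UFFDIO_REGISTER_MODE_WP,
    };
    ioctl(uffd, UFFDIO_REGISTER, &reg);

    /* 'Freeze' the content: from now on every write raises a fault. */
    struct uffdio_writeprotect wp = {
        .range = { .start = (uintptr_t)ram, .len = len },
        .mode = UFFDIO_WRITEPROTECT_MODE_WP,
    };
    ioctl(uffd, UFFDIO_WRITEPROTECT, &wp);

    /*
     * Fault loop: save the faulting page out of order, then drop the
     * protection for that page so the writer can continue. The background
     * saving code would likewise unprotect each range after writing it to
     * the migration stream.
     */
    struct uffd_msg msg;
    while (read(uffd, &msg, sizeof(msg)) == sizeof(msg)) {
        if (msg.event != UFFD_EVENT_PAGEFAULT ||
            !(msg.arg.pagefault.flags & UFFD_PAGEFAULT_FLAG_WP)) {
            continue;
        }
        uint64_t addr = msg.arg.pagefault.address & ~(page_size - 1);
        /* save_page_out_of_order(addr, page_size);  -- hypothetical */
        struct uffdio_writeprotect unprot = {
            .range = { .start = addr, .len = page_size },
            .mode = 0, /* clear write protection */
        };
        ioctl(uffd, UFFDIO_WRITEPROTECT, &unprot);
    }
    close(uffd);
}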
On 26.11.2020 18:47, Peter Krempa wrote: > On Thu, Nov 26, 2020 at 18:17:28 +0300, Andrey Gruzdev via wrote: >> This patch series is a kind of 'rethinking' of Denis Plotnikov's ideas he's >> implemented in his series '[PATCH v0 0/4] migration: add background snapshot'. >> >> Currently the only way to make (external) live VM snapshot is using existing >> dirty page logging migration mechanism. The main problem is that it tends to >> produce a lot of page duplicates while running VM goes on updating already >> saved pages. That leads to the fact that vmstate image size is commonly several >> times bigger then non-zero part of virtual machine's RSS. Time required to >> converge RAM migration and the size of snapshot image severely depend on the >> guest memory write rate, sometimes resulting in unacceptably long snapshot >> creation time and huge image size. >> >> This series propose a way to solve the aforementioned problems. This is done >> by using different RAM migration mechanism based on UFFD write protection >> management introduced in v5.7 kernel. The migration strategy is to 'freeze' >> guest RAM content using write-protection and iteratively release protection >> for memory ranges that have already been saved to the migration stream. >> At the same time we read in pending UFFD write fault events and save those >> pages out-of-order with higher priority. > > This sounds amazing! Based on your description I assume that the final > memory image contains state image from the beginning of the migration. > > This would make it desirable for the 'migrate' qmp command to be used as > part of a 'transaction' qmp command so that we can do an instant disk > and memory snapshot without any extraneous pausing of the VM. > > I'll have a look at using this mechanism in libvirt natively. > Correct, the final image contains the state at the beginning of migration. So far, if I'm not missing something about libvirt, for external snapshot creation it performs a sequence like this: migrate(fd)->transaction(blockdev-snapshot-all)->cont. So, in case the 'background-snapshot' capability is enabled, the sequence would change to: stop->transaction(blockdev-snapshot-all)->migrate(fd). With background snapshot migration it will finish with the VM running, so there's no need to 'cont' here. Thanks, I'd appreciate it a lot if you could look into integrating the new mechanism into libvirt!
On Fri, Nov 27, 2020 at 11:21:39 +0300, Andrey Gruzdev wrote: > On 26.11.2020 18:47, Peter Krempa wrote: > > On Thu, Nov 26, 2020 at 18:17:28 +0300, Andrey Gruzdev via wrote: > > > This patch series is a kind of 'rethinking' of Denis Plotnikov's ideas he's > > > implemented in his series '[PATCH v0 0/4] migration: add background snapshot'. > > > > > > Currently the only way to make (external) live VM snapshot is using existing > > > dirty page logging migration mechanism. The main problem is that it tends to > > > produce a lot of page duplicates while running VM goes on updating already > > > saved pages. That leads to the fact that vmstate image size is commonly several > > > times bigger then non-zero part of virtual machine's RSS. Time required to > > > converge RAM migration and the size of snapshot image severely depend on the > > > guest memory write rate, sometimes resulting in unacceptably long snapshot > > > creation time and huge image size. > > > > > > This series propose a way to solve the aforementioned problems. This is done > > > by using different RAM migration mechanism based on UFFD write protection > > > management introduced in v5.7 kernel. The migration strategy is to 'freeze' > > > guest RAM content using write-protection and iteratively release protection > > > for memory ranges that have already been saved to the migration stream. > > > At the same time we read in pending UFFD write fault events and save those > > > pages out-of-order with higher priority. > > > > This sounds amazing! Based on your description I assume that the final > > memory image contains state image from the beginning of the migration. > > > > This would make it desirable for the 'migrate' qmp command to be used as > > part of a 'transaction' qmp command so that we can do an instant disk > > and memory snapshot without any extraneous pausing of the VM. > > > > I'll have a look at using this mechanism in libvirt natively. > > > > Correct, the final image contains state at the beginning of migration. > > So far, if I'm not missing something about libvirt, for external snapshot > creation it performs a sequence like that: > migrate(fd)->transaction(blockdev-snapshot-all)->cont. > > So, in case 'background-snapshot' capability is enabled, the sequence would > change to: > stop->transaction(blockdev-snapshot-all)->migrate(fd). > With background snapshot migration it will finish with VM running so there's > not need to 'cont' here. Yes, that's correct. The reason I've suggested that 'migrate' being part of a 'transaction' is that it would remove the need to stop it for the disk snapshot part.
On 27.11.2020 12:49, Peter Krempa wrote: > On Fri, Nov 27, 2020 at 11:21:39 +0300, Andrey Gruzdev wrote: >> On 26.11.2020 18:47, Peter Krempa wrote: >>> On Thu, Nov 26, 2020 at 18:17:28 +0300, Andrey Gruzdev via wrote: >>>> This patch series is a kind of 'rethinking' of Denis Plotnikov's ideas he's >>>> implemented in his series '[PATCH v0 0/4] migration: add background snapshot'. >>>> >>>> Currently the only way to make (external) live VM snapshot is using existing >>>> dirty page logging migration mechanism. The main problem is that it tends to >>>> produce a lot of page duplicates while running VM goes on updating already >>>> saved pages. That leads to the fact that vmstate image size is commonly several >>>> times bigger then non-zero part of virtual machine's RSS. Time required to >>>> converge RAM migration and the size of snapshot image severely depend on the >>>> guest memory write rate, sometimes resulting in unacceptably long snapshot >>>> creation time and huge image size. >>>> >>>> This series propose a way to solve the aforementioned problems. This is done >>>> by using different RAM migration mechanism based on UFFD write protection >>>> management introduced in v5.7 kernel. The migration strategy is to 'freeze' >>>> guest RAM content using write-protection and iteratively release protection >>>> for memory ranges that have already been saved to the migration stream. >>>> At the same time we read in pending UFFD write fault events and save those >>>> pages out-of-order with higher priority. >>> >>> This sounds amazing! Based on your description I assume that the final >>> memory image contains state image from the beginning of the migration. >>> >>> This would make it desirable for the 'migrate' qmp command to be used as >>> part of a 'transaction' qmp command so that we can do an instant disk >>> and memory snapshot without any extraneous pausing of the VM. >>> >>> I'll have a look at using this mechanism in libvirt natively. >>> >> >> Correct, the final image contains state at the beginning of migration. >> >> So far, if I'm not missing something about libvirt, for external snapshot >> creation it performs a sequence like that: >> migrate(fd)->transaction(blockdev-snapshot-all)->cont. >> >> So, in case 'background-snapshot' capability is enabled, the sequence would >> change to: >> stop->transaction(blockdev-snapshot-all)->migrate(fd). >> With background snapshot migration it will finish with VM running so there's >> not need to 'cont' here. > > Yes, that's correct. > > The reason I've suggested that 'migrate' being part of a 'transaction' > is that it would remove the need to stop it for the disk snapshot part. > Hmm, I believe stopping VM for a short time is unavoidable to keep saved device state consistent with blockdev snapshot base.. May be I've missed something but it seems logical.
On Fri, Nov 27, 2020 at 01:00:48PM +0300, Andrey Gruzdev wrote: > On 27.11.2020 12:49, Peter Krempa wrote: > > On Fri, Nov 27, 2020 at 11:21:39 +0300, Andrey Gruzdev wrote: > > > On 26.11.2020 18:47, Peter Krempa wrote: > > > > On Thu, Nov 26, 2020 at 18:17:28 +0300, Andrey Gruzdev via wrote: > > > > > This patch series is a kind of 'rethinking' of Denis Plotnikov's ideas he's > > > > > implemented in his series '[PATCH v0 0/4] migration: add background snapshot'. > > > > > > > > > > Currently the only way to make (external) live VM snapshot is using existing > > > > > dirty page logging migration mechanism. The main problem is that it tends to > > > > > produce a lot of page duplicates while running VM goes on updating already > > > > > saved pages. That leads to the fact that vmstate image size is commonly several > > > > > times bigger then non-zero part of virtual machine's RSS. Time required to > > > > > converge RAM migration and the size of snapshot image severely depend on the > > > > > guest memory write rate, sometimes resulting in unacceptably long snapshot > > > > > creation time and huge image size. > > > > > > > > > > This series propose a way to solve the aforementioned problems. This is done > > > > > by using different RAM migration mechanism based on UFFD write protection > > > > > management introduced in v5.7 kernel. The migration strategy is to 'freeze' > > > > > guest RAM content using write-protection and iteratively release protection > > > > > for memory ranges that have already been saved to the migration stream. > > > > > At the same time we read in pending UFFD write fault events and save those > > > > > pages out-of-order with higher priority. > > > > > > > > This sounds amazing! Based on your description I assume that the final > > > > memory image contains state image from the beginning of the migration. > > > > > > > > This would make it desirable for the 'migrate' qmp command to be used as > > > > part of a 'transaction' qmp command so that we can do an instant disk > > > > and memory snapshot without any extraneous pausing of the VM. > > > > > > > > I'll have a look at using this mechanism in libvirt natively. > > > > > > > > > > Correct, the final image contains state at the beginning of migration. > > > > > > So far, if I'm not missing something about libvirt, for external snapshot > > > creation it performs a sequence like that: > > > migrate(fd)->transaction(blockdev-snapshot-all)->cont. > > > > > > So, in case 'background-snapshot' capability is enabled, the sequence would > > > change to: > > > stop->transaction(blockdev-snapshot-all)->migrate(fd). > > > With background snapshot migration it will finish with VM running so there's > > > not need to 'cont' here. > > > > Yes, that's correct. > > > > The reason I've suggested that 'migrate' being part of a 'transaction' > > is that it would remove the need to stop it for the disk snapshot part. > > > > Hmm, I believe stopping VM for a short time is unavoidable to keep saved > device state consistent with blockdev snapshot base.. May be I've missed > something but it seems logical. I guess PeterK meant an explicit stop vm command, rather than the stop in this series that should be undetectable from the guest. It would be definitely great if PeterK could quickly follow up with this on libvirt. :) One overall comment (before go into details) is that please remember to start using "migration: " prefixes in patch subjects from the next version. Thanks,
On 27.11.2020 18:45, Peter Xu wrote: > On Fri, Nov 27, 2020 at 01:00:48PM +0300, Andrey Gruzdev wrote: >> On 27.11.2020 12:49, Peter Krempa wrote: >>> On Fri, Nov 27, 2020 at 11:21:39 +0300, Andrey Gruzdev wrote: >>>> On 26.11.2020 18:47, Peter Krempa wrote: >>>>> On Thu, Nov 26, 2020 at 18:17:28 +0300, Andrey Gruzdev via wrote: >>>>>> This patch series is a kind of 'rethinking' of Denis Plotnikov's ideas he's >>>>>> implemented in his series '[PATCH v0 0/4] migration: add background snapshot'. >>>>>> >>>>>> Currently the only way to make (external) live VM snapshot is using existing >>>>>> dirty page logging migration mechanism. The main problem is that it tends to >>>>>> produce a lot of page duplicates while running VM goes on updating already >>>>>> saved pages. That leads to the fact that vmstate image size is commonly several >>>>>> times bigger then non-zero part of virtual machine's RSS. Time required to >>>>>> converge RAM migration and the size of snapshot image severely depend on the >>>>>> guest memory write rate, sometimes resulting in unacceptably long snapshot >>>>>> creation time and huge image size. >>>>>> >>>>>> This series propose a way to solve the aforementioned problems. This is done >>>>>> by using different RAM migration mechanism based on UFFD write protection >>>>>> management introduced in v5.7 kernel. The migration strategy is to 'freeze' >>>>>> guest RAM content using write-protection and iteratively release protection >>>>>> for memory ranges that have already been saved to the migration stream. >>>>>> At the same time we read in pending UFFD write fault events and save those >>>>>> pages out-of-order with higher priority. >>>>> >>>>> This sounds amazing! Based on your description I assume that the final >>>>> memory image contains state image from the beginning of the migration. >>>>> >>>>> This would make it desirable for the 'migrate' qmp command to be used as >>>>> part of a 'transaction' qmp command so that we can do an instant disk >>>>> and memory snapshot without any extraneous pausing of the VM. >>>>> >>>>> I'll have a look at using this mechanism in libvirt natively. >>>>> >>>> >>>> Correct, the final image contains state at the beginning of migration. >>>> >>>> So far, if I'm not missing something about libvirt, for external snapshot >>>> creation it performs a sequence like that: >>>> migrate(fd)->transaction(blockdev-snapshot-all)->cont. >>>> >>>> So, in case 'background-snapshot' capability is enabled, the sequence would >>>> change to: >>>> stop->transaction(blockdev-snapshot-all)->migrate(fd). >>>> With background snapshot migration it will finish with VM running so there's >>>> not need to 'cont' here. >>> >>> Yes, that's correct. >>> >>> The reason I've suggested that 'migrate' being part of a 'transaction' >>> is that it would remove the need to stop it for the disk snapshot part. >>> >> >> Hmm, I believe stopping VM for a short time is unavoidable to keep saved >> device state consistent with blockdev snapshot base.. May be I've missed >> something but it seems logical. > > I guess PeterK meant an explicit stop vm command, rather than the stop in this > series that should be undetectable from the guest. It would be definitely > great if PeterK could quickly follow up with this on libvirt. :) > Sorry if I misunderstood, sure it would be great if PeterK is really interested to follow up with this feature! 
> One overall comment (before go into details) is that please remember to start > using "migration: " prefixes in patch subjects from the next version. > Yes, I really missed it everywhere :) > Thanks, >
On Thu, Nov 26, 2020 at 06:17:28PM +0300, Andrey Gruzdev wrote: > Changes v3->v4: > > * 1. Renamed migrate capability 'track-writes-ram'->'background-snapshot'. > * 2. Use array of incompatible caps to replace bulky 'if' constructs. > * 3. Moved UFFD low-level code to the separate module ('util/userfaultfd.c'). > * 4. Always do UFFD wr-unprotect on cleanup; just closing file descriptor > * won't cleanup PTEs anyhow, it will release registration ranges, wait > * queues etc. but won't cleanup process MM context on MMU level. > * 5. Allow to enable 'background-snapshot' capability on Linux-only hosts. > * 6. Put UFFD code usage under '#ifdef CONFIG_LINUX' prerequisite. > * 7. Removed 'wt_' from RAMState struct. > * 8. Refactored ram_find_and_save_block() to make more clean - poll UFFD > * wr-fault events in get_queued_page(), use ram_save_host_page_pre(), > * ram_save_host_page_post() notifiers around ram_save_host_page() > * instead of bulky inline write-unprotect code. One thing I mentioned previously but it seems still got lost is that we don't need dirty tracking for live snapshot. A few pointers for reference: memory_global_dirty_log_start() migration_bitmap_sync_precopy() memory_region_clear_dirty_bitmap() ... These should not be needed. But this can also be done on top. Thanks,
On 28.11.2020 01:04, Peter Xu wrote: > On Thu, Nov 26, 2020 at 06:17:28PM +0300, Andrey Gruzdev wrote: >> Changes v3->v4: >> >> * 1. Renamed migrate capability 'track-writes-ram'->'background-snapshot'. >> * 2. Use array of incompatible caps to replace bulky 'if' constructs. >> * 3. Moved UFFD low-level code to the separate module ('util/userfaultfd.c'). >> * 4. Always do UFFD wr-unprotect on cleanup; just closing file descriptor >> * won't cleanup PTEs anyhow, it will release registration ranges, wait >> * queues etc. but won't cleanup process MM context on MMU level. >> * 5. Allow to enable 'background-snapshot' capability on Linux-only hosts. >> * 6. Put UFFD code usage under '#ifdef CONFIG_LINUX' prerequisite. >> * 7. Removed 'wt_' from RAMState struct. >> * 8. Refactored ram_find_and_save_block() to make more clean - poll UFFD >> * wr-fault events in get_queued_page(), use ram_save_host_page_pre(), >> * ram_save_host_page_post() notifiers around ram_save_host_page() >> * instead of bulky inline write-unprotect code. > > One thing I mentioned previously but it seems still got lost is that we don't > need dirty tracking for live snapshot. > > A few pointers for reference: > > memory_global_dirty_log_start() > migration_bitmap_sync_precopy() > memory_region_clear_dirty_bitmap() > ... > > These should not be needed. But this can also be done on top. > > Thanks, > Yes, all of this is really not needed. At least it seems that enabling dirty page logging in KVM doesn't add too much overhead, and migration_bitmap_sync_precopy() is called only once. Let's do it on top.
On Thu, Nov 26, 2020 at 18:17:28 +0300, Andrey Gruzdev via wrote: > This patch series is a kind of 'rethinking' of Denis Plotnikov's ideas he's > implemented in his series '[PATCH v0 0/4] migration: add background snapshot'. Hi, I gave this a try when attempting to implement the libvirt code for this feature. I've ran into a problem of migration failing right away. The VM's cpus were running at that point. QEMU logged the following to stdout/err: 2020-12-01T06:50:42.334062Z qemu-system-x86_64: uffd_register_memory() failed: start=7f2007fff000 len=33554432000 mode=2 errno=22 2020-12-01T06:50:42.334072Z qemu-system-x86_64: ram_write_tracking_start() failed: restoring initial memory state 2020-12-01T06:50:42.334074Z qemu-system-x86_64: uffd_protect_memory() failed: start=7f2007fff000 len=33554432000 mode=0 errno=2 2020-12-01T06:50:42.334076Z qemu-system-x86_64: uffd_unregister_memory() failed: start=7f2007fff000 len=33554432000 errno=22 The migration was started by the following QMP conversation: QEMU_MONITOR_IO_WRITE: mon=0x7fff9c20c610 buf={"execute":"migrate-set-capabilities","arguments":{"capabilities":[{"capability":"xbzrle","state":false},{"capability":"auto-converge","state":false},{"capability":"rdma-pin-all","state":false},{"capability":"postcopy-ram","state":false},{"capability":"compress","state":false},{"capability":"pause-before-switchover","state":false},{"capability":"late-block-activate","state":false},{"capability":"multifd","state":false},{"capability":"background-snapshot","state":true}]},"id":"libvirt-14"} QEMU_MONITOR_RECV_REPLY: mon=0x7fff9c20c610 reply={"return": {}, "id": "libvirt-14"} QEMU_MONITOR_IO_WRITE: mon=0x7fff9c20c610 buf={"execute":"migrate-set-parameters","arguments":{"max-bandwidth":9223372036853727232},"id":"libvirt-15"} QEMU_MONITOR_RECV_REPLY: mon=0x7fff9c20c610 reply={"return": {}, "id": "libvirt-15"} QEMU_MONITOR_IO_WRITE: mon=0x7fff9c20c610 buf={"execute":"getfd","arguments":{"fdname":"migrate"},"id":"libvirt-16"} QEMU_MONITOR_IO_SEND_FD: mon=0x7fff9c20c610 fd=44 ret=72 errno=0 QEMU_MONITOR_RECV_REPLY: mon=0x7fff9c20c610 reply={"return": {}, "id": "libvirt-16"} QEMU_MONITOR_IO_WRITE: mon=0x7fff9c20c610 buf={"execute":"migrate","arguments":{"detach":true,"blk":false,"inc":false,"uri":"fd:migrate"},"id":"libvirt-17"} QEMU_MONITOR_RECV_EVENT: mon=0x7fff9c20c610 event={"timestamp": {"seconds": 1606805733, "microseconds": 962424}, "event": "MIGRATION", "data": {"status": "setup"}} QEMU_MONITOR_RECV_REPLY: mon=0x7fff9c20c610 reply={"return": {}, "id": "libvirt-17"} QEMU_MONITOR_RECV_EVENT: mon=0x7fff9c20c610 event={"timestamp": {"seconds": 1606805733, "microseconds": 966306}, "event": "MIGRATION_PASS", "data": {"pass": 1}} QEMU_MONITOR_RECV_EVENT: mon=0x7fff9c20c610 event={"timestamp": {"seconds": 1606805733, "microseconds": 966355}, "event": "MIGRATION", "data": {"status": "active"}} QEMU_MONITOR_RECV_EVENT: mon=0x7fff9c20c610 event={"timestamp": {"seconds": 1606805733, "microseconds": 966488}, "event": "STOP"} QEMU_MONITOR_RECV_EVENT: mon=0x7fff9c20c610 event={"timestamp": {"seconds": 1606805733, "microseconds": 970326}, "event": "MIGRATION", "data": {"status": "failed"}} QEMU_MONITOR_IO_WRITE: mon=0x7fff9c20c610 buf={"execute":"query-migrate","id":"libvirt-18"} QEMU_MONITOR_RECV_REPLY: mon=0x7fff9c20c610 reply={"return": {"status": "failed"}, "id": "libvirt-18"} qemuMigrationJobCheckStatus:1685 : operation failed: snapshot job: unexpectedly failed $ uname -r 5.8.18-300.fc33.x86_64 created by libvirt with the following patchset applied: 
https://gitlab.com/pipo.sk/libvirt/-/commits/background-snapshot

git fetch https://gitlab.com/pipo.sk/libvirt.git background-snapshot

Start the snapshot via:

virsh snapshot-create-as --memspec /tmp/snap.mem --diskspec sdb,snapshot=no --diskspec sda,snapshot=no --no-metadata upstream

Note you can omit --diskspec if you have a diskless VM.

The patches are very much a work in progress, as I need to figure out the proper sequencing to ensure a consistent snapshot.

Note that in cases when qemu can't guarantee that the background_snapshot feature will work, it should not advertise it. We need a way to check whether it's possible to use it, so we can replace the existing --live flag with it rather than adding a new one and shifting the problem of checking whether the feature works to the user.
On 01.12.2020 10:08, Peter Krempa wrote: > On Thu, Nov 26, 2020 at 18:17:28 +0300, Andrey Gruzdev via wrote: >> This patch series is a kind of 'rethinking' of Denis Plotnikov's ideas he's >> implemented in his series '[PATCH v0 0/4] migration: add background snapshot'. > > Hi, > > I gave this a try when attempting to implement the libvirt code for this > feature. I've ran into a problem of migration failing right away. The > VM's cpus were running at that point. > > QEMU logged the following to stdout/err: > > 2020-12-01T06:50:42.334062Z qemu-system-x86_64: uffd_register_memory() failed: start=7f2007fff000 len=33554432000 mode=2 errno=22 > 2020-12-01T06:50:42.334072Z qemu-system-x86_64: ram_write_tracking_start() failed: restoring initial memory state > 2020-12-01T06:50:42.334074Z qemu-system-x86_64: uffd_protect_memory() failed: start=7f2007fff000 len=33554432000 mode=0 errno=2 > 2020-12-01T06:50:42.334076Z qemu-system-x86_64: uffd_unregister_memory() failed: start=7f2007fff000 len=33554432000 errno=22 > > > The migration was started by the following QMP conversation: > > > QEMU_MONITOR_IO_WRITE: mon=0x7fff9c20c610 buf={"execute":"migrate-set-capabilities","arguments":{"capabilities":[{"capability":"xbzrle","state":false},{"capability":"auto-converge","state":false},{"capability":"rdma-pin-all","state":false},{"capability":"postcopy-ram","state":false},{"capability":"compress","state":false},{"capability":"pause-before-switchover","state":false},{"capability":"late-block-activate","state":false},{"capability":"multifd","state":false},{"capability":"background-snapshot","state":true}]},"id":"libvirt-14"} > > QEMU_MONITOR_RECV_REPLY: mon=0x7fff9c20c610 reply={"return": {}, "id": "libvirt-14"} > > QEMU_MONITOR_IO_WRITE: mon=0x7fff9c20c610 buf={"execute":"migrate-set-parameters","arguments":{"max-bandwidth":9223372036853727232},"id":"libvirt-15"} > > QEMU_MONITOR_RECV_REPLY: mon=0x7fff9c20c610 reply={"return": {}, "id": "libvirt-15"} > > QEMU_MONITOR_IO_WRITE: mon=0x7fff9c20c610 buf={"execute":"getfd","arguments":{"fdname":"migrate"},"id":"libvirt-16"} > > QEMU_MONITOR_IO_SEND_FD: mon=0x7fff9c20c610 fd=44 ret=72 errno=0 > QEMU_MONITOR_RECV_REPLY: mon=0x7fff9c20c610 reply={"return": {}, "id": "libvirt-16"} > > QEMU_MONITOR_IO_WRITE: mon=0x7fff9c20c610 buf={"execute":"migrate","arguments":{"detach":true,"blk":false,"inc":false,"uri":"fd:migrate"},"id":"libvirt-17"} > > QEMU_MONITOR_RECV_EVENT: mon=0x7fff9c20c610 event={"timestamp": {"seconds": 1606805733, "microseconds": 962424}, "event": "MIGRATION", "data": {"status": "setup"}} > QEMU_MONITOR_RECV_REPLY: mon=0x7fff9c20c610 reply={"return": {}, "id": "libvirt-17"} > QEMU_MONITOR_RECV_EVENT: mon=0x7fff9c20c610 event={"timestamp": {"seconds": 1606805733, "microseconds": 966306}, "event": "MIGRATION_PASS", "data": {"pass": 1}} > QEMU_MONITOR_RECV_EVENT: mon=0x7fff9c20c610 event={"timestamp": {"seconds": 1606805733, "microseconds": 966355}, "event": "MIGRATION", "data": {"status": "active"}} > QEMU_MONITOR_RECV_EVENT: mon=0x7fff9c20c610 event={"timestamp": {"seconds": 1606805733, "microseconds": 966488}, "event": "STOP"} > QEMU_MONITOR_RECV_EVENT: mon=0x7fff9c20c610 event={"timestamp": {"seconds": 1606805733, "microseconds": 970326}, "event": "MIGRATION", "data": {"status": "failed"}} > > QEMU_MONITOR_IO_WRITE: mon=0x7fff9c20c610 buf={"execute":"query-migrate","id":"libvirt-18"} > > QEMU_MONITOR_RECV_REPLY: mon=0x7fff9c20c610 reply={"return": {"status": "failed"}, "id": "libvirt-18"} > qemuMigrationJobCheckStatus:1685 : operation failed: snapshot 
job: unexpectedly failed > > $ uname -r > 5.8.18-300.fc33.x86_64 > > > created by libvirt with the following patchset applied: > > https://gitlab.com/pipo.sk/libvirt/-/commits/background-snapshot > > git fetch https://gitlab.com/pipo.sk/libvirt.git background-snapshot > > Start the snapshot via: > > virsh snapshot-create-as --memspec /tmp/snap.mem --diskspec sdb,snapshot=no --diskspec sda,snapshot=no --no-metadata upstream > > Note you can omit --diskspec if you have a diskless VM. > > The patches are VERY work in progress as I need to figure out the proper > sequencing to ensure a consistent snapshot. > > Note that in cases when qemu can't guarantee that the > background_snapshot feature will work it should not advertise it. We > need a way to check whether it's possible to use it, so we can replace > the existing --live flag with it rather than adding a new one and > shifting the problem of checking whether the feature works to the user. > Hi, Maybe you are using hugetlbfs as the memory backend? I totally agree that we need to somehow check that the kernel and the VM memory backend support the feature before one can enable the capability. Need to think about that. Thanks,
On Tue, Dec 01, 2020 at 11:42:18 +0300, Andrey Gruzdev wrote: > On 01.12.2020 10:08, Peter Krempa wrote: > > On Thu, Nov 26, 2020 at 18:17:28 +0300, Andrey Gruzdev via wrote: > > > This patch series is a kind of 'rethinking' of Denis Plotnikov's ideas he's [...] > > Note that in cases when qemu can't guarantee that the > > background_snapshot feature will work it should not advertise it. We > > need a way to check whether it's possible to use it, so we can replace > > the existing --live flag with it rather than adding a new one and > > shifting the problem of checking whether the feature works to the user. > > > > Hi, > > May be you are using hugetlbfs as memory backend?

Not exactly hugepages, but I had:

<memoryBacking>
<access mode='shared'/>
</memoryBacking>

which resulted in the following command line to instantiate memory:

-object memory-backend-file,id=pc.ram,mem-path=/var/lib/libvirt/qemu/ram/6-upstream-bj/pc.ram,share=yes,size=33554432000,host-nodes=0,policy=bind \

When I removed it, I got:

-object memory-backend-ram,id=pc.ram,size=33554432000,host-nodes=0,policy=bind \

And the migration didn't fail in my quick test. I'll have a more detailed look later, thanks for the pointer.

> I totally agree that we need somehow check that kernel and VM memory backend > support the feature before one can enable the capability. > Need to think about that.. Definitely. Also note that memory backed by memory-backend-file will be more and more common, for cases such as virtiofs DAX sharing and similar.
On 01.12.2020 13:53, Peter Krempa wrote: > On Tue, Dec 01, 2020 at 11:42:18 +0300, Andrey Gruzdev wrote: >> On 01.12.2020 10:08, Peter Krempa wrote: >>> On Thu, Nov 26, 2020 at 18:17:28 +0300, Andrey Gruzdev via wrote: >>>> This patch series is a kind of 'rethinking' of Denis Plotnikov's ideas he's > > [...] > >>> Note that in cases when qemu can't guarantee that the >>> background_snapshot feature will work it should not advertise it. We >>> need a way to check whether it's possible to use it, so we can replace >>> the existing --live flag with it rather than adding a new one and >>> shifting the problem of checking whether the feature works to the user. >>> >> >> Hi, >> >> May be you are using hugetlbfs as memory backend? > > Not exactly hugepages, but I had: > > <memoryBacking> > <access mode='shared'/> > </memoryBacking> > > which resulted into the following commandline to instantiate memory: > > -object memory-backend-file,id=pc.ram,mem-path=/var/lib/libvirt/qemu/ram/6-upstream-bj/pc.ram,share=yes,size=33554432000,host-nodes=0,policy=bind \ > > When I've removed it I got: > > -object memory-backend-ram,id=pc.ram,size=33554432000,host-nodes=0,policy=bind \ > > And the migration didn't fail in my quick test. I'll have a more > detailed look later, thanks for the pointer. > Yep, seems that current userfaultfd supports hugetlbfs and shared memory for missing pages but not for wr-protected.. >> I totally agree that we need somehow check that kernel and VM memory backend >> support the feature before one can enable the capability. >> Need to think about that.. > > Definitely. Also note that memory backed by memory-backend-file will be > more and more common, for cases such as virtiofs DAX sharing and > similar. > I see.. That needs support from kernel side, so far 'background-snapshots' are incompatible with memory-backend-file sharing.
* Andrey Gruzdev (andrey.gruzdev@virtuozzo.com) wrote: > On 01.12.2020 13:53, Peter Krempa wrote: > > On Tue, Dec 01, 2020 at 11:42:18 +0300, Andrey Gruzdev wrote: > > > On 01.12.2020 10:08, Peter Krempa wrote: > > > > On Thu, Nov 26, 2020 at 18:17:28 +0300, Andrey Gruzdev via wrote: > > > > > This patch series is a kind of 'rethinking' of Denis Plotnikov's ideas he's > > > > [...] > > > > > > Note that in cases when qemu can't guarantee that the > > > > background_snapshot feature will work it should not advertise it. We > > > > need a way to check whether it's possible to use it, so we can replace > > > > the existing --live flag with it rather than adding a new one and > > > > shifting the problem of checking whether the feature works to the user. > > > > > > > > > > Hi, > > > > > > May be you are using hugetlbfs as memory backend? > > > > Not exactly hugepages, but I had: > > > > <memoryBacking> > > <access mode='shared'/> > > </memoryBacking> > > > > which resulted into the following commandline to instantiate memory: > > > > -object memory-backend-file,id=pc.ram,mem-path=/var/lib/libvirt/qemu/ram/6-upstream-bj/pc.ram,share=yes,size=33554432000,host-nodes=0,policy=bind \ > > > > When I've removed it I got: > > > > -object memory-backend-ram,id=pc.ram,size=33554432000,host-nodes=0,policy=bind \ > > > > And the migration didn't fail in my quick test. I'll have a more > > detailed look later, thanks for the pointer. > > > > Yep, seems that current userfaultfd supports hugetlbfs and shared memory for > missing pages but not for wr-protected.. For hugepages, you'd need kernel support - but also you'd want to make sure you write the whole hugepage at once. For shared, there's a harder problem to ask; what happens if RAM is written by the other process - for postcopy, we get the other process to send us a userfaultfd that they have registered with their VM. Dave > > > I totally agree that we need somehow check that kernel and VM memory backend > > > support the feature before one can enable the capability. > > > Need to think about that.. > > > > Definitely. Also note that memory backed by memory-backend-file will be > > more and more common, for cases such as virtiofs DAX sharing and > > similar. > > > > I see.. That needs support from kernel side, so far 'background-snapshots' > are incompatible with memory-backend-file sharing. > > -- > Andrey Gruzdev, Principal Engineer > Virtuozzo GmbH +7-903-247-6397 > virtuzzo.com >
On Tue, Dec 01, 2020 at 02:24:12PM +0300, Andrey Gruzdev wrote: > On 01.12.2020 13:53, Peter Krempa wrote: > > On Tue, Dec 01, 2020 at 11:42:18 +0300, Andrey Gruzdev wrote: > > > On 01.12.2020 10:08, Peter Krempa wrote: > > > > On Thu, Nov 26, 2020 at 18:17:28 +0300, Andrey Gruzdev via wrote: > > > > > This patch series is a kind of 'rethinking' of Denis Plotnikov's ideas he's > > > > [...] > > > > > > Note that in cases when qemu can't guarantee that the > > > > background_snapshot feature will work it should not advertise it. We > > > > need a way to check whether it's possible to use it, so we can replace > > > > the existing --live flag with it rather than adding a new one and > > > > shifting the problem of checking whether the feature works to the user. Would it be fine if libvirt just try the new way first anyways? Since if it will fail, it'll fail right away on any unsupported memory types, then logically the libvirt user may not even notice we've retried. Previously I thought it was enough, because so far the kernel does not have a specific flag showing whether such type of memory is supported. But I don't know whether it would be non-trivial for libvirt to retry like that. Another solution is to let qemu test the uffd ioctls right after QEMU memory setup, so we know whether background/live snapshot is supported or not with current memory backends. We should need to try this for every ramblock because I think we can have different types across all the qemu ramblocks. > > > > > > > > > > Hi, > > > > > > May be you are using hugetlbfs as memory backend? > > > > Not exactly hugepages, but I had: > > > > <memoryBacking> > > <access mode='shared'/> > > </memoryBacking> > > > > which resulted into the following commandline to instantiate memory: > > > > -object memory-backend-file,id=pc.ram,mem-path=/var/lib/libvirt/qemu/ram/6-upstream-bj/pc.ram,share=yes,size=33554432000,host-nodes=0,policy=bind \ > > > > When I've removed it I got: > > > > -object memory-backend-ram,id=pc.ram,size=33554432000,host-nodes=0,policy=bind \ > > > > And the migration didn't fail in my quick test. I'll have a more > > detailed look later, thanks for the pointer. > > > > Yep, seems that current userfaultfd supports hugetlbfs and shared memory for > missing pages but not for wr-protected.. Correct. Btw, I'm working on both of them recently. I have a testing kernel branch, but I don't think it should affect our qemu work, though, since qemu should do the same irrelevant of the memory type. We can just test with anonymous memories, and as long as it works, it should work perfectly on all the rest of backends (maybe even for other file-backed memory, more below). > > > > I totally agree that we need somehow check that kernel and VM memory backend > > > support the feature before one can enable the capability. > > > Need to think about that.. > > > > Definitely. Also note that memory backed by memory-backend-file will be > > more and more common, for cases such as virtiofs DAX sharing and > > similar. > > > > I see.. That needs support from kernel side, so far 'background-snapshots' > are incompatible with memory-backend-file sharing. Yes. So as mentioned, shmem/hugetlbfs should be WIP, but I haven't thought about the rest yet. Maybe... it's not hard to add uffd-wp for most of the file-backed memory? Since afaict the kernel handles wr-protect in a quite straightforward way (do_wp_page() for whatever backend), and uffd-wp can be the first to trap all of them. 
I'm not sure whether Andrea has thought about that or even how to spread the missing usage to more types of backends (maybe missing is more special in that it needs to consider page caches). So I'm copying Andrea too just in case there's further input. Thanks,
On Tue, Dec 01, 2020 at 06:40:55PM +0000, Dr. David Alan Gilbert wrote: > > Yep, seems that current userfaultfd supports hugetlbfs and shared memory for > > missing pages but not for wr-protected.. > > For hugepages, you'd need kernel support - but also you'd want to make > sure you write the whole hugepage at once. Or we can do similar things by splitting the huge pages just like when we migrate. I must have overlooked these facts when I replied previously - we do need the same logic, but also special care for these special memory types. > > For shared, there's a harder problem to ask; what happens if RAM is > written by the other process - for postcopy, we get the other process > to send us a userfaultfd that they have registered with their VM. Good point... so we'd need similar things too. Looks like we'd better explicitly disable shmem/hugetlbfs for now from qemu background snapshots before we have prepared these utilities, just in case it gets run on some "future" kernels and accidentally gets enabled, which could produce corrupted snapshot files. Is shmem used a lot in libvirt, or is it even a default configuration?
* Peter Xu (peterx@redhat.com) wrote: > On Tue, Dec 01, 2020 at 02:24:12PM +0300, Andrey Gruzdev wrote: > > On 01.12.2020 13:53, Peter Krempa wrote: > > > On Tue, Dec 01, 2020 at 11:42:18 +0300, Andrey Gruzdev wrote: > > > > On 01.12.2020 10:08, Peter Krempa wrote: > > > > > On Thu, Nov 26, 2020 at 18:17:28 +0300, Andrey Gruzdev via wrote: > > > > > > This patch series is a kind of 'rethinking' of Denis Plotnikov's ideas he's > > > > > > [...] > > > > > > > > Note that in cases when qemu can't guarantee that the > > > > > background_snapshot feature will work it should not advertise it. We > > > > > need a way to check whether it's possible to use it, so we can replace > > > > > the existing --live flag with it rather than adding a new one and > > > > > shifting the problem of checking whether the feature works to the user. > > Would it be fine if libvirt just try the new way first anyways? Since if it > will fail, it'll fail right away on any unsupported memory types, then > logically the libvirt user may not even notice we've retried. > > Previously I thought it was enough, because so far the kernel does not have a > specific flag showing whether such type of memory is supported. But I don't > know whether it would be non-trivial for libvirt to retry like that. > > Another solution is to let qemu test the uffd ioctls right after QEMU memory > setup, so we know whether background/live snapshot is supported or not with > current memory backends. We should need to try this for every ramblock because > I think we can have different types across all the qemu ramblocks. I don't think we actually do that for postcopy; we do some checks like checking if we have any hugepages, and if so checking for it's flags. But note that we do tie it into migrate_caps_check to fail if you try and set the capability. > > > > > > > > > > > > > > Hi, > > > > > > > > May be you are using hugetlbfs as memory backend? > > > > > > Not exactly hugepages, but I had: > > > > > > <memoryBacking> > > > <access mode='shared'/> > > > </memoryBacking> > > > > > > which resulted into the following commandline to instantiate memory: > > > > > > -object memory-backend-file,id=pc.ram,mem-path=/var/lib/libvirt/qemu/ram/6-upstream-bj/pc.ram,share=yes,size=33554432000,host-nodes=0,policy=bind \ > > > > > > When I've removed it I got: > > > > > > -object memory-backend-ram,id=pc.ram,size=33554432000,host-nodes=0,policy=bind \ > > > > > > And the migration didn't fail in my quick test. I'll have a more > > > detailed look later, thanks for the pointer. > > > > > > > Yep, seems that current userfaultfd supports hugetlbfs and shared memory for > > missing pages but not for wr-protected.. > > Correct. Btw, I'm working on both of them recently. I have a testing kernel > branch, but I don't think it should affect our qemu work, though, since qemu > should do the same irrelevant of the memory type. We can just test with > anonymous memories, and as long as it works, it should work perfectly on all > the rest of backends (maybe even for other file-backed memory, more below). > > > > > > > I totally agree that we need somehow check that kernel and VM memory backend > > > > support the feature before one can enable the capability. > > > > Need to think about that.. > > > > > > Definitely. Also note that memory backed by memory-backend-file will be > > > more and more common, for cases such as virtiofs DAX sharing and > > > similar. > > > > > > > I see.. 
That needs support from kernel side, so far 'background-snapshots' > > are incompatible with memory-backend-file sharing. > > Yes. So as mentioned, shmem/hugetlbfs should be WIP, but I haven't thought > about the rest yet. Maybe... it's not hard to add uffd-wp for most of the > file-backed memory? Since afaict the kernel handles wr-protect in a quite > straightforward way (do_wp_page() for whatever backend), and uffd-wp can be the > first to trap all of them. I'm not sure whether Andrea has thought about that > or even on how to spread the missing usage to more types of backends (maybe > missing is more special in that it needs to consider page caches). So I'm > copying Andrea too just in case there's further input. Some would be good; we've got requests for it to work on pmem mmaped devices. You do have to be a little careful about semantics though; I'm not sure it's that big of a problem for the wp case, but for the absent case you need to worry about finding an equivalent of madvise or fallocate that cna punch a hole. Dave > Thanks, > > -- > Peter Xu >
* Peter Xu (peterx@redhat.com) wrote: > On Tue, Dec 01, 2020 at 06:40:55PM +0000, Dr. David Alan Gilbert wrote: > > > Yep, seems that current userfaultfd supports hugetlbfs and shared memory for > > > missing pages but not for wr-protected.. > > > > For hugepages, you'd need kernel support - but also you'd want to make > > sure you write the whole hugepage at once. > > Or we can do similar things by splitting the huge pages just like when we > migrate. > > I should have overlooked these facts when I replied previusly - we do need the > same logic, but also special care on these special memory types. > > > > > For shared, there's a harder problem to ask; what happens if RAM is > > written by the other process - for postcopy, we get the other process > > to send us a userfaultfd that they have registered with their VM. > > Good point... so we should need similar things too. > > Looks like we'd better explicitly disable shmem/hugetlbfs for now from qemu > background snapshots before we have prepared these utilities, just in case it > got run on some "future" kernels and accidentally got enabled, so the snapshot > files could be corrupted ones. > > Is shmem used a lot in libvirt, or is it even a default configuration? No, but it's used with vhost-user applications; like dpdk. Dave > -- > Peter Xu >
On 01.12.2020 21:40, Dr. David Alan Gilbert wrote: > * Andrey Gruzdev (andrey.gruzdev@virtuozzo.com) wrote: >> On 01.12.2020 13:53, Peter Krempa wrote: >>> On Tue, Dec 01, 2020 at 11:42:18 +0300, Andrey Gruzdev wrote: >>>> On 01.12.2020 10:08, Peter Krempa wrote: >>>>> On Thu, Nov 26, 2020 at 18:17:28 +0300, Andrey Gruzdev via wrote: >>>>>> This patch series is a kind of 'rethinking' of Denis Plotnikov's ideas he's >>> >>> [...] >>> >>>>> Note that in cases when qemu can't guarantee that the >>>>> background_snapshot feature will work it should not advertise it. We >>>>> need a way to check whether it's possible to use it, so we can replace >>>>> the existing --live flag with it rather than adding a new one and >>>>> shifting the problem of checking whether the feature works to the user. >>>>> >>>> >>>> Hi, >>>> >>>> May be you are using hugetlbfs as memory backend? >>> >>> Not exactly hugepages, but I had: >>> >>> <memoryBacking> >>> <access mode='shared'/> >>> </memoryBacking> >>> >>> which resulted into the following commandline to instantiate memory: >>> >>> -object memory-backend-file,id=pc.ram,mem-path=/var/lib/libvirt/qemu/ram/6-upstream-bj/pc.ram,share=yes,size=33554432000,host-nodes=0,policy=bind \ >>> >>> When I've removed it I got: >>> >>> -object memory-backend-ram,id=pc.ram,size=33554432000,host-nodes=0,policy=bind \ >>> >>> And the migration didn't fail in my quick test. I'll have a more >>> detailed look later, thanks for the pointer. >>> >> >> Yep, seems that current userfaultfd supports hugetlbfs and shared memory for >> missing pages but not for wr-protected.. > > For hugepages, you'd need kernel support - but also you'd want to make > sure you write the whole hugepage at once. > > For shared, there's a harder problem to ask; what happens if RAM is > written by the other process - for postcopy, we get the other process > to send us a userfaultfd that they have registered with their VM. > > Dave > Yep. May be problematic. But if used for vhost-user external backend - seems it should work. >>>> I totally agree that we need somehow check that kernel and VM memory backend >>>> support the feature before one can enable the capability. >>>> Need to think about that.. >>> >>> Definitely. Also note that memory backed by memory-backend-file will be >>> more and more common, for cases such as virtiofs DAX sharing and >>> similar. >>> >> >> I see.. That needs support from kernel side, so far 'background-snapshots' >> are incompatible with memory-backend-file sharing. >> >> -- >> Andrey Gruzdev, Principal Engineer >> Virtuozzo GmbH +7-903-247-6397 >> virtuzzo.com >>
On 01.12.2020 21:54, Peter Xu wrote: > On Tue, Dec 01, 2020 at 02:24:12PM +0300, Andrey Gruzdev wrote: >> On 01.12.2020 13:53, Peter Krempa wrote: >>> On Tue, Dec 01, 2020 at 11:42:18 +0300, Andrey Gruzdev wrote: >>>> On 01.12.2020 10:08, Peter Krempa wrote: >>>>> On Thu, Nov 26, 2020 at 18:17:28 +0300, Andrey Gruzdev via wrote: >>>>>> This patch series is a kind of 'rethinking' of Denis Plotnikov's ideas he's >>> >>> [...] >>> >>>>> Note that in cases when qemu can't guarantee that the >>>>> background_snapshot feature will work it should not advertise it. We >>>>> need a way to check whether it's possible to use it, so we can replace >>>>> the existing --live flag with it rather than adding a new one and >>>>> shifting the problem of checking whether the feature works to the user. > > Would it be fine if libvirt just try the new way first anyways? Since if it > will fail, it'll fail right away on any unsupported memory types, then > logically the libvirt user may not even notice we've retried. > > Previously I thought it was enough, because so far the kernel does not have a > specific flag showing whether such type of memory is supported. But I don't > know whether it would be non-trivial for libvirt to retry like that. > > Another solution is to let qemu test the uffd ioctls right after QEMU memory > setup, so we know whether background/live snapshot is supported or not with > current memory backends. We should need to try this for every ramblock because > I think we can have different types across all the qemu ramblocks. > I've already added some code to perform raw check on UFFD feature support on existing QEMU ram_list. That's fast, just trying REGISTER IOCTL with required features for each migratable RAMBlock, then cleanup. So we can do it on migrate_set_capabilty. If some RAMBlock is not compatible (several memory slots with different backends) - we just give up enabling the capability. >>>>> >>>> >>>> Hi, >>>> >>>> May be you are using hugetlbfs as memory backend? >>> >>> Not exactly hugepages, but I had: >>> >>> <memoryBacking> >>> <access mode='shared'/> >>> </memoryBacking> >>> >>> which resulted into the following commandline to instantiate memory: >>> >>> -object memory-backend-file,id=pc.ram,mem-path=/var/lib/libvirt/qemu/ram/6-upstream-bj/pc.ram,share=yes,size=33554432000,host-nodes=0,policy=bind \ >>> >>> When I've removed it I got: >>> >>> -object memory-backend-ram,id=pc.ram,size=33554432000,host-nodes=0,policy=bind \ >>> >>> And the migration didn't fail in my quick test. I'll have a more >>> detailed look later, thanks for the pointer. >>> >> >> Yep, seems that current userfaultfd supports hugetlbfs and shared memory for >> missing pages but not for wr-protected.. > > Correct. Btw, I'm working on both of them recently. I have a testing kernel > branch, but I don't think it should affect our qemu work, though, since qemu > should do the same irrelevant of the memory type. We can just test with > anonymous memories, and as long as it works, it should work perfectly on all > the rest of backends (maybe even for other file-backed memory, more below). > Yes, I know. Nice improvement of userfaultfd feature, IMO. >> >>>> I totally agree that we need somehow check that kernel and VM memory backend >>>> support the feature before one can enable the capability. >>>> Need to think about that.. >>> >>> Definitely. Also note that memory backed by memory-backend-file will be >>> more and more common, for cases such as virtiofs DAX sharing and >>> similar. >>> >> >> I see.. 
That needs support from kernel side, so far 'background-snapshots' >> are incompatible with memory-backend-file sharing. > > Yes. So as mentioned, shmem/hugetlbfs should be WIP, but I haven't thought > about the rest yet. Maybe... it's not hard to add uffd-wp for most of the > file-backed memory? Since afaict the kernel handles wr-protect in a quite > straightforward way (do_wp_page() for whatever backend), and uffd-wp can be the > first to trap all of them. I'm not sure whether Andrea has thought about that > or even on how to spread the missing usage to more types of backends (maybe > missing is more special in that it needs to consider page caches). So I'm > copying Andrea too just in case there's further input. > > Thanks, > Maybe for file-backends it works from the start, but many configurations come with shared hugetlbfs. Right now that's not compatible with background snapshotting; we can't support even anonymous hugetlbfs at the moment.
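The per-RAMBlock check Andrey describes in this message (try the REGISTER ioctl with the required features on each migratable block, then clean up) could look roughly like the probe below. This is an illustrative sketch under the same assumptions, not the code from the series; uffd_wp_supported() is a hypothetical helper name, and on a v5.8 kernel the REGISTER step is what fails with EINVAL for shmem/hugetlbfs-backed ranges, matching the errno=22 seen in the log earlier in the thread.

/*
 * Illustrative probe, not the actual series code: try to register a RAM
 * range for write-protect tracking, check the ioctls mask the kernel
 * returns, then clean up.
 */
#include <fcntl.h>
#include <linux/userfaultfd.h>
#include <stdbool.h>
#include <stdint.h>
#include <sys/ioctl.h>
#include <sys/syscall.h>
#include <unistd.h>

static bool uffd_wp_supported(void *start, uint64_t len)
{
    bool ok = false;
    int uffd = syscall(__NR_userfaultfd, O_CLOEXEC | O_NONBLOCK);
    if (uffd < 0) {
        return false;
    }

    /* Ask for write-protect fault reporting; old kernels reject this. */
    struct uffdio_api api = { .api = UFFD_API,
                              .features = UFFD_FEATURE_PAGEFAULT_FLAG_WP };
    if (ioctl(uffd, UFFDIO_API, &api)) {
        goto out;
    }

    /* Try to register the range in write-protect mode. */
    struct uffdio_register reg = {
        .range = { .start = (uintptr_t)start, .len = len },
        .mode = UFFDIO_REGISTER_MODE_WP,
    };
    if (ioctl(uffd, UFFDIO_REGISTER, &reg)) {
        goto out;
    }

    /* The kernel reports which ioctls are usable for this range. */
    ok = reg.ioctls & (1ULL << _UFFDIO_WRITEPROTECT);

    struct uffdio_range range = { .start = (uintptr_t)start, .len = len };
    ioctl(uffd, UFFDIO_UNREGISTER, &range);
out:
    close(uffd);
    return ok;
}

Run once per migratable RAMBlock at migrate-set-capabilities time, a check of this kind would let QEMU refuse to enable 'background-snapshot' for incompatible backends instead of failing after the migration has started.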
On 01.12.2020 23:01, Dr. David Alan Gilbert wrote: > * Peter Xu (peterx@redhat.com) wrote: >> On Tue, Dec 01, 2020 at 06:40:55PM +0000, Dr. David Alan Gilbert wrote: >>>> Yep, seems that current userfaultfd supports hugetlbfs and shared memory for >>>> missing pages but not for wr-protected.. >>> >>> For hugepages, you'd need kernel support - but also you'd want to make >>> sure you write the whole hugepage at once. >> >> Or we can do similar things by splitting the huge pages just like when we >> migrate. >> >> I should have overlooked these facts when I replied previusly - we do need the >> same logic, but also special care on these special memory types. >> >>> >>> For shared, there's a harder problem to ask; what happens if RAM is >>> written by the other process - for postcopy, we get the other process >>> to send us a userfaultfd that they have registered with their VM. >> >> Good point... so we should need similar things too. >> >> Looks like we'd better explicitly disable shmem/hugetlbfs for now from qemu >> background snapshots before we have prepared these utilities, just in case it >> got run on some "future" kernels and accidentally got enabled, so the snapshot >> files could be corrupted ones. >> >> Is shmem used a lot in libvirt, or is it even a default configuration? > > No, but it's used with vhost-user applications; like dpdk. > > Dave > >> -- >> Peter Xu >> Yep.