Message ID: 20201126151734.743849-4-andrey.gruzdev@virtuozzo.com (mailing list archive)
State: New, archived
Series: UFFD write-tracking migration/snapshots
On Thu, Nov 26, 2020 at 06:17:31PM +0300, Andrey Gruzdev wrote: > In this particular implementation the same single migration > thread is responsible for both normal linear dirty page > migration and procesing UFFD page fault events. > > Processing write faults includes reading UFFD file descriptor, > finding respective RAM block and saving faulting page to > the migration stream. After page has been saved, write protection > can be removed. Since asynchronous version of qemu_put_buffer() > is expected to be used to save pages, we also have to flush > migraion stream prior to un-protecting saved memory range. > > Write protection is being removed for any previously protected > memory chunk that has hit the migration stream. That's valid > for pages from linear page scan along with write fault pages. Thanks for working on this version, it looks much cleaner. > > Signed-off-by: Andrey Gruzdev <andrey.gruzdev@virtuozzo.com> > --- > migration/ram.c | 155 +++++++++++++++++++++++++++++++++++++++++++++--- > 1 file changed, 147 insertions(+), 8 deletions(-) > > diff --git a/migration/ram.c b/migration/ram.c > index 3adfd1948d..bcdccdaef7 100644 > --- a/migration/ram.c > +++ b/migration/ram.c > @@ -1441,6 +1441,76 @@ static RAMBlock *unqueue_page(RAMState *rs, ram_addr_t *offset) > return block; > } > > +#ifdef CONFIG_LINUX > +/** > + * ram_find_block_by_host_address: find RAM block containing host page > + * > + * Returns pointer to RAMBlock if found, NULL otherwise > + * > + * @rs: current RAM state > + * @page_address: host page address > + */ > +static RAMBlock *ram_find_block_by_host_address(RAMState *rs, hwaddr page_address) Reuse qemu_ram_block_from_host() somehow? > +{ > + RAMBlock *bs = rs->last_seen_block; > + > + do { > + if (page_address >= (hwaddr) bs->host && (page_address + TARGET_PAGE_SIZE) <= > + ((hwaddr) bs->host + bs->max_length)) { > + return bs; > + } > + > + bs = QLIST_NEXT_RCU(bs, next); > + if (!bs) { > + /* Hit the end of the list */ > + bs = QLIST_FIRST_RCU(&ram_list.blocks); > + } > + } while (bs != rs->last_seen_block); > + > + return NULL; > +} > + > +/** > + * poll_fault_page: try to get next UFFD write fault page and, if pending fault > + * is found, return RAM block pointer and page offset > + * > + * Returns pointer to the RAMBlock containing faulting page, > + * NULL if no write faults are pending > + * > + * @rs: current RAM state > + * @offset: page offset from the beginning of the block > + */ > +static RAMBlock *poll_fault_page(RAMState *rs, ram_addr_t *offset) > +{ > + struct uffd_msg uffd_msg; > + hwaddr page_address; > + RAMBlock *bs; > + int res; > + > + if (!migrate_background_snapshot()) { > + return NULL; > + } > + > + res = uffd_read_events(rs->uffdio_fd, &uffd_msg, 1); > + if (res <= 0) { > + return NULL; > + } > + > + page_address = uffd_msg.arg.pagefault.address; > + bs = ram_find_block_by_host_address(rs, page_address); > + if (!bs) { > + /* In case we couldn't find respective block, just unprotect faulting page. */ > + uffd_protect_memory(rs->uffdio_fd, page_address, TARGET_PAGE_SIZE, false); > + error_report("ram_find_block_by_host_address() failed: address=0x%0lx", > + page_address); Looks ok to error_report() instead of assert(), but I'll suggest drop the call to uffd_protect_memory() at least. 
The only reason to not use assert() is because we try our best to avoid crashing the vm, however I really doubt whether uffd_protect_memory() is the right thing to do even if it happens - we may at last try to unprotect some strange pages that we don't even know where it is... > + return NULL; > + } > + > + *offset = (ram_addr_t) (page_address - (hwaddr) bs->host); > + return bs; > +} > +#endif /* CONFIG_LINUX */ > + > /** > * get_queued_page: unqueue a page from the postcopy requests > * > @@ -1480,6 +1550,16 @@ static bool get_queued_page(RAMState *rs, PageSearchStatus *pss) > > } while (block && !dirty); > > +#ifdef CONFIG_LINUX > + if (!block) { > + /* > + * Poll write faults too if background snapshot is enabled; that's > + * when we have vcpus got blocked by the write protected pages. > + */ > + block = poll_fault_page(rs, &offset); > + } > +#endif /* CONFIG_LINUX */ > + > if (block) { > /* > * As soon as we start servicing pages out of order, then we have > @@ -1753,6 +1833,55 @@ static int ram_save_host_page(RAMState *rs, PageSearchStatus *pss, > return pages; > } > > +/** > + * ram_save_host_page_pre: ram_save_host_page() pre-notifier > + * > + * @rs: current RAM state > + * @pss: page-search-status structure > + * @opaque: pointer to receive opaque context value > + */ > +static inline > +void ram_save_host_page_pre(RAMState *rs, PageSearchStatus *pss, void **opaque) > +{ > + *(ram_addr_t *) opaque = pss->page; > +} > + > +/** > + * ram_save_host_page_post: ram_save_host_page() post-notifier > + * > + * @rs: current RAM state > + * @pss: page-search-status structure > + * @opaque: opaque context value > + * @res_override: pointer to the return value of ram_save_host_page(), > + * overwritten in case of an error > + */ > +static void ram_save_host_page_post(RAMState *rs, PageSearchStatus *pss, > + void *opaque, int *res_override) > +{ > + /* Check if page is from UFFD-managed region. */ > + if (pss->block->flags & RAM_UF_WRITEPROTECT) { > +#ifdef CONFIG_LINUX > + ram_addr_t page_from = (ram_addr_t) opaque; > + hwaddr page_address = (hwaddr) pss->block->host + > + ((hwaddr) page_from << TARGET_PAGE_BITS); I feel like most new uses of hwaddr is not correct... As I also commented in the other patch. We should save a lot of castings if switched. > + hwaddr run_length = (hwaddr) (pss->page - page_from + 1) << TARGET_PAGE_BITS; > + int res; > + > + /* Flush async buffers before un-protect. */ > + qemu_fflush(rs->f); > + /* Un-protect memory range. */ > + res = uffd_protect_memory(rs->uffdio_fd, page_address, run_length, false); > + /* We don't want to override existing error from ram_save_host_page(). */ > + if (res < 0 && *res_override >= 0) { > + *res_override = res; What is this used for? If res<0, I thought we should just fail the snapshot. Meanwhile, res_override points to "pages", and then it'll be rewrite to the errno returned by uffd_protect_memory(). Smells strange. Can this ever be triggered anyway? > + } > +#else > + /* Should never happen */ > + qemu_file_set_error(rs->f, -ENOSYS); > +#endif /* CONFIG_LINUX */ > + } > +} > + > /** > * ram_find_and_save_block: finds a dirty page and sends it to f > * > @@ -1779,14 +1908,14 @@ static int ram_find_and_save_block(RAMState *rs, bool last_stage) > return pages; > } > > + if (!rs->last_seen_block) { > + rs->last_seen_block = QLIST_FIRST_RCU(&ram_list.blocks); Why setup the last seen block to be the first if null? 
> + } > + > pss.block = rs->last_seen_block; > pss.page = rs->last_page; > pss.complete_round = false; > > - if (!pss.block) { > - pss.block = QLIST_FIRST_RCU(&ram_list.blocks); > - } > - > do { > again = true; > found = get_queued_page(rs, &pss); > @@ -1797,7 +1926,11 @@ static int ram_find_and_save_block(RAMState *rs, bool last_stage) > } > > if (found) { > + void *opaque; > + > + ram_save_host_page_pre(rs, &pss, &opaque); > pages = ram_save_host_page(rs, &pss, last_stage); > + ram_save_host_page_post(rs, &pss, opaque, &pages); So the pre/post idea is kind of an overkill to me... How about we do the unprotect in ram_save_host_page() in the simple way, like: ram_save_host_page() start_addr = pss->page; do { ... (migrate pages) ... } while (...); if (background_snapshot_enabled()) { unprotect pages within start_addr to pss->page; } ... > } > } while (!pages && again); > > @@ -3864,9 +3997,12 @@ fail: > rs->uffdio_fd = -1; > return -1; > #else > + /* > + * Should never happen since we prohibit 'background-snapshot' > + * capability on non-Linux hosts. Yeah, yeah. So let's drop these irrelevant changes? :) > + */ > rs->uffdio_fd = -1; > - error_setg(&migrate_get_current()->error, > - "Background-snapshot not supported on non-Linux hosts"); > + error_setg(&migrate_get_current()->error, QERR_UNDEFINED_ERROR); > return -1; > #endif /* CONFIG_LINUX */ > } > @@ -3903,8 +4039,11 @@ void ram_write_tracking_stop(void) > uffd_close_fd(rs->uffdio_fd); > rs->uffdio_fd = -1; > #else > - error_setg(&migrate_get_current()->error, > - "Background-snapshot not supported on non-Linux hosts"); > + /* > + * Should never happen since we prohibit 'background-snapshot' > + * capability on non-Linux hosts. > + */ > + error_setg(&migrate_get_current()->error, QERR_UNDEFINED_ERROR); Same here. Thanks, > #endif /* CONFIG_LINUX */ > } > > -- > 2.25.1 >
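A rough sketch of the inline un-protect suggested at the end of the review above, assuming the helper names used elsewhere in this series (migrate_background_snapshot(), uffd_protect_memory(), RAM_UF_WRITEPROTECT) and eliding the existing per-page save loop; this is only an illustration of the idea, not the final patch:

static int ram_save_host_page(RAMState *rs, PageSearchStatus *pss,
                              bool last_stage)
{
    unsigned long start_page = pss->page;
    int pages = 0;
    int res;

    do {
        /* ... existing dirty-page save loop, advancing pss->page ... */
    } while (0 /* more target pages left in this host page */);
    /* As in the current code, pss->page is left at the last page looked at. */

    if (migrate_background_snapshot() &&
        (pss->block->flags & RAM_UF_WRITEPROTECT)) {
        uint64_t start = (uint64_t)(uintptr_t)pss->block->host +
                         ((uint64_t)start_page << TARGET_PAGE_BITS);
        uint64_t length = (uint64_t)(pss->page - start_page + 1)
                          << TARGET_PAGE_BITS;

        /* Flush async buffers before dropping write protection. */
        qemu_fflush(rs->f);
        res = uffd_protect_memory(rs->uffdio_fd, start, length, false);
        if (res < 0) {
            return res;
        }
    }

    return pages;
}

This keeps the un-protect on the same code path that saved the pages, so no pre/post notifiers are needed in ram_find_and_save_block().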
On 28.11.2020 00:49, Peter Xu wrote: > On Thu, Nov 26, 2020 at 06:17:31PM +0300, Andrey Gruzdev wrote: >> In this particular implementation the same single migration >> thread is responsible for both normal linear dirty page >> migration and procesing UFFD page fault events. >> >> Processing write faults includes reading UFFD file descriptor, >> finding respective RAM block and saving faulting page to >> the migration stream. After page has been saved, write protection >> can be removed. Since asynchronous version of qemu_put_buffer() >> is expected to be used to save pages, we also have to flush >> migraion stream prior to un-protecting saved memory range. >> >> Write protection is being removed for any previously protected >> memory chunk that has hit the migration stream. That's valid >> for pages from linear page scan along with write fault pages. > > Thanks for working on this version, it looks much cleaner. > >> >> Signed-off-by: Andrey Gruzdev <andrey.gruzdev@virtuozzo.com> >> --- >> migration/ram.c | 155 +++++++++++++++++++++++++++++++++++++++++++++--- >> 1 file changed, 147 insertions(+), 8 deletions(-) >> >> diff --git a/migration/ram.c b/migration/ram.c >> index 3adfd1948d..bcdccdaef7 100644 >> --- a/migration/ram.c >> +++ b/migration/ram.c >> @@ -1441,6 +1441,76 @@ static RAMBlock *unqueue_page(RAMState *rs, ram_addr_t *offset) >> return block; >> } >> >> +#ifdef CONFIG_LINUX >> +/** >> + * ram_find_block_by_host_address: find RAM block containing host page >> + * >> + * Returns pointer to RAMBlock if found, NULL otherwise >> + * >> + * @rs: current RAM state >> + * @page_address: host page address >> + */ >> +static RAMBlock *ram_find_block_by_host_address(RAMState *rs, hwaddr page_address) > > Reuse qemu_ram_block_from_host() somehow? > Seems not very suitable here, since we use rs->last_seen_block to restart search.. >> +{ >> + RAMBlock *bs = rs->last_seen_block; >> + >> + do { >> + if (page_address >= (hwaddr) bs->host && (page_address + TARGET_PAGE_SIZE) <= >> + ((hwaddr) bs->host + bs->max_length)) { >> + return bs; >> + } >> + >> + bs = QLIST_NEXT_RCU(bs, next); >> + if (!bs) { >> + /* Hit the end of the list */ >> + bs = QLIST_FIRST_RCU(&ram_list.blocks); >> + } >> + } while (bs != rs->last_seen_block); >> + >> + return NULL; >> +} >> + >> +/** >> + * poll_fault_page: try to get next UFFD write fault page and, if pending fault >> + * is found, return RAM block pointer and page offset >> + * >> + * Returns pointer to the RAMBlock containing faulting page, >> + * NULL if no write faults are pending >> + * >> + * @rs: current RAM state >> + * @offset: page offset from the beginning of the block >> + */ >> +static RAMBlock *poll_fault_page(RAMState *rs, ram_addr_t *offset) >> +{ >> + struct uffd_msg uffd_msg; >> + hwaddr page_address; >> + RAMBlock *bs; >> + int res; >> + >> + if (!migrate_background_snapshot()) { >> + return NULL; >> + } >> + >> + res = uffd_read_events(rs->uffdio_fd, &uffd_msg, 1); >> + if (res <= 0) { >> + return NULL; >> + } >> + >> + page_address = uffd_msg.arg.pagefault.address; >> + bs = ram_find_block_by_host_address(rs, page_address); >> + if (!bs) { >> + /* In case we couldn't find respective block, just unprotect faulting page. */ >> + uffd_protect_memory(rs->uffdio_fd, page_address, TARGET_PAGE_SIZE, false); >> + error_report("ram_find_block_by_host_address() failed: address=0x%0lx", >> + page_address); > > Looks ok to error_report() instead of assert(), but I'll suggest drop the call > to uffd_protect_memory() at least. 
The only reason to not use assert() is > because we try our best to avoid crashing the vm, however I really doubt > whether uffd_protect_memory() is the right thing to do even if it happens - we > may at last try to unprotect some strange pages that we don't even know where > it is... > IMHO better to unprotect these strange pages then to leave them protected by UFFD.. To avoid getting VM completely in-operational. At least we know the page generated wr-fault, maybe due to incorrect write-tracking initialization, or RAMBlock somehow has gone. Nevertheless if leave the page as is, VM would certainly lock. Hmm, I wonder about assert(). In QEMU it would do something in release builds? >> + return NULL; >> + } >> + >> + *offset = (ram_addr_t) (page_address - (hwaddr) bs->host); >> + return bs; >> +} >> +#endif /* CONFIG_LINUX */ >> + >> /** >> * get_queued_page: unqueue a page from the postcopy requests >> * >> @@ -1480,6 +1550,16 @@ static bool get_queued_page(RAMState *rs, PageSearchStatus *pss) >> >> } while (block && !dirty); >> >> +#ifdef CONFIG_LINUX >> + if (!block) { >> + /* >> + * Poll write faults too if background snapshot is enabled; that's >> + * when we have vcpus got blocked by the write protected pages. >> + */ >> + block = poll_fault_page(rs, &offset); >> + } >> +#endif /* CONFIG_LINUX */ >> + >> if (block) { >> /* >> * As soon as we start servicing pages out of order, then we have >> @@ -1753,6 +1833,55 @@ static int ram_save_host_page(RAMState *rs, PageSearchStatus *pss, >> return pages; >> } >> >> +/** >> + * ram_save_host_page_pre: ram_save_host_page() pre-notifier >> + * >> + * @rs: current RAM state >> + * @pss: page-search-status structure >> + * @opaque: pointer to receive opaque context value >> + */ >> +static inline >> +void ram_save_host_page_pre(RAMState *rs, PageSearchStatus *pss, void **opaque) >> +{ >> + *(ram_addr_t *) opaque = pss->page; >> +} >> + >> +/** >> + * ram_save_host_page_post: ram_save_host_page() post-notifier >> + * >> + * @rs: current RAM state >> + * @pss: page-search-status structure >> + * @opaque: opaque context value >> + * @res_override: pointer to the return value of ram_save_host_page(), >> + * overwritten in case of an error >> + */ >> +static void ram_save_host_page_post(RAMState *rs, PageSearchStatus *pss, >> + void *opaque, int *res_override) >> +{ >> + /* Check if page is from UFFD-managed region. */ >> + if (pss->block->flags & RAM_UF_WRITEPROTECT) { >> +#ifdef CONFIG_LINUX >> + ram_addr_t page_from = (ram_addr_t) opaque; >> + hwaddr page_address = (hwaddr) pss->block->host + >> + ((hwaddr) page_from << TARGET_PAGE_BITS); > > I feel like most new uses of hwaddr is not correct... As I also commented in > the other patch. We should save a lot of castings if switched. > Need to check. hwaddr is typedef'ed as uint64_t while ram_addr_t as uintptr_t. I any case UFFD interface relies on u64 type. >> + hwaddr run_length = (hwaddr) (pss->page - page_from + 1) << TARGET_PAGE_BITS; >> + int res; >> + >> + /* Flush async buffers before un-protect. */ >> + qemu_fflush(rs->f); >> + /* Un-protect memory range. */ >> + res = uffd_protect_memory(rs->uffdio_fd, page_address, run_length, false); >> + /* We don't want to override existing error from ram_save_host_page(). */ >> + if (res < 0 && *res_override >= 0) { >> + *res_override = res; > > What is this used for? If res<0, I thought we should just fail the snapshot. > > Meanwhile, res_override points to "pages", and then it'll be rewrite to the > errno returned by uffd_protect_memory(). 
Smells strange. > > Can this ever be triggered anyway? > Yes, since "pages" is also for error return, if negative. If we have a problem with un-protecting, promote the error to the loop in ram_find_and_save_block() so it exits early ("pages" guaranteed to be non-zero). In outer routines retcode would go to qemu_set_file_error(). >> + } >> +#else >> + /* Should never happen */ >> + qemu_file_set_error(rs->f, -ENOSYS); >> +#endif /* CONFIG_LINUX */ >> + } >> +} >> + >> /** >> * ram_find_and_save_block: finds a dirty page and sends it to f >> * >> @@ -1779,14 +1908,14 @@ static int ram_find_and_save_block(RAMState *rs, bool last_stage) >> return pages; >> } >> >> + if (!rs->last_seen_block) { >> + rs->last_seen_block = QLIST_FIRST_RCU(&ram_list.blocks); > > Why setup the last seen block to be the first if null? > Because ram_find_block_by_host_address() relies on rs->last_seen_block. Pss is not passed to that routine. >> + } >> + >> pss.block = rs->last_seen_block; >> pss.page = rs->last_page; >> pss.complete_round = false; >> >> - if (!pss.block) { >> - pss.block = QLIST_FIRST_RCU(&ram_list.blocks); >> - } >> - >> do { >> again = true; >> found = get_queued_page(rs, &pss); >> @@ -1797,7 +1926,11 @@ static int ram_find_and_save_block(RAMState *rs, bool last_stage) >> } >> >> if (found) { >> + void *opaque; >> + >> + ram_save_host_page_pre(rs, &pss, &opaque); >> pages = ram_save_host_page(rs, &pss, last_stage); >> + ram_save_host_page_post(rs, &pss, opaque, &pages); > > So the pre/post idea is kind of an overkill to me... > > How about we do the unprotect in ram_save_host_page() in the simple way, like: > > ram_save_host_page() > start_addr = pss->page; > do { > ... > (migrate pages) > ... > } while (...); > if (background_snapshot_enabled()) { > unprotect pages within start_addr to pss->page; > } > ... > Personally I prefer not to modify existing code. May be adding to simple calls would finally look cleaner? >> } >> } while (!pages && again); >> >> @@ -3864,9 +3997,12 @@ fail: >> rs->uffdio_fd = -1; >> return -1; >> #else >> + /* >> + * Should never happen since we prohibit 'background-snapshot' >> + * capability on non-Linux hosts. > > Yeah, yeah. So let's drop these irrelevant changes? :) > Yes. >> + */ >> rs->uffdio_fd = -1; >> - error_setg(&migrate_get_current()->error, >> - "Background-snapshot not supported on non-Linux hosts"); >> + error_setg(&migrate_get_current()->error, QERR_UNDEFINED_ERROR); >> return -1; >> #endif /* CONFIG_LINUX */ >> } >> @@ -3903,8 +4039,11 @@ void ram_write_tracking_stop(void) >> uffd_close_fd(rs->uffdio_fd); >> rs->uffdio_fd = -1; >> #else >> - error_setg(&migrate_get_current()->error, >> - "Background-snapshot not supported on non-Linux hosts"); >> + /* >> + * Should never happen since we prohibit 'background-snapshot' >> + * capability on non-Linux hosts. >> + */ >> + error_setg(&migrate_get_current()->error, QERR_UNDEFINED_ERROR); > > Same here. > > Thanks, > >> #endif /* CONFIG_LINUX */ >> } >> >> -- >> 2.25.1 >> >
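For context on the u64 point above and on what the wrappers do: a minimal, self-contained sketch of the Linux userfaultfd write-protect calls that helpers such as uffd_protect_memory() and uffd_read_events() presumably wrap (kernel 5.7+ with UFFD_FEATURE_PAGEFAULT_FLAG_WP; the function names here are illustrative, not QEMU code):

#include <linux/userfaultfd.h>
#include <sys/ioctl.h>
#include <unistd.h>
#include <stdbool.h>
#include <stdint.h>
#include <errno.h>

/* Toggle write protection on [start, start + len). */
static int uffd_wp_range(int uffd, uint64_t start, uint64_t len, bool wp)
{
    struct uffdio_writeprotect req = {
        .range = { .start = start, .len = len },
        .mode  = wp ? UFFDIO_WRITEPROTECT_MODE_WP : 0,
    };

    return ioctl(uffd, UFFDIO_WRITEPROTECT, &req) ? -errno : 0;
}

/* Fetch one pending write-fault address from an O_NONBLOCK userfaultfd, 0 if none. */
static uint64_t uffd_next_wp_fault(int uffd)
{
    struct uffd_msg msg;

    if (read(uffd, &msg, sizeof(msg)) != sizeof(msg) ||
        msg.event != UFFD_EVENT_PAGEFAULT ||
        !(msg.arg.pagefault.flags & UFFD_PAGEFAULT_FLAG_WP)) {
        return 0;
    }
    /* The kernel reports the faulting address as a __u64. */
    return msg.arg.pagefault.address;
}

The kernel ABI is u64-based throughout, which is why plain uint64_t (rather than hwaddr or ram_addr_t casts) is the natural type on the QEMU side of these calls.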
On Mon, Nov 30, 2020 at 12:14:03AM +0300, Andrey Gruzdev wrote: > > > +#ifdef CONFIG_LINUX > > > +/** > > > + * ram_find_block_by_host_address: find RAM block containing host page > > > + * > > > + * Returns pointer to RAMBlock if found, NULL otherwise > > > + * > > > + * @rs: current RAM state > > > + * @page_address: host page address > > > + */ > > > +static RAMBlock *ram_find_block_by_host_address(RAMState *rs, hwaddr page_address) > > > > Reuse qemu_ram_block_from_host() somehow? > > > > Seems not very suitable here, since we use rs->last_seen_block to restart > search.. The search logic is actually still the same, it's just about which block to start searching (rs->last_seen_block, ram_list.mru_block, or the 1st one). So an internal helper to pass in that information would be nice. Though that'll require rcu lock taken before hand to keep the ramblock ptr valid. No worry - we can keep this and work on top too, I think. [...] > > > +static RAMBlock *poll_fault_page(RAMState *rs, ram_addr_t *offset) > > > +{ > > > + struct uffd_msg uffd_msg; > > > + hwaddr page_address; > > > + RAMBlock *bs; > > > + int res; > > > + > > > + if (!migrate_background_snapshot()) { > > > + return NULL; > > > + } > > > + > > > + res = uffd_read_events(rs->uffdio_fd, &uffd_msg, 1); > > > + if (res <= 0) { > > > + return NULL; > > > + } > > > + > > > + page_address = uffd_msg.arg.pagefault.address; > > > + bs = ram_find_block_by_host_address(rs, page_address); > > > + if (!bs) { > > > + /* In case we couldn't find respective block, just unprotect faulting page. */ > > > + uffd_protect_memory(rs->uffdio_fd, page_address, TARGET_PAGE_SIZE, false); > > > + error_report("ram_find_block_by_host_address() failed: address=0x%0lx", > > > + page_address); > > > > Looks ok to error_report() instead of assert(), but I'll suggest drop the call > > to uffd_protect_memory() at least. The only reason to not use assert() is > > because we try our best to avoid crashing the vm, however I really doubt > > whether uffd_protect_memory() is the right thing to do even if it happens - we > > may at last try to unprotect some strange pages that we don't even know where > > it is... > > > > IMHO better to unprotect these strange pages then to leave them protected by > UFFD.. To avoid getting VM completely in-operational. > At least we know the page generated wr-fault, maybe due to incorrect > write-tracking initialization, or RAMBlock somehow has gone. Nevertheless if > leave the page as is, VM would certainly lock. Yes makes some senes too. However it's definitely a severe programming error, even if the VM is unblocked, it can be in a very weird state... Maybe we just assert? Then we can see how unlucky we'll be. :) > > Hmm, I wonder about assert(). In QEMU it would do something in release > builds? I guess yes, at least for now. Because osdep.h has this: /* * We have a lot of unaudited code that may fail in strange ways, or * even be a security risk during migration, if you disable assertions * at compile-time. You may comment out these safety checks if you * absolutely want to disable assertion overhead, but it is not * supported upstream so the risk is all yours. Meanwhile, please * submit patches to remove any side-effects inside an assertion, or * fixing error handling that should use Error instead of assert. */ #ifdef NDEBUG #error building with NDEBUG is not supported #endif #ifdef G_DISABLE_ASSERT #error building with G_DISABLE_ASSERT is not supported #endif [...] 
> > > +/** > > > + * ram_save_host_page_post: ram_save_host_page() post-notifier > > > + * > > > + * @rs: current RAM state > > > + * @pss: page-search-status structure > > > + * @opaque: opaque context value > > > + * @res_override: pointer to the return value of ram_save_host_page(), > > > + * overwritten in case of an error > > > + */ > > > +static void ram_save_host_page_post(RAMState *rs, PageSearchStatus *pss, > > > + void *opaque, int *res_override) > > > +{ > > > + /* Check if page is from UFFD-managed region. */ > > > + if (pss->block->flags & RAM_UF_WRITEPROTECT) { > > > +#ifdef CONFIG_LINUX > > > + ram_addr_t page_from = (ram_addr_t) opaque; > > > + hwaddr page_address = (hwaddr) pss->block->host + > > > + ((hwaddr) page_from << TARGET_PAGE_BITS); > > > > I feel like most new uses of hwaddr is not correct... As I also commented in > > the other patch. We should save a lot of castings if switched. > > > > Need to check. hwaddr is typedef'ed as uint64_t while ram_addr_t as > uintptr_t. I any case UFFD interface relies on u64 type. For example, page_from should be a host address, we can use unsigned long, or uint64_t, but ram_addr_t is not proper, which is only used in ramblock address space. I think it's fine that e.g. we use common types like uint64_t directly. > > > > + hwaddr run_length = (hwaddr) (pss->page - page_from + 1) << TARGET_PAGE_BITS; > > > + int res; > > > + > > > + /* Flush async buffers before un-protect. */ > > > + qemu_fflush(rs->f); > > > + /* Un-protect memory range. */ > > > + res = uffd_protect_memory(rs->uffdio_fd, page_address, run_length, false); > > > + /* We don't want to override existing error from ram_save_host_page(). */ > > > + if (res < 0 && *res_override >= 0) { > > > + *res_override = res; > > > > What is this used for? If res<0, I thought we should just fail the snapshot. > > > > Meanwhile, res_override points to "pages", and then it'll be rewrite to the > > errno returned by uffd_protect_memory(). Smells strange. > > > > Can this ever be triggered anyway? > > > > Yes, since "pages" is also for error return, if negative. If we have a > problem with un-protecting, promote the error to the loop in > ram_find_and_save_block() so it exits early ("pages" guaranteed to be > non-zero). In outer routines retcode would go to qemu_set_file_error(). Ok I see your point. > > > > + } > > > +#else > > > + /* Should never happen */ > > > + qemu_file_set_error(rs->f, -ENOSYS); > > > +#endif /* CONFIG_LINUX */ > > > + } > > > +} > > > + > > > /** > > > * ram_find_and_save_block: finds a dirty page and sends it to f > > > * > > > @@ -1779,14 +1908,14 @@ static int ram_find_and_save_block(RAMState *rs, bool last_stage) > > > return pages; > > > } > > > + if (!rs->last_seen_block) { > > > + rs->last_seen_block = QLIST_FIRST_RCU(&ram_list.blocks); > > > > Why setup the last seen block to be the first if null? > > > > Because ram_find_block_by_host_address() relies on rs->last_seen_block. > Pss is not passed to that routine. Ok, but it's odd - because that's not the real "last seen block". So now I'm rethinking maybe we could still reuse qemu_ram_block_from_host() by providing a new helper to do the search logic as mentioned above, which could take a pointer of RAMBlock as the 1st ramblock to search. Then ram_find_block_by_host_address() should pass in rs->last_seen_block (which could be NULL - then we'll start with the 1st ramblock), or ram_list.mru_block for qemu_ram_block_from_host(). 
That could be a standalone new patch, then we don't need this slightly strange change. What do you think? > > > > + } > > > + > > > pss.block = rs->last_seen_block; > > > pss.page = rs->last_page; > > > pss.complete_round = false; > > > - if (!pss.block) { > > > - pss.block = QLIST_FIRST_RCU(&ram_list.blocks); > > > - } > > > - > > > do { > > > again = true; > > > found = get_queued_page(rs, &pss); > > > @@ -1797,7 +1926,11 @@ static int ram_find_and_save_block(RAMState *rs, bool last_stage) > > > } > > > if (found) { > > > + void *opaque; > > > + > > > + ram_save_host_page_pre(rs, &pss, &opaque); > > > pages = ram_save_host_page(rs, &pss, last_stage); > > > + ram_save_host_page_post(rs, &pss, opaque, &pages); > > > > So the pre/post idea is kind of an overkill to me... > > > > How about we do the unprotect in ram_save_host_page() in the simple way, like: > > > > ram_save_host_page() > > start_addr = pss->page; > > do { > > ... > > (migrate pages) > > ... > > } while (...); > > if (background_snapshot_enabled()) { > > unprotect pages within start_addr to pss->page; > > } > > ... > > > > Personally I prefer not to modify existing code. May be adding to simple > calls would finally look cleaner? Let's wait a 3rd opinion from Juan/Dave or others. Thanks,
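One possible shape for the shared lookup helper floated above: the same wrap-around search as ram_find_block_by_host_address(), but with the starting block passed in by the caller (rs->last_seen_block, ram_list.mru_block, or NULL for the list head). The name and exact signature are only a sketch, and callers would need to hold the RCU read lock to keep the block pointers valid:

static RAMBlock *ram_block_from_host_addr(void *host_addr, RAMBlock *start)
{
    RAMBlock *block = start ?: QLIST_FIRST_RCU(&ram_list.blocks);
    RAMBlock *first = block;
    uint8_t *addr = host_addr;

    do {
        if (addr >= block->host && addr < block->host + block->max_length) {
            return block;
        }
        block = QLIST_NEXT_RCU(block, next);
        if (!block) {
            /* Hit the end of the list, wrap around to the head. */
            block = QLIST_FIRST_RCU(&ram_list.blocks);
        }
    } while (block != first);

    return NULL;
}

The UFFD fault path and qemu_ram_block_from_host() could then both call this with their preferred starting block, which would also remove the need to pre-seed rs->last_seen_block in ram_find_and_save_block().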
On 30.11.2020 19:32, Peter Xu wrote: > On Mon, Nov 30, 2020 at 12:14:03AM +0300, Andrey Gruzdev wrote: >>>> +#ifdef CONFIG_LINUX >>>> +/** >>>> + * ram_find_block_by_host_address: find RAM block containing host page >>>> + * >>>> + * Returns pointer to RAMBlock if found, NULL otherwise >>>> + * >>>> + * @rs: current RAM state >>>> + * @page_address: host page address >>>> + */ >>>> +static RAMBlock *ram_find_block_by_host_address(RAMState *rs, hwaddr page_address) >>> >>> Reuse qemu_ram_block_from_host() somehow? >>> >> >> Seems not very suitable here, since we use rs->last_seen_block to restart >> search.. > > The search logic is actually still the same, it's just about which block to > start searching (rs->last_seen_block, ram_list.mru_block, or the 1st one). So > an internal helper to pass in that information would be nice. Though that'll > require rcu lock taken before hand to keep the ramblock ptr valid. > > No worry - we can keep this and work on top too, I think. > > [...] > >>>> +static RAMBlock *poll_fault_page(RAMState *rs, ram_addr_t *offset) >>>> +{ >>>> + struct uffd_msg uffd_msg; >>>> + hwaddr page_address; >>>> + RAMBlock *bs; >>>> + int res; >>>> + >>>> + if (!migrate_background_snapshot()) { >>>> + return NULL; >>>> + } >>>> + >>>> + res = uffd_read_events(rs->uffdio_fd, &uffd_msg, 1); >>>> + if (res <= 0) { >>>> + return NULL; >>>> + } >>>> + >>>> + page_address = uffd_msg.arg.pagefault.address; >>>> + bs = ram_find_block_by_host_address(rs, page_address); >>>> + if (!bs) { >>>> + /* In case we couldn't find respective block, just unprotect faulting page. */ >>>> + uffd_protect_memory(rs->uffdio_fd, page_address, TARGET_PAGE_SIZE, false); >>>> + error_report("ram_find_block_by_host_address() failed: address=0x%0lx", >>>> + page_address); >>> >>> Looks ok to error_report() instead of assert(), but I'll suggest drop the call >>> to uffd_protect_memory() at least. The only reason to not use assert() is >>> because we try our best to avoid crashing the vm, however I really doubt >>> whether uffd_protect_memory() is the right thing to do even if it happens - we >>> may at last try to unprotect some strange pages that we don't even know where >>> it is... >>> >> >> IMHO better to unprotect these strange pages then to leave them protected by >> UFFD.. To avoid getting VM completely in-operational. >> At least we know the page generated wr-fault, maybe due to incorrect >> write-tracking initialization, or RAMBlock somehow has gone. Nevertheless if >> leave the page as is, VM would certainly lock. > > Yes makes some senes too. However it's definitely a severe programming error, > even if the VM is unblocked, it can be in a very weird state... > > Maybe we just assert? Then we can see how unlucky we'll be. :) > Yeah, seems assert is a right thing here :) >> >> Hmm, I wonder about assert(). In QEMU it would do something in release >> builds? > > I guess yes, at least for now. Because osdep.h has this: > > /* > * We have a lot of unaudited code that may fail in strange ways, or > * even be a security risk during migration, if you disable assertions > * at compile-time. You may comment out these safety checks if you > * absolutely want to disable assertion overhead, but it is not > * supported upstream so the risk is all yours. Meanwhile, please > * submit patches to remove any side-effects inside an assertion, or > * fixing error handling that should use Error instead of assert. 
> */ > #ifdef NDEBUG > #error building with NDEBUG is not supported > #endif > #ifdef G_DISABLE_ASSERT > #error building with G_DISABLE_ASSERT is not supported > #endif > > [...] > Ooops, didn't know that, thanks for info! >>>> +/** >>>> + * ram_save_host_page_post: ram_save_host_page() post-notifier >>>> + * >>>> + * @rs: current RAM state >>>> + * @pss: page-search-status structure >>>> + * @opaque: opaque context value >>>> + * @res_override: pointer to the return value of ram_save_host_page(), >>>> + * overwritten in case of an error >>>> + */ >>>> +static void ram_save_host_page_post(RAMState *rs, PageSearchStatus *pss, >>>> + void *opaque, int *res_override) >>>> +{ >>>> + /* Check if page is from UFFD-managed region. */ >>>> + if (pss->block->flags & RAM_UF_WRITEPROTECT) { >>>> +#ifdef CONFIG_LINUX >>>> + ram_addr_t page_from = (ram_addr_t) opaque; >>>> + hwaddr page_address = (hwaddr) pss->block->host + >>>> + ((hwaddr) page_from << TARGET_PAGE_BITS); >>> >>> I feel like most new uses of hwaddr is not correct... As I also commented in >>> the other patch. We should save a lot of castings if switched. >>> >> >> Need to check. hwaddr is typedef'ed as uint64_t while ram_addr_t as >> uintptr_t. I any case UFFD interface relies on u64 type. > > For example, page_from should be a host address, we can use unsigned long, or > uint64_t, but ram_addr_t is not proper, which is only used in ramblock address > space. > > I think it's fine that e.g. we use common types like uint64_t directly. > Now I also think we'd better use common types - mostly uint64_t directly, not to confuse anybody with that specific type descriptions. >> >>>> + hwaddr run_length = (hwaddr) (pss->page - page_from + 1) << TARGET_PAGE_BITS; >>>> + int res; >>>> + >>>> + /* Flush async buffers before un-protect. */ >>>> + qemu_fflush(rs->f); >>>> + /* Un-protect memory range. */ >>>> + res = uffd_protect_memory(rs->uffdio_fd, page_address, run_length, false); >>>> + /* We don't want to override existing error from ram_save_host_page(). */ >>>> + if (res < 0 && *res_override >= 0) { >>>> + *res_override = res; >>> >>> What is this used for? If res<0, I thought we should just fail the snapshot. >>> >>> Meanwhile, res_override points to "pages", and then it'll be rewrite to the >>> errno returned by uffd_protect_memory(). Smells strange. >>> >>> Can this ever be triggered anyway? >>> >> >> Yes, since "pages" is also for error return, if negative. If we have a >> problem with un-protecting, promote the error to the loop in >> ram_find_and_save_block() so it exits early ("pages" guaranteed to be >> non-zero). In outer routines retcode would go to qemu_set_file_error(). > > Ok I see your point. > >> >>>> + } >>>> +#else >>>> + /* Should never happen */ >>>> + qemu_file_set_error(rs->f, -ENOSYS); >>>> +#endif /* CONFIG_LINUX */ >>>> + } >>>> +} >>>> + >>>> /** >>>> * ram_find_and_save_block: finds a dirty page and sends it to f >>>> * >>>> @@ -1779,14 +1908,14 @@ static int ram_find_and_save_block(RAMState *rs, bool last_stage) >>>> return pages; >>>> } >>>> + if (!rs->last_seen_block) { >>>> + rs->last_seen_block = QLIST_FIRST_RCU(&ram_list.blocks); >>> >>> Why setup the last seen block to be the first if null? >>> >> >> Because ram_find_block_by_host_address() relies on rs->last_seen_block. >> Pss is not passed to that routine. > > Ok, but it's odd - because that's not the real "last seen block". 
> > So now I'm rethinking maybe we could still reuse qemu_ram_block_from_host() by > providing a new helper to do the search logic as mentioned above, which could > take a pointer of RAMBlock as the 1st ramblock to search. Then > ram_find_block_by_host_address() should pass in rs->last_seen_block (which > could be NULL - then we'll start with the 1st ramblock), or ram_list.mru_block > for qemu_ram_block_from_host(). That could be a standalone new patch, then we > don't need this slightly strange change. What do you think? > Mm, need to think about that.. >> >>>> + } >>>> + >>>> pss.block = rs->last_seen_block; >>>> pss.page = rs->last_page; >>>> pss.complete_round = false; >>>> - if (!pss.block) { >>>> - pss.block = QLIST_FIRST_RCU(&ram_list.blocks); >>>> - } >>>> - >>>> do { >>>> again = true; >>>> found = get_queued_page(rs, &pss); >>>> @@ -1797,7 +1926,11 @@ static int ram_find_and_save_block(RAMState *rs, bool last_stage) >>>> } >>>> if (found) { >>>> + void *opaque; >>>> + >>>> + ram_save_host_page_pre(rs, &pss, &opaque); >>>> pages = ram_save_host_page(rs, &pss, last_stage); >>>> + ram_save_host_page_post(rs, &pss, opaque, &pages); >>> >>> So the pre/post idea is kind of an overkill to me... >>> >>> How about we do the unprotect in ram_save_host_page() in the simple way, like: >>> >>> ram_save_host_page() >>> start_addr = pss->page; >>> do { >>> ... >>> (migrate pages) >>> ... >>> } while (...); >>> if (background_snapshot_enabled()) { >>> unprotect pages within start_addr to pss->page; >>> } >>> ... >>> >> >> Personally I prefer not to modify existing code. May be adding to simple >> calls would finally look cleaner? > > Let's wait a 3rd opinion from Juan/Dave or others. > > Thanks, > Agree. Thanks,
diff --git a/migration/ram.c b/migration/ram.c index 3adfd1948d..bcdccdaef7 100644 --- a/migration/ram.c +++ b/migration/ram.c @@ -1441,6 +1441,76 @@ static RAMBlock *unqueue_page(RAMState *rs, ram_addr_t *offset) return block; } +#ifdef CONFIG_LINUX +/** + * ram_find_block_by_host_address: find RAM block containing host page + * + * Returns pointer to RAMBlock if found, NULL otherwise + * + * @rs: current RAM state + * @page_address: host page address + */ +static RAMBlock *ram_find_block_by_host_address(RAMState *rs, hwaddr page_address) +{ + RAMBlock *bs = rs->last_seen_block; + + do { + if (page_address >= (hwaddr) bs->host && (page_address + TARGET_PAGE_SIZE) <= + ((hwaddr) bs->host + bs->max_length)) { + return bs; + } + + bs = QLIST_NEXT_RCU(bs, next); + if (!bs) { + /* Hit the end of the list */ + bs = QLIST_FIRST_RCU(&ram_list.blocks); + } + } while (bs != rs->last_seen_block); + + return NULL; +} + +/** + * poll_fault_page: try to get next UFFD write fault page and, if pending fault + * is found, return RAM block pointer and page offset + * + * Returns pointer to the RAMBlock containing faulting page, + * NULL if no write faults are pending + * + * @rs: current RAM state + * @offset: page offset from the beginning of the block + */ +static RAMBlock *poll_fault_page(RAMState *rs, ram_addr_t *offset) +{ + struct uffd_msg uffd_msg; + hwaddr page_address; + RAMBlock *bs; + int res; + + if (!migrate_background_snapshot()) { + return NULL; + } + + res = uffd_read_events(rs->uffdio_fd, &uffd_msg, 1); + if (res <= 0) { + return NULL; + } + + page_address = uffd_msg.arg.pagefault.address; + bs = ram_find_block_by_host_address(rs, page_address); + if (!bs) { + /* In case we couldn't find respective block, just unprotect faulting page. */ + uffd_protect_memory(rs->uffdio_fd, page_address, TARGET_PAGE_SIZE, false); + error_report("ram_find_block_by_host_address() failed: address=0x%0lx", + page_address); + return NULL; + } + + *offset = (ram_addr_t) (page_address - (hwaddr) bs->host); + return bs; +} +#endif /* CONFIG_LINUX */ + /** * get_queued_page: unqueue a page from the postcopy requests * @@ -1480,6 +1550,16 @@ static bool get_queued_page(RAMState *rs, PageSearchStatus *pss) } while (block && !dirty); +#ifdef CONFIG_LINUX + if (!block) { + /* + * Poll write faults too if background snapshot is enabled; that's + * when we have vcpus got blocked by the write protected pages. + */ + block = poll_fault_page(rs, &offset); + } +#endif /* CONFIG_LINUX */ + if (block) { /* * As soon as we start servicing pages out of order, then we have @@ -1753,6 +1833,55 @@ static int ram_save_host_page(RAMState *rs, PageSearchStatus *pss, return pages; } +/** + * ram_save_host_page_pre: ram_save_host_page() pre-notifier + * + * @rs: current RAM state + * @pss: page-search-status structure + * @opaque: pointer to receive opaque context value + */ +static inline +void ram_save_host_page_pre(RAMState *rs, PageSearchStatus *pss, void **opaque) +{ + *(ram_addr_t *) opaque = pss->page; +} + +/** + * ram_save_host_page_post: ram_save_host_page() post-notifier + * + * @rs: current RAM state + * @pss: page-search-status structure + * @opaque: opaque context value + * @res_override: pointer to the return value of ram_save_host_page(), + * overwritten in case of an error + */ +static void ram_save_host_page_post(RAMState *rs, PageSearchStatus *pss, + void *opaque, int *res_override) +{ + /* Check if page is from UFFD-managed region. 
*/ + if (pss->block->flags & RAM_UF_WRITEPROTECT) { +#ifdef CONFIG_LINUX + ram_addr_t page_from = (ram_addr_t) opaque; + hwaddr page_address = (hwaddr) pss->block->host + + ((hwaddr) page_from << TARGET_PAGE_BITS); + hwaddr run_length = (hwaddr) (pss->page - page_from + 1) << TARGET_PAGE_BITS; + int res; + + /* Flush async buffers before un-protect. */ + qemu_fflush(rs->f); + /* Un-protect memory range. */ + res = uffd_protect_memory(rs->uffdio_fd, page_address, run_length, false); + /* We don't want to override existing error from ram_save_host_page(). */ + if (res < 0 && *res_override >= 0) { + *res_override = res; + } +#else + /* Should never happen */ + qemu_file_set_error(rs->f, -ENOSYS); +#endif /* CONFIG_LINUX */ + } +} + /** * ram_find_and_save_block: finds a dirty page and sends it to f * @@ -1779,14 +1908,14 @@ static int ram_find_and_save_block(RAMState *rs, bool last_stage) return pages; } + if (!rs->last_seen_block) { + rs->last_seen_block = QLIST_FIRST_RCU(&ram_list.blocks); + } + pss.block = rs->last_seen_block; pss.page = rs->last_page; pss.complete_round = false; - if (!pss.block) { - pss.block = QLIST_FIRST_RCU(&ram_list.blocks); - } - do { again = true; found = get_queued_page(rs, &pss); @@ -1797,7 +1926,11 @@ static int ram_find_and_save_block(RAMState *rs, bool last_stage) } if (found) { + void *opaque; + + ram_save_host_page_pre(rs, &pss, &opaque); pages = ram_save_host_page(rs, &pss, last_stage); + ram_save_host_page_post(rs, &pss, opaque, &pages); } } while (!pages && again); @@ -3864,9 +3997,12 @@ fail: rs->uffdio_fd = -1; return -1; #else + /* + * Should never happen since we prohibit 'background-snapshot' + * capability on non-Linux hosts. + */ rs->uffdio_fd = -1; - error_setg(&migrate_get_current()->error, - "Background-snapshot not supported on non-Linux hosts"); + error_setg(&migrate_get_current()->error, QERR_UNDEFINED_ERROR); return -1; #endif /* CONFIG_LINUX */ } @@ -3903,8 +4039,11 @@ void ram_write_tracking_stop(void) uffd_close_fd(rs->uffdio_fd); rs->uffdio_fd = -1; #else - error_setg(&migrate_get_current()->error, - "Background-snapshot not supported on non-Linux hosts"); + /* + * Should never happen since we prohibit 'background-snapshot' + * capability on non-Linux hosts. + */ + error_setg(&migrate_get_current()->error, QERR_UNDEFINED_ERROR); #endif /* CONFIG_LINUX */ }
In this particular implementation the same single migration thread is responsible for both normal linear dirty page migration and processing UFFD page fault events.

Processing write faults includes reading the UFFD file descriptor, finding the respective RAM block and saving the faulting page to the migration stream. After the page has been saved, write protection can be removed. Since the asynchronous version of qemu_put_buffer() is expected to be used to save pages, we also have to flush the migration stream prior to un-protecting the saved memory range.

Write protection is removed for any previously protected memory chunk that has hit the migration stream. That's valid for pages from the linear page scan along with write fault pages.

Signed-off-by: Andrey Gruzdev <andrey.gruzdev@virtuozzo.com>
---
migration/ram.c | 155 +++++++++++++++++++++++++++++++++++++++++++++---
1 file changed, 147 insertions(+), 8 deletions(-)