Message ID | 1460548364-27469-1-git-send-email-thuth@redhat.com
---|---
State | New, archived
On Wed, Apr 13, 2016 at 01:52:44PM +0200, Thomas Huth wrote:
> The balloon code currently calls madvise() with TARGET_PAGE_SIZE
> as the length parameter, and an address which is directly based on
> the page address supplied by the guest. Since the virtio-balloon
> protocol is always based on 4k addresses/sizes, no matter
> what the host and guest are using as page size, this has a couple
> of issues which could even lead to data loss in the worst case.
>
> TARGET_PAGE_SIZE might not be 4k, so it is wrong to use that
> value for the madvise() call. If TARGET_PAGE_SIZE is bigger than
> 4k, we also destroy the 4k areas after the current one - which
> might be wrong since the guest did not want to free that area yet (in
> case the guest used a smaller MMU page size than the hard-coded
> TARGET_PAGE_SIZE). So to fix this issue, introduce a proper define
> called BALLOON_PAGE_SIZE (which is 4096) and use this as the size
> parameter for the madvise() call instead.
>
> Then, there's yet another problem: If the host page size is bigger
> than the 4k balloon page size, we cannot simply call madvise() on
> each of the 4k balloon addresses that we get from the guest - since
> madvise() always evicts the whole host page, not only a 4k area!
> So in this case, we've got to track the 4k fragments of a host page
> and only call madvise(DONTNEED) when all fragments have been collected.
> This of course only works fine if the guest sends consecutive 4k
> fragments - which is the case in the most important scenarios that
> I try to address here (like a ppc64 guest with 64k page size that
> is running on a ppc64 host with 64k page size). In case the guest
> uses a page size that is smaller than the host page size, we might
> need to add some additional logic here later to increase the
> probability of being able to release memory, but at least the guest
> should now not crash anymore due to unintentionally evicted pages.
>
> Signed-off-by: Thomas Huth <thuth@redhat.com>
> ---
> I've tested this patch with both a 4k page size ppc64 guest
> and a 64k page size ppc64 guest on a 64k page size ppc64 host.
> With this patch applied, I was not able to crash the
> guests anymore (the 4k guest easily crashes without this patch).
> And looking at the output of the "free" command on the host,
> the ballooning also still works as expected.
>
>  hw/virtio/virtio-balloon.c         | 68 ++++++++++++++++++++++++++++++++++----
>  include/hw/virtio/virtio-balloon.h |  3 ++
>  2 files changed, 65 insertions(+), 6 deletions(-)
>
> diff --git a/hw/virtio/virtio-balloon.c b/hw/virtio/virtio-balloon.c
> index c74101e..886faa8 100644
> --- a/hw/virtio/virtio-balloon.c
> +++ b/hw/virtio/virtio-balloon.c
> @@ -35,13 +35,56 @@
>  #include "hw/virtio/virtio-bus.h"
>  #include "hw/virtio/virtio-access.h"
>
> -static void balloon_page(void *addr, int deflate)
> +#define BALLOON_PAGE_SIZE (1 << VIRTIO_BALLOON_PFN_SHIFT)
> +#define BALLOON_NO_CURRENT_PAGE ((void *)-1)
> +
> +static void balloon_page(VirtIOBalloon *s, void *addr, int deflate)
>  {
>  #if defined(__linux__)
> -    if (!qemu_balloon_is_inhibited() && (!kvm_enabled() ||
> -                                         kvm_has_sync_mmu())) {
> -        qemu_madvise(addr, TARGET_PAGE_SIZE,
> -                deflate ? QEMU_MADV_WILLNEED : QEMU_MADV_DONTNEED);
> +    size_t host_page_size;
> +    void *aligned_addr;
> +
> +    if (qemu_balloon_is_inhibited() || (kvm_enabled() && !kvm_has_sync_mmu())) {
> +        return;
> +    }
> +
> +    host_page_size = getpagesize();
> +    if (host_page_size <= BALLOON_PAGE_SIZE) {
> +        qemu_madvise(addr, BALLOON_PAGE_SIZE,
> +                     deflate ? QEMU_MADV_WILLNEED : QEMU_MADV_DONTNEED);
> +        return;
> +    }
> +
> +    /* Host page size > balloon page size ==> use aligned host address */
> +    aligned_addr = (void *)((uintptr_t)addr & ~(host_page_size - 1));
> +    if (deflate) {
> +        /* MADV_WILLNEED is just a hint for the host kernel, the guest should
> +         * also be able to use the memory again without this call, so let's
> +         * only do it for the first, aligned fragment of a host page and
> +         * ignore it for the others.
> +         */
> +        if (addr == aligned_addr) {
> +            qemu_madvise(aligned_addr, host_page_size, QEMU_MADV_WILLNEED);
> +        }
> +        s->current_addr = BALLOON_NO_CURRENT_PAGE;
> +    } else {
> +        const int max_frags = host_page_size / BALLOON_PAGE_SIZE;
> +        /* If we end up here, that means we want to evict balloon pages, but
> +         * the host's page size is bigger than the 4k pages from the balloon.
> +         * Since madvise() only works on the granularity of the host page size,
> +         * we've got to collect all the 4k fragments from the guest first
> +         * before we can issue the MADV_DONTNEED call.
> +         */
> +        if (aligned_addr != s->current_addr) {
> +            memset(s->fragment_bits, 0, s->fragment_bits_size);
> +            s->current_addr = aligned_addr;

So here's where we blow away a partial frag list when the guest
advertises pages in non-sequential order. Obviously not ideal, since in
the worst case we'll never call madvise-dontneed. However, this patch
fixes a problem, and does likely work with normal guest behavior, so
we're probably fine. It might be nice to have some accounting to measure
how often this happens though.

> +        }
> +        set_bit((addr - aligned_addr) / BALLOON_PAGE_SIZE, s->fragment_bits);
> +        if (find_first_zero_bit(s->fragment_bits, max_frags) == max_frags) {
> +            qemu_madvise(aligned_addr, host_page_size, QEMU_MADV_DONTNEED);
> +            memset(s->fragment_bits, 0, s->fragment_bits_size);
> +            s->current_addr = BALLOON_NO_CURRENT_PAGE;
> +        }
>      }
>  #endif
>  }
> @@ -240,7 +283,7 @@ static void virtio_balloon_handle_output(VirtIODevice *vdev, VirtQueue *vq)
>          /* Using memory_region_get_ram_ptr is bending the rules a bit, but
>             should be OK because we only want a single page.  */
>          addr = section.offset_within_region;
> -        balloon_page(memory_region_get_ram_ptr(section.mr) + addr,
> +        balloon_page(s, memory_region_get_ram_ptr(section.mr) + addr,
>                       !!(vq == s->dvq));
>          memory_region_unref(section.mr);
>      }
> @@ -455,6 +498,13 @@ static void virtio_balloon_device_realize(DeviceState *dev, Error **errp)
>      s->dvq = virtio_add_queue(vdev, 128, virtio_balloon_handle_output);
>      s->svq = virtio_add_queue(vdev, 128, virtio_balloon_receive_stats);
>
> +    if (getpagesize() > BALLOON_PAGE_SIZE) {
> +        s->fragment_bits_size = (getpagesize() / BALLOON_PAGE_SIZE
> +                                 + sizeof(long) * 8 - 1) / 8;
> +        s->fragment_bits = g_malloc0(s->fragment_bits_size);
> +        s->current_addr = BALLOON_NO_CURRENT_PAGE;
> +    }
> +
>      reset_stats(s);
>
>      register_savevm(dev, "virtio-balloon", -1, 1,
> @@ -466,6 +516,7 @@ static void virtio_balloon_device_unrealize(DeviceState *dev, Error **errp)
>      VirtIODevice *vdev = VIRTIO_DEVICE(dev);
>      VirtIOBalloon *s = VIRTIO_BALLOON(dev);
>
> +    g_free(s->fragment_bits);
>      balloon_stats_destroy_timer(s);
>      qemu_remove_balloon_handler(s);
>      unregister_savevm(dev, "virtio-balloon", s);
> @@ -480,6 +531,11 @@ static void virtio_balloon_device_reset(VirtIODevice *vdev)
>          g_free(s->stats_vq_elem);
>          s->stats_vq_elem = NULL;
>      }
> +
> +    if (s->fragment_bits) {
> +        memset(s->fragment_bits, 0, s->fragment_bits_size);
> +        s->current_addr = BALLOON_NO_CURRENT_PAGE;
> +    }
>  }
>
>  static void virtio_balloon_instance_init(Object *obj)
> diff --git a/include/hw/virtio/virtio-balloon.h b/include/hw/virtio/virtio-balloon.h
> index 35f62ac..04b7c0c 100644
> --- a/include/hw/virtio/virtio-balloon.h
> +++ b/include/hw/virtio/virtio-balloon.h
> @@ -43,6 +43,9 @@ typedef struct VirtIOBalloon {
>      int64_t stats_last_update;
>      int64_t stats_poll_interval;
>      uint32_t host_features;
> +    void *current_addr;
> +    unsigned long *fragment_bits;
> +    int fragment_bits_size;
>  } VirtIOBalloon;
>
>  #endif
> --
> 1.8.3.1
>

Other than the accounting suggestion which can be taken or left,

Reviewed-by: Andrew Jones <drjones@redhat.com>

drew
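The accounting drew suggests would only need a counter that is bumped
whenever a partially collected host page gets thrown away. A minimal
sketch, assuming a hypothetical frag_list_resets field added to
VirtIOBalloon (not part of the patch above):

    if (aligned_addr != s->current_addr) {
        if (s->current_addr != BALLOON_NO_CURRENT_PAGE) {
            /* Hypothetical counter: how often did the guest move on to a
             * new host page before all 4k fragments of the previous one
             * had been collected? */
            s->frag_list_resets++;
        }
        memset(s->fragment_bits, 0, s->fragment_bits_size);
        s->current_addr = aligned_addr;
    }

Such a counter could then be exposed through a tracepoint or the
existing stats mechanism to see how common non-sequential guests are in
practice.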
On Wed, Apr 13, 2016 at 01:52:44PM +0200, Thomas Huth wrote:
> ...
> TARGET_PAGE_SIZE might not be 4k, so it is wrong to use that
> value for the madvise() call. If TARGET_PAGE_SIZE is bigger than
> 4k, we also destroy the 4k areas after the current one - which
> might be wrong since the guest did not want to free that area yet (in
> case the guest used a smaller MMU page size than the hard-coded
> TARGET_PAGE_SIZE). So to fix this issue, introduce a proper define
> called BALLOON_PAGE_SIZE (which is 4096) and use this as the size
> parameter for the madvise() call instead.

this makes absolute sense.

> Then, there's yet another problem: If the host page size is bigger
> than the 4k balloon page size, we cannot simply call madvise() on
> each of the 4k balloon addresses that we get from the guest - since
> madvise() always evicts the whole host page, not only a 4k area!

Does it really round length up?
Wow, it does:

    len = (len_in + ~PAGE_MASK) & PAGE_MASK;

which seems to be undocumented, but has been there forever.

> So in this case, we've got to track the 4k fragments of a host page
> and only call madvise(DONTNEED) when all fragments have been collected.
> ...
> +    void *current_addr;
> +    unsigned long *fragment_bits;
> +    int fragment_bits_size;
>  } VirtIOBalloon;

It looks like fragment_bits would have to be migrated.
Which is a lot of complexity.
And work arounds for specific guest behaviour are really ugly.
There are patches on-list to maintain a balloon bitmap -
that will enable fixing it cleanly.
How about we just skip madvise if host page size is > balloon
page size, for 2.6?
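The round-up is easy to verify from userspace. A small standalone test
program - a sketch; the effect is only visible on a host with pages
bigger than 4k, e.g. a 64k ppc64 host:

    #include <stdio.h>
    #include <string.h>
    #include <sys/mman.h>
    #include <unistd.h>

    int main(void)
    {
        size_t psize = getpagesize();
        char *p = mmap(NULL, psize, PROT_READ | PROT_WRITE,
                       MAP_PRIVATE | MAP_ANONYMOUS, -1, 0);

        memset(p, 0xab, psize);
        /* Ask to discard only the first 4k of the host page... */
        madvise(p, 4096, MADV_DONTNEED);
        /* ...but the kernel rounds len up to the host page size, so the
         * whole page is dropped and reads back as zeroes. */
        printf("last byte: 0x%02x\n", (unsigned char)p[psize - 1]);
        return 0;
    }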
On 13.04.2016 15:15, Michael S. Tsirkin wrote:
> On Wed, Apr 13, 2016 at 01:52:44PM +0200, Thomas Huth wrote:
>> ...
>> Then, there's yet another problem: If the host page size is bigger
>> than the 4k balloon page size, we cannot simply call madvise() on
>> each of the 4k balloon addresses that we get from the guest - since
>> madvise() always evicts the whole host page, not only a 4k area!
>
> Does it really round length up?
> Wow, it does:
>
>     len = (len_in + ~PAGE_MASK) & PAGE_MASK;
>
> which seems to be undocumented, but has been there forever.

Yes, that's ugly - I also had to take a look at the kernel sources to
understand what this call is supposed to do when being called with a
size < PAGE_SIZE.

> ...
> It looks like fragment_bits would have to be migrated.
> Which is a lot of complexity.

I think we could simply omit this for now. In case somebody migrates the
VM while the ballooning is going on, we'd lose the information for one
host page, so we might miss one madvise(DONTNEED), but I think we could
live with that.

> And work arounds for specific guest behaviour are really ugly.
> There are patches on-list to maintain a balloon bitmap -
> that will enable fixing it cleanly.

Ah, good to know, I wasn't aware of them yet, so that will be a chance
for a really proper final solution, I hope.

> How about we just skip madvise if host page size is > balloon
> page size, for 2.6?

That would mean a regression compared to what we have today. Currently,
the ballooning is working OK for 64k guests on a 64k ppc host - rather
by chance than on purpose, but it's working. The guest is always sending
all the 4k fragments of a 64k page, and QEMU is trying to call madvise()
for every one of them, but the kernel is ignoring madvise() on
non-64k-aligned addresses, so we end up with a situation where the
madvise() frees a whole 64k page which is also declared as free by the
guest.

I think we should either take this patch as it is right now (without
adding extra code for migration) and later update it to the bitmap code
by Jitendra Kolhe, or omit it completely (leaving 4k guests broken) and
fix it properly after the bitmap code has been applied. But disabling
the balloon code for 64k guests on 64k hosts completely does not sound
very appealing to me. What do you think?

 Thomas
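The "ignoring" Thomas describes is madvise() failing with EINVAL for
start addresses that are not host-page-aligned. A quick sketch showing
why, on a 64k host, only one of the sixteen 4k calls for a host page
actually does anything:

    #include <errno.h>
    #include <stdio.h>
    #include <string.h>
    #include <sys/mman.h>
    #include <unistd.h>

    int main(void)
    {
        size_t psize = getpagesize();   /* 65536 on a 64k ppc64 host */
        char *page = mmap(NULL, psize, PROT_READ | PROT_WRITE,
                          MAP_PRIVATE | MAP_ANONYMOUS, -1, 0);
        size_t off;

        for (off = 0; off < psize; off += 4096) {
            int ret = madvise(page + off, 4096, MADV_DONTNEED);
            /* Only off == 0 is page-aligned; every other call fails. */
            printf("off %6zu: ret=%d (%s)\n", off, ret,
                   ret < 0 ? strerror(errno) : "ok");
        }
        return 0;
    }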
On Wed, Apr 13, 2016 at 04:51:49PM +0200, Thomas Huth wrote:
> On 13.04.2016 15:15, Michael S. Tsirkin wrote:
> ...
> I think we should either take this patch as it is right now (without
> adding extra code for migration) and later update it to the bitmap code
> by Jitendra Kolhe, or omit it completely (leaving 4k guests broken) and
> fix it properly after the bitmap code has been applied. But disabling
> the balloon code for 64k guests on 64k hosts completely does not sound
> very appealing to me. What do you think?
>
> Thomas

True. As simple a hack - how about disabling madvise when host page
size > target page size?
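A sketch of the earlier "skip madvise if host page size > balloon page
size" variant of that hack (the guest's real page size turns out to be
hard to determine, as the rest of the thread shows) - an untested idea
reusing the BALLOON_PAGE_SIZE define from Thomas's patch, not a patch
itself:

    static void balloon_page(void *addr, int deflate)
    {
    #if defined(__linux__)
        /* Proposed 2.6 stop-gap: if one madvise() call would cover more
         * than one 4k balloon page, discarding could destroy guest data
         * that the guest has not handed over yet - so do nothing then. */
        if (getpagesize() > BALLOON_PAGE_SIZE) {
            return;
        }
        if (!qemu_balloon_is_inhibited() && (!kvm_enabled() ||
                                             kvm_has_sync_mmu())) {
            qemu_madvise(addr, BALLOON_PAGE_SIZE,
                         deflate ? QEMU_MADV_WILLNEED : QEMU_MADV_DONTNEED);
        }
    #endif
    }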
On 13.04.2016 19:07, Michael S. Tsirkin wrote:
> On Wed, Apr 13, 2016 at 04:51:49PM +0200, Thomas Huth wrote:
>> ...
>> But disabling the balloon code for 64k guests on 64k hosts completely
>> does not sound very appealing to me. What do you think?
>
> True. As simple a hack - how about disabling madvise when host page
> size > target page size?

That could work - but is there a generic way in QEMU to get the current
page size from a guest (since this might differ from TARGET_PAGE_SIZE)?
Or would that mean to pollute the virtio-balloon code with ugly #ifdefs?

 Thomas
On Wed, Apr 13, 2016 at 07:38:12PM +0200, Thomas Huth wrote:
> On 13.04.2016 19:07, Michael S. Tsirkin wrote:
> ...
> That could work - but is there a generic way in QEMU to get the current
> page size from a guest (since this might differ from TARGET_PAGE_SIZE)?
> Or would that mean to pollute the virtio-balloon code with ugly #ifdefs?

let's just use TARGET_PAGE_SIZE, that's the best I can think of.
On 13.04.2016 19:55, Michael S. Tsirkin wrote:
> On Wed, Apr 13, 2016 at 07:38:12PM +0200, Thomas Huth wrote:
>> That could work - but is there a generic way in QEMU to get the current
>> page size from a guest (since this might differ from TARGET_PAGE_SIZE)?
>> Or would that mean to pollute the virtio-balloon code with ugly #ifdefs?
>
> let's just use TARGET_PAGE_SIZE, that's the best I can think of.

That won't work - at least not on ppc: TARGET_PAGE_SIZE is always
defined to 4096 here. The Linux kernel then switches the real page size
during runtime to 65536. So we'd need a way to detect this
automatically...

 Thomas
On Wed, Apr 13, 2016 at 08:11:36PM +0200, Thomas Huth wrote:
> On 13.04.2016 19:55, Michael S. Tsirkin wrote:
> > let's just use TARGET_PAGE_SIZE, that's the best I can think of.
>
> That won't work - at least not on ppc: TARGET_PAGE_SIZE is always
> defined to 4096 here. The Linux kernel then switches the real page size
> during runtime to 65536. So we'd need a way to detect this
> automatically...
>
> Thomas

I see. I don't know how to do this.
If we can't find a quick fix, leave this part around for now then?
Just fix the page size.
On Wed, Apr 13, 2016 at 08:11:36PM +0200, Thomas Huth wrote:
> On 13.04.2016 19:55, Michael S. Tsirkin wrote:
> > let's just use TARGET_PAGE_SIZE, that's the best I can think of.
>
> That won't work - at least not on ppc: TARGET_PAGE_SIZE is always
> defined to 4096 here. The Linux kernel then switches the real page size
> during runtime to 65536. So we'd need a way to detect this
> automatically...

Won't work for ARM either. TARGET_PAGE_SIZE is always 1K. The host will
always have at least 4K, but the guest could have as much as 64K.
There's no non-hacky way (that I know of) to determine what it is.

Thanks,
drew
On Wed, 13 Apr 2016 13:52:44 +0200
Thomas Huth <thuth@redhat.com> wrote:

> The balloon code currently calls madvise() with TARGET_PAGE_SIZE
> as the length parameter, and an address which is directly based on
> the page address supplied by the guest. Since the virtio-balloon
> protocol is always based on 4k addresses/sizes, no matter
> what the host and guest are using as page size, this has a couple
> of issues which could even lead to data loss in the worst case.
> ...
> In case the guest
> uses a page size that is smaller than the host page size, we might
> need to add some additional logic here later to increase the
> probability of being able to release memory, but at least the guest
> should now not crash anymore due to unintentionally evicted pages.

Yeah, this is a serious bug. I think the patch is basically ok, except
for a couple of minor points noted below.

It's a bit more complex than I'd generally like to apply during hard
freeze, but given the seriousness of the problem, I think we should put
this in for qemu-2.6.

> ...
> +    host_page_size = getpagesize();

I think you actually want getrampagesize() here (or rather some wrapper
around that to handle the non-kvm cases properly) - to cover the case
where the VM has been built using hugepages in the host.

For the normal case there's already the qemu_real_host_page_size
global, so you shouldn't need to call getpagesize() yourself.

> +    if (host_page_size <= BALLOON_PAGE_SIZE) {
> +        qemu_madvise(addr, BALLOON_PAGE_SIZE,
> +                     deflate ? QEMU_MADV_WILLNEED : QEMU_MADV_DONTNEED);
> +        return;
> +    }
> ...
> +        if (aligned_addr != s->current_addr) {

I'd suggest a (non-fatal) error_report() here if
s->current_addr != BALLOON_NO_CURRENT_PAGE.

> +            memset(s->fragment_bits, 0, s->fragment_bits_size);
> +            s->current_addr = aligned_addr;
> +        }
> +        set_bit((addr - aligned_addr) / BALLOON_PAGE_SIZE, s->fragment_bits);
> +        if (find_first_zero_bit(s->fragment_bits, max_frags) == max_frags) {

If you invert the sense of the bitmap you can just check for an all-zero
bitmap here, which might be a little simpler.

> +            qemu_madvise(aligned_addr, host_page_size, QEMU_MADV_DONTNEED);
> +            memset(s->fragment_bits, 0, s->fragment_bits_size);
> +            s->current_addr = BALLOON_NO_CURRENT_PAGE;
> +        }
> ...
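A sketch of how the eviction branch could look with both of those
suggestions applied - the inverted bitmap plus the non-fatal
error_report() - using the existing bitmap_fill()/bitmap_empty() helpers
from qemu/bitmap.h; with the inverted sense, a set bit now means
"fragment still missing":

    } else {
        const int max_frags = host_page_size / BALLOON_PAGE_SIZE;
        if (aligned_addr != s->current_addr) {
            if (s->current_addr != BALLOON_NO_CURRENT_PAGE) {
                error_report("virtio-balloon: incomplete fragment list "
                             "for host page %p dropped", s->current_addr);
            }
            /* Start with all fragments marked as missing */
            bitmap_fill(s->fragment_bits, max_frags);
            s->current_addr = aligned_addr;
        }
        clear_bit((addr - aligned_addr) / BALLOON_PAGE_SIZE,
                  s->fragment_bits);
        if (bitmap_empty(s->fragment_bits, max_frags)) {
            qemu_madvise(aligned_addr, host_page_size, QEMU_MADV_DONTNEED);
            s->current_addr = BALLOON_NO_CURRENT_PAGE;
        }
    }

The final memset() from the original patch becomes unnecessary here,
since bitmap_fill() reinitializes the state whenever a new host page is
started.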
On Wed, 13 Apr 2016 16:15:25 +0300 "Michael S. Tsirkin" <mst@redhat.com> wrote: > On Wed, Apr 13, 2016 at 01:52:44PM +0200, Thomas Huth wrote: > > The balloon code currently calls madvise() with TARGET_PAGE_SIZE > > as length parameter, and an address which is directly based on > > the page address supplied by the guest. Since the virtio-balloon > > protocol is always based on 4k based addresses/sizes, no matter > > what the host and guest are using as page size, this has a couple > > of issues which could even lead to data loss in the worst case. > > > > TARGET_PAGE_SIZE might not be 4k, so it is wrong to use that > > value for the madvise() call. If TARGET_PAGE_SIZE is bigger than > > 4k, we also destroy the 4k areas after the current one - which > > might be wrong since the guest did not want free that area yet (in > > case the guest used as smaller MMU page size than the hard-coded > > TARGET_PAGE_SIZE). So to fix this issue, introduce a proper define > > called BALLOON_PAGE_SIZE (which is 4096) to use this as the size > > parameter for the madvise() call instead. > > this makes absolute sense. > > > > Then, there's yet another problem: If the host page size is bigger > > than the 4k balloon page size, we can not simply call madvise() on > > each of the 4k balloon addresses that we get from the guest - since > > the madvise() always evicts the whole host page, not only a 4k area! > > Does it really round length up? > Wow, it does: > len = (len_in + ~PAGE_MASK) & PAGE_MASK; > > which seems to be undocumented, but has been there forever. > > > > So in this case, we've got to track the 4k fragments of a host page > > and only call madvise(DONTNEED) when all fragments have been collected. > > This of course only works fine if the guest sends consecutive 4k > > fragments - which is the case in the most important scenarios that > > I try to address here (like a ppc64 guest with 64k page size that > > is running on a ppc64 host with 64k page size). In case the guest > > uses a page size that is smaller than the host page size, we might > > need to add some more additional logic here later to increase the > > probability of being able to release memory, but at least the guest > > should now not crash anymore due to unintentionally evicted pages. > > > > Signed-off-by: Thomas Huth <thuth@redhat.com> > > --- > > I've tested this patch with both, a 4k page size ppc64 guest > > and a 64k page size ppc64 guest on a 64k page size ppc64 host. > > With this patch applied, I was not able to crash to crash the > > guests anymore (the 4k guest easily crashes without this patch). > > And looking at the output of the "free" command on the host, > > the ballooning also still works as expected. 
> > > > hw/virtio/virtio-balloon.c | 68 ++++++++++++++++++++++++++++++++++---- > > include/hw/virtio/virtio-balloon.h | 3 ++ > > 2 files changed, 65 insertions(+), 6 deletions(-) > > > > diff --git a/hw/virtio/virtio-balloon.c b/hw/virtio/virtio-balloon.c > > index c74101e..886faa8 100644 > > --- a/hw/virtio/virtio-balloon.c > > +++ b/hw/virtio/virtio-balloon.c > > @@ -35,13 +35,56 @@ > > #include "hw/virtio/virtio-bus.h" > > #include "hw/virtio/virtio-access.h" > > > > -static void balloon_page(void *addr, int deflate) > > +#define BALLOON_PAGE_SIZE (1 << VIRTIO_BALLOON_PFN_SHIFT) > > +#define BALLOON_NO_CURRENT_PAGE ((void *)-1) > > + > > +static void balloon_page(VirtIOBalloon *s, void *addr, int deflate) > > { > > #if defined(__linux__) > > - if (!qemu_balloon_is_inhibited() && (!kvm_enabled() || > > - kvm_has_sync_mmu())) { > > - qemu_madvise(addr, TARGET_PAGE_SIZE, > > - deflate ? QEMU_MADV_WILLNEED : QEMU_MADV_DONTNEED); > > + size_t host_page_size; > > + void *aligned_addr; > > + > > + if (qemu_balloon_is_inhibited() || (kvm_enabled() && !kvm_has_sync_mmu())) { > > + return; > > + } > > + > > + host_page_size = getpagesize(); > > + if (host_page_size <= BALLOON_PAGE_SIZE) { > > + qemu_madvise(addr, BALLOON_PAGE_SIZE, > > + deflate ? QEMU_MADV_WILLNEED : QEMU_MADV_DONTNEED); > > + return; > > + } > > + > > + /* Host page size > ballon page size ==> use aligned host address */ > > + aligned_addr = (void *)((uintptr_t)addr & ~(host_page_size - 1)); > > + if (deflate) { > > + /* MADV_WILLNEED is just a hint for the host kernel, the guest should > > + * also be able to use the memory again without this call, so let's > > + * only do it for the first, aligned fragment of a host page and > > + * ignore it for the others. > > + */ > > + if (addr == aligned_addr) { > > + qemu_madvise(aligned_addr, host_page_size, QEMU_MADV_WILLNEED); > > + } > > + s->current_addr = BALLOON_NO_CURRENT_PAGE; > > + } else { > > + const int max_frags = host_page_size / BALLOON_PAGE_SIZE; > > + /* If we end up here, that means we want to evict balloon pages, but > > + * the host's page size is bigger than the 4k pages from the balloon. > > + * Since madvise() only works on the granularity of the host page size, > > + * we've got to collect all the 4k fragments from the guest first > > + * before we can issue the MADV_DONTNEED call. > > + */ > > + if (aligned_addr != s->current_addr) { > > + memset(s->fragment_bits, 0, s->fragment_bits_size); > > + s->current_addr = aligned_addr; > > + } > > + set_bit((addr - aligned_addr) / BALLOON_PAGE_SIZE, s->fragment_bits); > > + if (find_first_zero_bit(s->fragment_bits, max_frags) == max_frags) { > > + qemu_madvise(aligned_addr, host_page_size, QEMU_MADV_DONTNEED); > > + memset(s->fragment_bits, 0, s->fragment_bits_size); > > + s->current_addr = BALLOON_NO_CURRENT_PAGE; > > + } > > } > > #endif > > } > > @@ -240,7 +283,7 @@ static void virtio_balloon_handle_output(VirtIODevice *vdev, VirtQueue *vq) > > /* Using memory_region_get_ram_ptr is bending the rules a bit, but > > should be OK because we only want a single page. 
> >       */
> >      addr = section.offset_within_region;
> > -    balloon_page(memory_region_get_ram_ptr(section.mr) + addr,
> > +    balloon_page(s, memory_region_get_ram_ptr(section.mr) + addr,
> >                   !!(vq == s->dvq));
> >      memory_region_unref(section.mr);
> >  }
> > @@ -455,6 +498,13 @@ static void virtio_balloon_device_realize(DeviceState *dev, Error **errp)
> >      s->dvq = virtio_add_queue(vdev, 128, virtio_balloon_handle_output);
> >      s->svq = virtio_add_queue(vdev, 128, virtio_balloon_receive_stats);
> >
> > +    if (getpagesize() > BALLOON_PAGE_SIZE) {
> > +        s->fragment_bits_size = (getpagesize() / BALLOON_PAGE_SIZE
> > +                                 + sizeof(long) * 8 - 1) / 8;
> > +        s->fragment_bits = g_malloc0(s->fragment_bits_size);
> > +        s->current_addr = BALLOON_NO_CURRENT_PAGE;
> > +    }
> > +
> >      reset_stats(s);
> >
> >      register_savevm(dev, "virtio-balloon", -1, 1,
> > @@ -466,6 +516,7 @@ static void virtio_balloon_device_unrealize(DeviceState *dev, Error **errp)
> >      VirtIODevice *vdev = VIRTIO_DEVICE(dev);
> >      VirtIOBalloon *s = VIRTIO_BALLOON(dev);
> >
> > +    g_free(s->fragment_bits);
> >      balloon_stats_destroy_timer(s);
> >      qemu_remove_balloon_handler(s);
> >      unregister_savevm(dev, "virtio-balloon", s);
> > @@ -480,6 +531,11 @@ static void virtio_balloon_device_reset(VirtIODevice *vdev)
> >          g_free(s->stats_vq_elem);
> >          s->stats_vq_elem = NULL;
> >      }
> > +
> > +    if (s->fragment_bits) {
> > +        memset(s->fragment_bits, 0, s->fragment_bits_size);
> > +        s->current_addr = BALLOON_NO_CURRENT_PAGE;
> > +    }
> >  }
> >
> >  static void virtio_balloon_instance_init(Object *obj)
> > diff --git a/include/hw/virtio/virtio-balloon.h b/include/hw/virtio/virtio-balloon.h
> > index 35f62ac..04b7c0c 100644
> > --- a/include/hw/virtio/virtio-balloon.h
> > +++ b/include/hw/virtio/virtio-balloon.h
> > @@ -43,6 +43,9 @@ typedef struct VirtIOBalloon {
> >      int64_t stats_last_update;
> >      int64_t stats_poll_interval;
> >      uint32_t host_features;
> > +    void *current_addr;
> > +    unsigned long *fragment_bits;
> > +    int fragment_bits_size;
> >  } VirtIOBalloon;
> >
> >  #endif
> > --
> > 1.8.3.1
> >
>
> It looks like fragment_bits would have to be migrated.

Hmm.. do they? If we just ignore migration, isn't the worst that happens that we just keep one extra page allocated?

> Which is a lot of complexity.
> And workarounds for specific guest behaviour are really ugly.
> There are patches on-list to maintain a balloon bitmap - that will enable fixing it cleanly.
> How about we just skip madvise if host page size is > balloon page size, for 2.6?
>
> --
> MST
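Since the round-up behaviour spotted above is the crux of the data-loss scenario, here is a minimal standalone C sketch (Linux-only, not QEMU code, purely illustrative) that makes it observable: a 4k MADV_DONTNEED on a 64k-page host throws away the whole 64k page.

#include <stdio.h>
#include <string.h>
#include <sys/mman.h>
#include <unistd.h>

int main(void)
{
    size_t host_page_size = (size_t)sysconf(_SC_PAGESIZE);
    size_t balloon_page_size = 4096;  /* 1 << VIRTIO_BALLOON_PFN_SHIFT */

    /* What the kernel effectively does with the length argument,
     * i.e. len = (len_in + ~PAGE_MASK) & PAGE_MASK: */
    size_t len = (balloon_page_size + host_page_size - 1) &
                 ~(host_page_size - 1);
    printf("len_in %zu is rounded up to %zu\n", balloon_page_size, len);

    if (host_page_size > balloon_page_size) {
        char *buf = mmap(NULL, host_page_size, PROT_READ | PROT_WRITE,
                         MAP_PRIVATE | MAP_ANONYMOUS, -1, 0);
        if (buf == MAP_FAILED) {
            return 1;
        }
        memset(buf, 0xaa, host_page_size);
        /* Ask to drop only the first 4k fragment... */
        madvise(buf, balloon_page_size, MADV_DONTNEED);
        /* ...but the whole host page is dropped: on a 64k-page host
         * this prints 0x00 instead of 0xaa. */
        printf("byte at offset 8192: 0x%02x\n", (unsigned char)buf[8192]);
        munmap(buf, host_page_size);
    }
    return 0;
}

On a 4k-page host the rounded length equals the requested one and the demonstration is skipped, which is exactly why the bug stays invisible outside 64k-page hosts.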
On Wed, 13 Apr 2016 21:14:44 +0300 "Michael S. Tsirkin" <mst@redhat.com> wrote:

> On Wed, Apr 13, 2016 at 08:11:36PM +0200, Thomas Huth wrote:
> > On 13.04.2016 19:55, Michael S. Tsirkin wrote:
> > > On Wed, Apr 13, 2016 at 07:38:12PM +0200, Thomas Huth wrote:
> > >> On 13.04.2016 19:07, Michael S. Tsirkin wrote:
> > >>> On Wed, Apr 13, 2016 at 04:51:49PM +0200, Thomas Huth wrote:
> > >>>> On 13.04.2016 15:15, Michael S. Tsirkin wrote:
> > >>>>> On Wed, Apr 13, 2016 at 01:52:44PM +0200, Thomas Huth wrote:
> > >> ...
> > >>>>>> Then, there's yet another problem: If the host page size is bigger than the 4k balloon page size, we cannot simply call madvise() on each of the 4k balloon addresses that we get from the guest - since madvise() always evicts the whole host page, not only a 4k area!
> > >>>>>>
> > >>>>>> So in this case, we've got to track the 4k fragments of a host page and only call madvise(DONTNEED) when all fragments have been collected. This of course only works fine if the guest sends consecutive 4k fragments - which is the case in the most important scenarios that I try to address here (like a ppc64 guest with 64k page size that is running on a ppc64 host with 64k page size). In case the guest uses a page size that is smaller than the host page size, we might need to add some additional logic here later to increase the probability of being able to release memory, but at least the guest should not crash anymore due to unintentionally evicted pages.
> > >>>> ...
> > >>>>>>  static void virtio_balloon_instance_init(Object *obj)
> > >>>>>> diff --git a/include/hw/virtio/virtio-balloon.h b/include/hw/virtio/virtio-balloon.h
> > >>>>>> index 35f62ac..04b7c0c 100644
> > >>>>>> --- a/include/hw/virtio/virtio-balloon.h
> > >>>>>> +++ b/include/hw/virtio/virtio-balloon.h
> > >>>>>> @@ -43,6 +43,9 @@ typedef struct VirtIOBalloon {
> > >>>>>>      int64_t stats_last_update;
> > >>>>>>      int64_t stats_poll_interval;
> > >>>>>>      uint32_t host_features;
> > >>>>>> +    void *current_addr;
> > >>>>>> +    unsigned long *fragment_bits;
> > >>>>>> +    int fragment_bits_size;
> > >>>>>>  } VirtIOBalloon;
> > >>>>>>
> > >>>>>>  #endif
> > >>>>>
> > >>>>> It looks like fragment_bits would have to be migrated.
> > >>>>> Which is a lot of complexity.
> > >> ...
> > >>>>> How about we just skip madvise if host page size is > balloon page size, for 2.6?
> > >>>>
> > >>>> That would mean a regression compared to what we have today. Currently, the ballooning is working OK for 64k guests on a 64k ppc host - rather by chance than on purpose, but it's working. The guest is always sending all the 4k fragments of a 64k page, and QEMU is trying to call madvise() for every one of them, but the kernel is ignoring madvise() on non-64k-aligned addresses, so we end up with a situation where the madvise() frees a whole 64k page which is also declared as free by the guest.
> > >>>>
> > >>>> I think we should either take this patch as it is right now (without adding extra code for migration) and later update it to the bitmap code by Jitendra Kolhe, or omit it completely (leaving 4k guests broken) and fix it properly after the bitmap code has been applied. But disabling the balloon code for 64k guests on 64k hosts completely does not sound very appealing to me.
> > >>>> What do you think?
> > >>>
> > >>> True. As a simple hack - how about disabling madvise when host page size > target page size?
> > >>
> > >> That could work - but is there a generic way in QEMU to get the current page size from a guest (since this might differ from TARGET_PAGE_SIZE)? Or would that mean polluting the virtio-balloon code with ugly #ifdefs?
> > >
> > > Let's just use TARGET_PAGE_SIZE, that's the best I can think of.
> >
> > That won't work - at least not on ppc: TARGET_PAGE_SIZE is always defined to 4096 here. The Linux kernel then switches the real page size during runtime to 65536. So we'd need a way to detect this automatically...
> >
> >  Thomas
>
> I see. I don't know how to do this. If we can't find a quick fix, leave this part around for now then? Just fix the page size.

I'm not sure what you're suggesting by "fix the page size". TARGET_PAGE_SIZE remains 4kiB because that's the default hardware page size. However, modern Linux guests always use 64kiB pages in practice. This isn't a global setting; it just means the guest uses 64kiB pages for every mapping it establishes, so there's really no sane way for qemu to determine the "guest page size", because that's not really a well-defined concept. Even having some flag that's cleared if the guest ever makes a < 64kiB mapping won't work, because I think there are a few edge cases where a "64kiB" guest which uses 64kiB pages for all RAM mappings will use 4kiB pages for certain IO mappings.
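One way to read this sub-thread: the patch never actually needs a "guest page size". The only granularity madvise() cares about is the host's, which is available at run time, while TARGET_PAGE_SIZE is fixed at build time. A hedged sketch of that distinction (the helper name is hypothetical; it mirrors the check the patch performs at realize time):

#include <unistd.h>

#define BALLOON_PAGE_SIZE 4096  /* 1 << VIRTIO_BALLOON_PFN_SHIFT */

/* TARGET_PAGE_SIZE is a build-time constant (4096 on ppc64) and says
 * nothing about the MMU page size the guest kernel picked at boot, so
 * the decision to track fragments is based purely on the host: */
static int need_fragment_tracking(void)
{
    return (size_t)sysconf(_SC_PAGESIZE) > BALLOON_PAGE_SIZE;
}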
* Thomas Huth (thuth@redhat.com) wrote:

> That would mean a regression compared to what we have today. Currently, the ballooning is working OK for 64k guests on a 64k ppc host - rather by chance than on purpose, but it's working. The guest is always sending all the 4k fragments of a 64k page, and QEMU is trying to call madvise() for every one of them, but the kernel is ignoring madvise() on non-64k-aligned addresses, so we end up with a situation where the madvise() frees a whole 64k page which is also declared as free by the guest.

I wouldn't worry about migrating your fragment map; but I wonder if it needs to be that complex - does the guest normally do something more sane, like sending the 4k pages in order, so that you've just got to track the last page it tried rather than having a full map? (A sketch of that simpler scheme follows after this message.)

A side question is whether the behaviour that's seen by virtio_balloon_handle_output is always actually the full 64k page; it calls balloon_page once for each message/element - but if all of those elements add back up to the full page, perhaps it makes more sense to reassemble it there?

> I think we should either take this patch as it is right now (without adding extra code for migration) and later update it to the bitmap code by Jitendra Kolhe, or omit it completely (leaving 4k guests broken) and fix it properly after the bitmap code has been applied. But disabling the balloon code for 64k guests on 64k hosts completely does not sound very appealing to me. What do you think?

Yeah, I agree; your existing code should work, and I don't think we should break 64k-on-64k.

Dave

>
>  Thomas
>

--
Dr. David Alan Gilbert / dgilbert@redhat.com / Manchester, UK
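For comparison, a rough sketch of the "track the last page" idea floated above (the helper and its state are hypothetical; the deflate path and the QEMU-specific guards such as balloon inhibition and the KVM sync-MMU check are omitted): remember only where the next in-order 4k fragment must land and count the run, instead of keeping a bitmap.

#include <stdint.h>
#include <sys/mman.h>
#include <unistd.h>

#define BALLOON_PAGE_SIZE 4096

/* Hypothetical per-device state: where the next in-order fragment must
 * land and how many consecutive fragments have arrived so far. */
typedef struct {
    uintptr_t next_expected;
    unsigned nfrags;
} SeqTracker;

static void balloon_page_seq(SeqTracker *t, void *addr)
{
    size_t host_page_size = (size_t)sysconf(_SC_PAGESIZE);
    uintptr_t a = (uintptr_t)addr;
    uintptr_t aligned = a & ~((uintptr_t)host_page_size - 1);

    if (a == aligned) {
        t->nfrags = 1;                 /* first fragment of a host page */
    } else if (a == t->next_expected) {
        t->nfrags++;                   /* run continues in order */
    } else {
        t->nfrags = 0;                 /* out of order: give up on this page */
    }
    t->next_expected = a + BALLOON_PAGE_SIZE;

    if (t->nfrags == host_page_size / BALLOON_PAGE_SIZE) {
        /* All fragments seen in ascending order: drop the host page. */
        madvise((void *)aligned, host_page_size, MADV_DONTNEED);
        t->nfrags = 0;
    }
}

The ordering assumption this bakes in (fragments arriving in ascending order) is exactly what the follow-up below objects to.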
On 14.04.2016 13:47, Dr. David Alan Gilbert wrote:
> * Thomas Huth (thuth@redhat.com) wrote:
>
>> That would mean a regression compared to what we have today. Currently, the ballooning is working OK for 64k guests on a 64k ppc host - rather by chance than on purpose, but it's working. The guest is always sending all the 4k fragments of a 64k page, and QEMU is trying to call madvise() for every one of them, but the kernel is ignoring madvise() on non-64k-aligned addresses, so we end up with a situation where the madvise() frees a whole 64k page which is also declared as free by the guest.
>
> I wouldn't worry about migrating your fragment map; but I wonder if it needs to be that complex - does the guest normally do something more sane, like sending the 4k pages in order, so that you've just got to track the last page it tried rather than having a full map?

That's maybe a little bit easier and might work for well-known Linux guests, but IMHO it's even more of a hack than my approach: If the Linux driver one day is switched to send the pages in the opposite order, or if somebody tries to run a non-well-known (i.e. non-Linux) guest, this does not work at all anymore.

> A side question is whether the behaviour that's seen by virtio_balloon_handle_output is always actually the full 64k page; it calls balloon_page once for each message/element - but if all of those elements add back up to the full page, perhaps it makes more sense to reassemble it there?

That might work for 64k page size guests ... but for 4k guests, I think you'll have a hard time reassembling a page there more easily than with my current approach. Or do you have a clever algorithm in mind that could do the job well there?

 Thomas
* Thomas Huth (thuth@redhat.com) wrote:
> On 14.04.2016 13:47, Dr. David Alan Gilbert wrote:
> > * Thomas Huth (thuth@redhat.com) wrote:
> >
> >> That would mean a regression compared to what we have today. Currently, the ballooning is working OK for 64k guests on a 64k ppc host - rather by chance than on purpose, but it's working. The guest is always sending all the 4k fragments of a 64k page, and QEMU is trying to call madvise() for every one of them, but the kernel is ignoring madvise() on non-64k-aligned addresses, so we end up with a situation where the madvise() frees a whole 64k page which is also declared as free by the guest.
> >
> > I wouldn't worry about migrating your fragment map; but I wonder if it needs to be that complex - does the guest normally do something more sane, like sending the 4k pages in order, so that you've just got to track the last page it tried rather than having a full map?
>
> That's maybe a little bit easier and might work for well-known Linux guests, but IMHO it's even more of a hack than my approach: If the Linux driver one day is switched to send the pages in the opposite order, or if somebody tries to run a non-well-known (i.e. non-Linux) guest, this does not work at all anymore.

True.

> > A side question is whether the behaviour that's seen by virtio_balloon_handle_output is always actually the full 64k page; it calls balloon_page once for each message/element - but if all of those elements add back up to the full page, perhaps it makes more sense to reassemble it there?
>
> That might work for 64k page size guests ... but for 4k guests, I think you'll have a hard time reassembling a page there more easily than with my current approach. Or do you have a clever algorithm in mind that could do the job well there?

No, I didn't; I just have an ulterior motive, which is trying to do as few madvise calls as possible, and while virtio_balloon_handle_output sees potentially quite a few requests at once, balloon_page is stuck down there at the bottom without any idea of whether there are any more coming. (A rough sketch of that kind of batching follows after this message.)

Dave

> Thomas
>
--
Dr. David Alan Gilbert / dgilbert@redhat.com / Manchester, UK
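A rough sketch of what reassembling higher up could look like (entirely hypothetical: handle_pfn_batch and addr_of_pfn stand in for the real virtqueue iteration and the per-page memory_region lookup): coalesce runs of consecutive PFNs from one batch into a single madvise() call, so the bottom layer no longer has to guess whether more fragments are coming.

#include <stddef.h>
#include <stdint.h>
#include <sys/mman.h>

#define BALLOON_PAGE_SIZE 4096

/* Issue one madvise() for a run of consecutive balloon pages; 'base'
 * is the host address of the first page of the run. */
static void balloon_run(void *base, size_t npages, int deflate)
{
    madvise(base, npages * BALLOON_PAGE_SIZE,
            deflate ? MADV_WILLNEED : MADV_DONTNEED);
}

/* Scan one batch of guest PFNs and coalesce consecutive ones.  This
 * assumes a run of consecutive PFNs is also contiguous in host memory,
 * i.e. stays within a single RAM block. */
static void handle_pfn_batch(const uint32_t *pfns, size_t n, int deflate,
                             void *(*addr_of_pfn)(uint32_t))
{
    size_t run_start = 0;

    for (size_t i = 1; i <= n; i++) {
        if (i == n || pfns[i] != pfns[i - 1] + 1) {
            balloon_run(addr_of_pfn(pfns[run_start]), i - run_start, deflate);
            run_start = i;
        }
    }
}

With a 64k host page, a guest that frees one of its 64k pages enqueues sixteen consecutive PFNs, so the whole page would go out as a single 64k madvise() instead of sixteen 4k ones that the kernel rounds up individually.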
On Thu, 14 Apr 2016 19:34:05 +0100 "Dr. David Alan Gilbert" <dgilbert@redhat.com> wrote:

> * Thomas Huth (thuth@redhat.com) wrote:
> > On 14.04.2016 13:47, Dr. David Alan Gilbert wrote:
> > > * Thomas Huth (thuth@redhat.com) wrote:
> > >
> > >> That would mean a regression compared to what we have today. Currently, the ballooning is working OK for 64k guests on a 64k ppc host - rather by chance than on purpose, but it's working. The guest is always sending all the 4k fragments of a 64k page, and QEMU is trying to call madvise() for every one of them, but the kernel is ignoring madvise() on non-64k-aligned addresses, so we end up with a situation where the madvise() frees a whole 64k page which is also declared as free by the guest.
> > >
> > > I wouldn't worry about migrating your fragment map; but I wonder if it needs to be that complex - does the guest normally do something more sane, like sending the 4k pages in order, so that you've just got to track the last page it tried rather than having a full map?
> >
> > That's maybe a little bit easier and might work for well-known Linux guests, but IMHO it's even more of a hack than my approach: If the Linux driver one day is switched to send the pages in the opposite order, or if somebody tries to run a non-well-known (i.e. non-Linux) guest, this does not work at all anymore.
>
> True.

TBH, I'm not sure that basing off the last sub-page ballooned will even be that much easier to implement, or at least to implement in a way that's convincingly correct even for the limited cases it's supposed to work in.

> > > A side question is whether the behaviour that's seen by virtio_balloon_handle_output is always actually the full 64k page; it calls balloon_page once for each message/element - but if all of those elements add back up to the full page, perhaps it makes more sense to reassemble it there?
> >
> > That might work for 64k page size guests ... but for 4k guests, I think you'll have a hard time reassembling a page there more easily than with my current approach. Or do you have a clever algorithm in mind that could do the job well there?
>
> No, I didn't; I just have an ulterior motive, which is trying to do as few madvise calls as possible, and while virtio_balloon_handle_output sees potentially quite a few requests at once, balloon_page is stuck down there at the bottom without any idea of whether there are any more coming.
>
> Dave
>
> > Thomas
> >
> --
> Dr. David Alan Gilbert / dgilbert@redhat.com / Manchester, UK
On 4/13/2016 8:21 PM, Thomas Huth wrote:
> On 13.04.2016 15:15, Michael S. Tsirkin wrote:
>> On Wed, Apr 13, 2016 at 01:52:44PM +0200, Thomas Huth wrote:
>>> The balloon code currently calls madvise() with TARGET_PAGE_SIZE as length parameter, and an address which is directly based on the page address supplied by the guest. Since the virtio-balloon protocol is always based on 4k addresses/sizes, no matter what the host and guest are using as page size, this has a couple of issues which could even lead to data loss in the worst case.
>>>
>>> TARGET_PAGE_SIZE might not be 4k, so it is wrong to use that value for the madvise() call. If TARGET_PAGE_SIZE is bigger than 4k, we also destroy the 4k areas after the current one - which might be wrong since the guest did not want to free that area yet (in case the guest used a smaller MMU page size than the hard-coded TARGET_PAGE_SIZE). So to fix this issue, introduce a proper define called BALLOON_PAGE_SIZE (which is 4096) and use it as the size parameter for the madvise() call instead.
>>
>> This makes absolute sense.
>>
>>> Then, there's yet another problem: If the host page size is bigger than the 4k balloon page size, we cannot simply call madvise() on each of the 4k balloon addresses that we get from the guest - since madvise() always evicts the whole host page, not only a 4k area!
>>
>> Does it really round length up?
>> Wow, it does:
>> len = (len_in + ~PAGE_MASK) & PAGE_MASK;
>>
>> which seems to be undocumented, but has been there forever.
>
> Yes, that's ugly - I also had to take a look at the kernel sources to understand what this call is supposed to do when being called with a size < PAGE_SIZE.
>
>>> So in this case, we've got to track the 4k fragments of a host page and only call madvise(DONTNEED) when all fragments have been collected. This of course only works fine if the guest sends consecutive 4k fragments - which is the case in the most important scenarios that I try to address here (like a ppc64 guest with 64k page size that is running on a ppc64 host with 64k page size). In case the guest uses a page size that is smaller than the host page size, we might need to add some additional logic here later to increase the probability of being able to release memory, but at least the guest should not crash anymore due to unintentionally evicted pages.
> ...
>>>  static void virtio_balloon_instance_init(Object *obj)
>>> diff --git a/include/hw/virtio/virtio-balloon.h b/include/hw/virtio/virtio-balloon.h
>>> index 35f62ac..04b7c0c 100644
>>> --- a/include/hw/virtio/virtio-balloon.h
>>> +++ b/include/hw/virtio/virtio-balloon.h
>>> @@ -43,6 +43,9 @@ typedef struct VirtIOBalloon {
>>>      int64_t stats_last_update;
>>>      int64_t stats_poll_interval;
>>>      uint32_t host_features;
>>> +    void *current_addr;
>>> +    unsigned long *fragment_bits;
>>> +    int fragment_bits_size;
>>>  } VirtIOBalloon;
>>>
>>>  #endif
>>> --
>>> 1.8.3.1
>>
>> It looks like fragment_bits would have to be migrated.
>> Which is a lot of complexity.
>
> I think we could simply omit this for now. In case somebody migrates the VM while the ballooning is going on, we'd lose the information for one host page, so we might miss one madvise(DONTNEED), but I think we could live with that.
>
>> And workarounds for specific guest behaviour are really ugly.
>> There are patches on-list to maintain a balloon bitmap - that will enable fixing it cleanly.
>
> Ah, good to know, I wasn't aware of them yet, so that will be a chance for a really proper final solution, I hope.
>
>> How about we just skip madvise if host page size is > balloon page size, for 2.6?
>
> That would mean a regression compared to what we have today. Currently, the ballooning is working OK for 64k guests on a 64k ppc host - rather by chance than on purpose, but it's working. The guest is always sending all the 4k fragments of a 64k page, and QEMU is trying to call madvise() for every one of them, but the kernel is ignoring madvise() on non-64k-aligned addresses, so we end up with a situation where the madvise() frees a whole 64k page which is also declared as free by the guest.
>
> I think we should either take this patch as it is right now (without adding extra code for migration) and later update it to the bitmap code by Jitendra Kolhe, or omit it completely (leaving 4k guests broken) and fix it properly after the bitmap code has been applied. But disabling the balloon code for 64k guests on 64k hosts completely does not sound very appealing to me. What do you think?
>

Please find the v3 version of the balloon bitmap patch series below, posted on the qemu-devel list:

balloon: maintain bitmap for pages released by guest balloon driver.
balloon: add balloon bitmap migration capability and setup bitmap migration status.
balloon: reset balloon bitmap ramblock size on source and target.
migration: skip scanning and migrating ram pages released by virtio-balloon driver.

- Jitendra

> Thomas
>