From patchwork Wed Apr 13 11:52:44 2016
X-Patchwork-Submitter: Thomas Huth
X-Patchwork-Id: 8821641
From: Thomas Huth
To: "Michael S. Tsirkin", qemu-devel@nongnu.org
Cc: amit.shah@redhat.com, dgibson@redhat.com, drjones@redhat.com, dgilbert@redhat.com, wehuang@redhat.com
Date: Wed, 13 Apr 2016 13:52:44 +0200
Message-Id: <1460548364-27469-1-git-send-email-thuth@redhat.com>
Subject: [Qemu-devel] [PATCH] hw/virtio/balloon: Fixes for different host page sizes

The balloon code currently calls madvise() with TARGET_PAGE_SIZE as the
length parameter, on an address that is directly based on the page address
supplied by the guest. Since the virtio-balloon protocol always uses
4k-based addresses and sizes, no matter what page size the host and guest
are using, this has a couple of issues that could even lead to data loss
in the worst case: TARGET_PAGE_SIZE might not be 4k, so it is wrong to use
that value for the madvise() call. If TARGET_PAGE_SIZE is bigger than 4k,
we also destroy the 4k areas after the current one - which might be wrong
since the guest did not want to free those areas yet (in case the guest
used a smaller MMU page size than the hard-coded TARGET_PAGE_SIZE).
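The 4k convention comes from the virtio-balloon protocol itself: the guest hands the host page frame numbers that are always expressed in 4k units, i.e. shifted by VIRTIO_BALLOON_PFN_SHIFT (12), regardless of either side's MMU page size. A minimal stand-alone sketch of that conversion (the shift value is taken from the virtio spec; the helper name is just illustrative):

```c
#include <assert.h>
#include <stdint.h>

/* The balloon protocol always talks in 4k units, regardless of the
 * host's or guest's MMU page size. */
#define VIRTIO_BALLOON_PFN_SHIFT 12
#define BALLOON_PAGE_SIZE (1u << VIRTIO_BALLOON_PFN_SHIFT)   /* 4096 */

/* Convert a guest-supplied balloon page frame number into a
 * guest-physical address. */
static uint64_t balloon_pfn_to_addr(uint32_t pfn)
{
    return (uint64_t)pfn << VIRTIO_BALLOON_PFN_SHIFT;
}
```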
So to fix this issue, introduce a proper define called BALLOON_PAGE_SIZE
(which is 4096) and use it as the size parameter for the madvise() call
instead.

Then there is yet another problem: If the host page size is bigger than
the 4k balloon page size, we cannot simply call madvise() on each of the
4k balloon addresses that we get from the guest - since madvise() always
evicts the whole host page, not only a 4k area! So in this case, we have
to track the 4k fragments of a host page and only call madvise(DONTNEED)
once all fragments have been collected. This of course only works well if
the guest sends consecutive 4k fragments - which is the case in the most
important scenarios that I try to address here (like a ppc64 guest with
64k page size running on a ppc64 host with 64k page size). In case the
guest uses a page size that is smaller than the host page size, we might
need some additional logic here later to increase the probability of
being able to release memory, but at least the guest should now not
crash anymore due to unintentionally evicted pages.

Signed-off-by: Thomas Huth
Reviewed-by: Andrew Jones
---
I've tested this patch with both a 4k page size ppc64 guest and a 64k
page size ppc64 guest on a 64k page size ppc64 host. With this patch
applied, I was not able to crash the guests anymore (the 4k guest easily
crashes without this patch). And looking at the output of the "free"
command on the host, the ballooning also still works as expected.
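The fragment-collection scheme described above can be sketched as a stand-alone model. This is not the patch's actual code: it assumes a fixed 64k host page, uses a plain uint32_t bitmap instead of QEMU's set_bit()/find_first_zero_bit() helpers, and counts calls instead of invoking qemu_madvise():

```c
#include <stdint.h>

#define BALLOON_PAGE_SIZE 4096
#define HOST_PAGE_SIZE    (64 * 1024)                          /* assumed */
#define MAX_FRAGS         (HOST_PAGE_SIZE / BALLOON_PAGE_SIZE) /* 16 */
#define NO_CURRENT_PAGE   ((uintptr_t)-1)

static uint32_t fragment_bits;                 /* one bit per 4k fragment */
static uintptr_t current_addr = NO_CURRENT_PAGE;
static int madvise_calls;                      /* stand-in for qemu_madvise() */

/* Collect one 4k balloon page; "evict" the surrounding host page only
 * once all of its 4k fragments have been seen. */
static void balloon_page_dontneed(uintptr_t addr)
{
    uintptr_t aligned = addr & ~((uintptr_t)HOST_PAGE_SIZE - 1);

    if (aligned != current_addr) {       /* new host page: restart tracking */
        fragment_bits = 0;
        current_addr = aligned;
    }
    fragment_bits |= 1u << ((addr - aligned) / BALLOON_PAGE_SIZE);

    if (fragment_bits == (1u << MAX_FRAGS) - 1) {    /* all fragments seen */
        madvise_calls++;   /* real code: qemu_madvise(aligned, ..., DONTNEED) */
        fragment_bits = 0;
        current_addr = NO_CURRENT_PAGE;
    }
}
```

Note how jumping to a different host page before the current one is complete simply discards the partial bitmap, which is why consecutive 4k fragments matter for actually releasing memory.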
 hw/virtio/virtio-balloon.c         | 68 ++++++++++++++++++++++++++++++++++----
 include/hw/virtio/virtio-balloon.h |  3 ++
 2 files changed, 65 insertions(+), 6 deletions(-)

diff --git a/hw/virtio/virtio-balloon.c b/hw/virtio/virtio-balloon.c
index c74101e..886faa8 100644
--- a/hw/virtio/virtio-balloon.c
+++ b/hw/virtio/virtio-balloon.c
@@ -35,13 +35,56 @@
 #include "hw/virtio/virtio-bus.h"
 #include "hw/virtio/virtio-access.h"
 
-static void balloon_page(void *addr, int deflate)
+#define BALLOON_PAGE_SIZE (1 << VIRTIO_BALLOON_PFN_SHIFT)
+#define BALLOON_NO_CURRENT_PAGE ((void *)-1)
+
+static void balloon_page(VirtIOBalloon *s, void *addr, int deflate)
 {
 #if defined(__linux__)
-    if (!qemu_balloon_is_inhibited() && (!kvm_enabled() ||
-                                         kvm_has_sync_mmu())) {
-        qemu_madvise(addr, TARGET_PAGE_SIZE,
-                deflate ? QEMU_MADV_WILLNEED : QEMU_MADV_DONTNEED);
+    size_t host_page_size;
+    void *aligned_addr;
+
+    if (qemu_balloon_is_inhibited() || (kvm_enabled() && !kvm_has_sync_mmu())) {
+        return;
+    }
+
+    host_page_size = getpagesize();
+    if (host_page_size <= BALLOON_PAGE_SIZE) {
+        qemu_madvise(addr, BALLOON_PAGE_SIZE,
+                     deflate ? QEMU_MADV_WILLNEED : QEMU_MADV_DONTNEED);
+        return;
+    }
+
+    /* Host page size > balloon page size ==> use aligned host address */
+    aligned_addr = (void *)((uintptr_t)addr & ~(host_page_size - 1));
+    if (deflate) {
+        /* MADV_WILLNEED is just a hint for the host kernel, the guest should
+         * also be able to use the memory again without this call, so let's
+         * only do it for the first, aligned fragment of a host page and
+         * ignore it for the others.
+         */
+        if (addr == aligned_addr) {
+            qemu_madvise(aligned_addr, host_page_size, QEMU_MADV_WILLNEED);
+        }
+        s->current_addr = BALLOON_NO_CURRENT_PAGE;
+    } else {
+        const int max_frags = host_page_size / BALLOON_PAGE_SIZE;
+        /* If we end up here, that means we want to evict balloon pages, but
+         * the host's page size is bigger than the 4k pages from the balloon.
+         * Since madvise() only works on the granularity of the host page
+         * size, we've got to collect all the 4k fragments from the guest
+         * first before we can issue the MADV_DONTNEED call.
+         */
+        if (aligned_addr != s->current_addr) {
+            memset(s->fragment_bits, 0, s->fragment_bits_size);
+            s->current_addr = aligned_addr;
+        }
+        set_bit((addr - aligned_addr) / BALLOON_PAGE_SIZE, s->fragment_bits);
+        if (find_first_zero_bit(s->fragment_bits, max_frags) == max_frags) {
+            qemu_madvise(aligned_addr, host_page_size, QEMU_MADV_DONTNEED);
+            memset(s->fragment_bits, 0, s->fragment_bits_size);
+            s->current_addr = BALLOON_NO_CURRENT_PAGE;
+        }
     }
 #endif
 }
@@ -240,7 +283,7 @@ static void virtio_balloon_handle_output(VirtIODevice *vdev, VirtQueue *vq)
             /* Using memory_region_get_ram_ptr is bending the rules a bit, but
                should be OK because we only want a single page.  */
             addr = section.offset_within_region;
-            balloon_page(memory_region_get_ram_ptr(section.mr) + addr,
+            balloon_page(s, memory_region_get_ram_ptr(section.mr) + addr,
                          !!(vq == s->dvq));
             memory_region_unref(section.mr);
         }
@@ -455,6 +498,13 @@ static void virtio_balloon_device_realize(DeviceState *dev, Error **errp)
     s->dvq = virtio_add_queue(vdev, 128, virtio_balloon_handle_output);
     s->svq = virtio_add_queue(vdev, 128, virtio_balloon_receive_stats);
 
+    if (getpagesize() > BALLOON_PAGE_SIZE) {
+        s->fragment_bits_size = (getpagesize() / BALLOON_PAGE_SIZE +
+                                 sizeof(long) * 8 - 1) / 8;
+        s->fragment_bits = g_malloc0(s->fragment_bits_size);
+        s->current_addr = BALLOON_NO_CURRENT_PAGE;
+    }
+
     reset_stats(s);
 
     register_savevm(dev, "virtio-balloon", -1, 1,
@@ -466,6 +516,7 @@ static void virtio_balloon_device_unrealize(DeviceState *dev, Error **errp)
     VirtIODevice *vdev = VIRTIO_DEVICE(dev);
     VirtIOBalloon *s = VIRTIO_BALLOON(dev);
 
+    g_free(s->fragment_bits);
     balloon_stats_destroy_timer(s);
     qemu_remove_balloon_handler(s);
     unregister_savevm(dev, "virtio-balloon", s);
@@ -480,6 +531,11 @@ static void virtio_balloon_device_reset(VirtIODevice *vdev)
         g_free(s->stats_vq_elem);
         s->stats_vq_elem = NULL;
     }
+
+    if (s->fragment_bits) {
+        memset(s->fragment_bits, 0, s->fragment_bits_size);
+        s->current_addr = BALLOON_NO_CURRENT_PAGE;
+    }
 }
 
 static void virtio_balloon_instance_init(Object *obj)
diff --git a/include/hw/virtio/virtio-balloon.h b/include/hw/virtio/virtio-balloon.h
index 35f62ac..04b7c0c 100644
--- a/include/hw/virtio/virtio-balloon.h
+++ b/include/hw/virtio/virtio-balloon.h
@@ -43,6 +43,9 @@ typedef struct VirtIOBalloon {
     int64_t stats_last_update;
     int64_t stats_poll_interval;
     uint32_t host_features;
+    void *current_addr;
+    unsigned long *fragment_bits;
+    int fragment_bits_size;
 } VirtIOBalloon;
 
 #endif
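For reference, the fragment_bits_size expression in the realize hook computes the bytes needed for one bit per 4k fragment, rounded up generously by (sizeof(long) * 8 - 1) bits so that whole-long accesses by the bitmap helpers stay inside the allocation. A small check of that arithmetic, assuming a 64k host page (the helper name here is just illustrative):

```c
#include <stddef.h>

#define BALLOON_PAGE_SIZE 4096

/* Same rounding as the patch: bytes for one bit per 4k fragment, padded
 * so whole-long bitmap accesses never run past the allocation. */
static size_t fragment_bits_bytes(size_t host_page_size)
{
    return (host_page_size / BALLOON_PAGE_SIZE + sizeof(long) * 8 - 1) / 8;
}
```

With a 64k host page there are 16 fragments, so on an LP64 host this yields (16 + 63) / 8 = 9 bytes, which comfortably covers the single long the bitmap operations touch.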