From patchwork Wed Apr 13 11:52:44 2016
X-Patchwork-Submitter: Thomas Huth
X-Patchwork-Id: 8821641
From: Thomas Huth
To: "Michael S. Tsirkin", qemu-devel@nongnu.org
Cc: amit.shah@redhat.com, dgibson@redhat.com, drjones@redhat.com, dgilbert@redhat.com, wehuang@redhat.com
Date: Wed, 13 Apr 2016 13:52:44 +0200
Message-Id: <1460548364-27469-1-git-send-email-thuth@redhat.com>
Subject: [Qemu-devel] [PATCH] hw/virtio/balloon: Fixes for different host page sizes

The balloon code currently calls madvise() with TARGET_PAGE_SIZE as the
length parameter, on an address that is directly based on the page address
supplied by the guest. Since the virtio-balloon protocol always uses
4k-based addresses and sizes, no matter what page size the host and guest
are using, this has a couple of issues that could even lead to data loss
in the worst case: TARGET_PAGE_SIZE might not be 4k, so it is wrong to use
that value for the madvise() call. If TARGET_PAGE_SIZE is bigger than 4k,
we also destroy the 4k areas after the current one - which might be wrong
since the guest did not want to free those areas yet (in case the guest
used a smaller MMU page size than the hard-coded TARGET_PAGE_SIZE).
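The 4k convention comes from the virtio-balloon protocol itself: the guest hands the host page frame numbers that are always expressed in 4k units, i.e. shifted by VIRTIO_BALLOON_PFN_SHIFT (12), regardless of either side's MMU page size. A minimal stand-alone sketch of that conversion (the shift value is taken from the virtio spec; the helper name is just illustrative):

```c
#include <assert.h>
#include <stdint.h>

/* The balloon protocol always talks in 4k units, regardless of the
 * host's or guest's MMU page size. */
#define VIRTIO_BALLOON_PFN_SHIFT 12
#define BALLOON_PAGE_SIZE (1u << VIRTIO_BALLOON_PFN_SHIFT)   /* 4096 */

/* Convert a guest-supplied balloon page frame number into a
 * guest-physical address. */
static uint64_t balloon_pfn_to_addr(uint32_t pfn)
{
    return (uint64_t)pfn << VIRTIO_BALLOON_PFN_SHIFT;
}
```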
So to fix this issue, introduce a proper define called BALLOON_PAGE_SIZE
(which is 4096) and use it as the size parameter for the madvise() call
instead.

Then there is yet another problem: If the host page size is bigger than
the 4k balloon page size, we cannot simply call madvise() on each of the
4k balloon addresses that we get from the guest - since madvise() always
evicts the whole host page, not only a 4k area! So in this case, we have
to track the 4k fragments of a host page and only call madvise(DONTNEED)
once all fragments have been collected. This of course only works well if
the guest sends consecutive 4k fragments - which is the case in the most
important scenarios that I try to address here (like a ppc64 guest with
64k page size running on a ppc64 host with 64k page size). In case the
guest uses a page size that is smaller than the host page size, we might
need some additional logic here later to increase the probability of
being able to release memory, but at least the guest should now not
crash anymore due to unintentionally evicted pages.

Signed-off-by: Thomas Huth
Reviewed-by: Andrew Jones
---
I've tested this patch with both a 4k page size ppc64 guest and a 64k
page size ppc64 guest on a 64k page size ppc64 host. With this patch
applied, I was not able to crash the guests anymore (the 4k guest easily
crashes without this patch). And looking at the output of the "free"
command on the host, the ballooning also still works as expected.
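The fragment-collection scheme described above can be sketched as a stand-alone model. This is not the patch's actual code: it assumes a fixed 64k host page, uses a plain uint32_t bitmap instead of QEMU's set_bit()/find_first_zero_bit() helpers, and counts calls instead of invoking qemu_madvise():

```c
#include <stdint.h>

#define BALLOON_PAGE_SIZE 4096
#define HOST_PAGE_SIZE    (64 * 1024)                          /* assumed */
#define MAX_FRAGS         (HOST_PAGE_SIZE / BALLOON_PAGE_SIZE) /* 16 */
#define NO_CURRENT_PAGE   ((uintptr_t)-1)

static uint32_t fragment_bits;                 /* one bit per 4k fragment */
static uintptr_t current_addr = NO_CURRENT_PAGE;
static int madvise_calls;                      /* stand-in for qemu_madvise() */

/* Collect one 4k balloon page; "evict" the surrounding host page only
 * once all of its 4k fragments have been seen. */
static void balloon_page_dontneed(uintptr_t addr)
{
    uintptr_t aligned = addr & ~((uintptr_t)HOST_PAGE_SIZE - 1);

    if (aligned != current_addr) {       /* new host page: restart tracking */
        fragment_bits = 0;
        current_addr = aligned;
    }
    fragment_bits |= 1u << ((addr - aligned) / BALLOON_PAGE_SIZE);

    if (fragment_bits == (1u << MAX_FRAGS) - 1) {    /* all fragments seen */
        madvise_calls++;   /* real code: qemu_madvise(aligned, ..., DONTNEED) */
        fragment_bits = 0;
        current_addr = NO_CURRENT_PAGE;
    }
}
```

Note how jumping to a different host page before the current one is complete simply discards the partial bitmap, which is why consecutive 4k fragments matter for actually releasing memory.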
 hw/virtio/virtio-balloon.c         | 68 ++++++++++++++++++++++++++++++++++----
 include/hw/virtio/virtio-balloon.h |  3 ++
 2 files changed, 65 insertions(+), 6 deletions(-)

diff --git a/hw/virtio/virtio-balloon.c b/hw/virtio/virtio-balloon.c
index c74101e..886faa8 100644
--- a/hw/virtio/virtio-balloon.c
+++ b/hw/virtio/virtio-balloon.c
@@ -35,13 +35,56 @@
 #include "hw/virtio/virtio-bus.h"
 #include "hw/virtio/virtio-access.h"
 
-static void balloon_page(void *addr, int deflate)
+#define BALLOON_PAGE_SIZE (1 << VIRTIO_BALLOON_PFN_SHIFT)
+#define BALLOON_NO_CURRENT_PAGE ((void *)-1)
+
+static void balloon_page(VirtIOBalloon *s, void *addr, int deflate)
 {
 #if defined(__linux__)
-    if (!qemu_balloon_is_inhibited() && (!kvm_enabled() ||
-                                         kvm_has_sync_mmu())) {
-        qemu_madvise(addr, TARGET_PAGE_SIZE,
-                deflate ? QEMU_MADV_WILLNEED : QEMU_MADV_DONTNEED);
+    size_t host_page_size;
+    void *aligned_addr;
+
+    if (qemu_balloon_is_inhibited() || (kvm_enabled() && !kvm_has_sync_mmu())) {
+        return;
+    }
+
+    host_page_size = getpagesize();
+    if (host_page_size <= BALLOON_PAGE_SIZE) {
+        qemu_madvise(addr, BALLOON_PAGE_SIZE,
+                     deflate ? QEMU_MADV_WILLNEED : QEMU_MADV_DONTNEED);
+        return;
+    }
+
+    /* Host page size > balloon page size ==> use aligned host address */
+    aligned_addr = (void *)((uintptr_t)addr & ~(host_page_size - 1));
+    if (deflate) {
+        /* MADV_WILLNEED is just a hint for the host kernel, the guest should
+         * also be able to use the memory again without this call, so let's
+         * only do it for the first, aligned fragment of a host page and
+         * ignore it for the others.
+         */
+        if (addr == aligned_addr) {
+            qemu_madvise(aligned_addr, host_page_size, QEMU_MADV_WILLNEED);
+        }
+        s->current_addr = BALLOON_NO_CURRENT_PAGE;
+    } else {
+        const int max_frags = host_page_size / BALLOON_PAGE_SIZE;
+        /* If we end up here, that means we want to evict balloon pages, but
+         * the host's page size is bigger than the 4k pages from the balloon.
+         * Since madvise() only works on the granularity of the host page
+         * size, we've got to collect all the 4k fragments from the guest
+         * first before we can issue the MADV_DONTNEED call.
+         */
+        if (aligned_addr != s->current_addr) {
+            memset(s->fragment_bits, 0, s->fragment_bits_size);
+            s->current_addr = aligned_addr;
+        }
+        set_bit((addr - aligned_addr) / BALLOON_PAGE_SIZE, s->fragment_bits);
+        if (find_first_zero_bit(s->fragment_bits, max_frags) == max_frags) {
+            qemu_madvise(aligned_addr, host_page_size, QEMU_MADV_DONTNEED);
+            memset(s->fragment_bits, 0, s->fragment_bits_size);
+            s->current_addr = BALLOON_NO_CURRENT_PAGE;
+        }
     }
 #endif
 }
@@ -240,7 +283,7 @@ static void virtio_balloon_handle_output(VirtIODevice *vdev, VirtQueue *vq)
             /* Using memory_region_get_ram_ptr is bending the rules a bit, but
                should be OK because we only want a single page.  */
             addr = section.offset_within_region;
-            balloon_page(memory_region_get_ram_ptr(section.mr) + addr,
+            balloon_page(s, memory_region_get_ram_ptr(section.mr) + addr,
                          !!(vq == s->dvq));
             memory_region_unref(section.mr);
         }
@@ -455,6 +498,13 @@ static void virtio_balloon_device_realize(DeviceState *dev, Error **errp)
     s->dvq = virtio_add_queue(vdev, 128, virtio_balloon_handle_output);
     s->svq = virtio_add_queue(vdev, 128, virtio_balloon_receive_stats);
 
+    if (getpagesize() > BALLOON_PAGE_SIZE) {
+        s->fragment_bits_size = (getpagesize() / BALLOON_PAGE_SIZE +
+                                 sizeof(long) * 8 - 1) / 8;
+        s->fragment_bits = g_malloc0(s->fragment_bits_size);
+        s->current_addr = BALLOON_NO_CURRENT_PAGE;
+    }
+
     reset_stats(s);
 
     register_savevm(dev, "virtio-balloon", -1, 1,
@@ -466,6 +516,7 @@ static void virtio_balloon_device_unrealize(DeviceState *dev, Error **errp)
     VirtIODevice *vdev = VIRTIO_DEVICE(dev);
     VirtIOBalloon *s = VIRTIO_BALLOON(dev);
 
+    g_free(s->fragment_bits);
     balloon_stats_destroy_timer(s);
     qemu_remove_balloon_handler(s);
     unregister_savevm(dev, "virtio-balloon", s);
@@ -480,6 +531,11 @@ static void virtio_balloon_device_reset(VirtIODevice *vdev)
         g_free(s->stats_vq_elem);
         s->stats_vq_elem = NULL;
     }
+
+    if (s->fragment_bits) {
+        memset(s->fragment_bits, 0, s->fragment_bits_size);
+        s->current_addr = BALLOON_NO_CURRENT_PAGE;
+    }
 }
 
 static void virtio_balloon_instance_init(Object *obj)
diff --git a/include/hw/virtio/virtio-balloon.h b/include/hw/virtio/virtio-balloon.h
index 35f62ac..04b7c0c 100644
--- a/include/hw/virtio/virtio-balloon.h
+++ b/include/hw/virtio/virtio-balloon.h
@@ -43,6 +43,9 @@ typedef struct VirtIOBalloon {
     int64_t stats_last_update;
     int64_t stats_poll_interval;
     uint32_t host_features;
+    void *current_addr;
+    unsigned long *fragment_bits;
+    int fragment_bits_size;
 } VirtIOBalloon;
 
 #endif
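For reference, the fragment_bits_size expression in the realize hook computes the bytes needed for one bit per 4k fragment, rounded up generously by (sizeof(long) * 8 - 1) bits so that whole-long accesses by the bitmap helpers stay inside the allocation. A small check of that arithmetic, assuming a 64k host page (the helper name here is just illustrative):

```c
#include <stddef.h>

#define BALLOON_PAGE_SIZE 4096

/* Same rounding as the patch: bytes for one bit per 4k fragment, padded
 * so whole-long bitmap accesses never run past the allocation. */
static size_t fragment_bits_bytes(size_t host_page_size)
{
    return (host_page_size / BALLOON_PAGE_SIZE + sizeof(long) * 8 - 1) / 8;
}
```

With a 64k host page there are 16 fragments, so on an LP64 host this yields (16 + 63) / 8 = 9 bytes, which comfortably covers the single long the bitmap operations touch.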