=====
This series depends on a kernel series for the virtio-balloon
driver called "add pressure notification via a new virtqueue".
You have to apply that series in your guest kernel to play
with automatic ballooning.
Then, on the QEMU side you can enable automatic ballooning with
the following command-line:
$ qemu [...] -device virtio-balloon,automatic=true
Algorithm
=========
On host pressure:
1. On boot QEMU registers for Linux kernel's vmpressure
event "low". This event is sent by the kernel when it
has started reclaiming memory. For more details, please
read Documentation/cgroups/memory.txt in the kernel's
source
2. When QEMU is notified of host pressure, it first checks
whether the guest is currently in pressure. If it is,
the event is skipped. If the guest is not in pressure,
QEMU asks the guest to inflate its balloon (by 32MB by
default)
NOTE: QEMU will increase num_pages whenever an event
is received and the guest is not in pressure.
This means that if QEMU receives 10 events in
a row, num_pages will have grown by 320MB.
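To make the arithmetic in the NOTE concrete, here is a minimal sketch. It
assumes 4KB balloon pages (a shift of 12, the same value as
VIRTIO_BALLOON_PFN_SHIFT) and the 32MB default step. The helper names
step_pages(), num_pages_after() and pages_to_mb() are made up for this
illustration and are not part of the patch.

```c
#include <assert.h>
#include <stdint.h>

#define PFN_SHIFT  12                    /* assumed: 4KB balloon pages */
#define STEP_BYTES (32u * 1024 * 1024)   /* 32MB default step */

/* Pages added to num_pages for one host pressure event. */
static uint32_t step_pages(void)
{
    return STEP_BYTES >> PFN_SHIFT;
}

/* num_pages after `events` host events in a row with no guest pressure. */
static uint32_t num_pages_after(uint32_t start_pages, uint32_t events)
{
    /* num_pages is 32 bits wide; refuse inputs that would wrap around. */
    assert(events <= (UINT32_MAX - start_pages) / step_pages());
    return start_pages + events * step_pages();
}

/* Convert a page count back to megabytes. */
static uint32_t pages_to_mb(uint32_t pages)
{
    return (uint32_t)(((uint64_t)pages << PFN_SHIFT) >> 20);
}
```

With these assumptions one event adds 8192 pages, and ten events starting
from an empty balloon give 81920 pages, which is the 320MB mentioned above.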
On guest pressure:
1. QEMU is notified by the virtio-balloon driver in the
guest (via message virtqueue) that the guest is under
pressure
2. QEMU checks if there's an inflate going on. If true,
QEMU resets num_pages to the current balloon value so
that the guest stops inflating (IOW, QEMU cancels the
current inflation) and returns
3. If there's no on-going inflate, QEMU asks the guest
to deflate (32MB by default)
4. Every time a guest pressure notification is received,
QEMU sets a hysteresis period of 60 seconds. During
this period the guest is defined to be under pressure
(and inflates will be ignored)
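The host and guest rules above can be sketched as a small standalone state
machine. This is only a model for reasoning about the algorithm, not the
QEMU code: struct balloon_model and the model_*() functions are hypothetical
names, time is passed in explicitly so the hysteresis is easy to exercise,
and no config-change notification is sent to a guest.

```c
#include <assert.h>
#include <stdbool.h>
#include <stdint.h>
#include <time.h>

#define STEP_PAGES      ((32u * 1024 * 1024) >> 12)  /* 32MB in 4KB pages */
#define PRESSURE_PERIOD 60                           /* hysteresis, seconds */

struct balloon_model {
    uint32_t num_pages;           /* balloon size the host is asking for */
    uint32_t cur_size;            /* pages the guest has given up so far */
    time_t last_guest_pressure;   /* time of the last guest notification */
};

static void model_init(struct balloon_model *m, time_t now)
{
    assert(m != NULL);
    m->num_pages = 0;
    m->cur_size = 0;
    /* Start outside the hysteresis window so the first inflate is allowed. */
    m->last_guest_pressure = now - (PRESSURE_PERIOD + 1);
}

static bool model_guest_in_pressure(const struct balloon_model *m, time_t now)
{
    return difftime(now, m->last_guest_pressure) <= PRESSURE_PERIOD;
}

/* Host pressure: skipped while the guest is considered in pressure. */
static void model_host_pressure(struct balloon_model *m, time_t now)
{
    if (model_guest_in_pressure(m, now)) {
        return;
    }
    m->num_pages += STEP_PAGES;
}

/* Guest pressure: cancel an on-going inflate, otherwise deflate one step. */
static void model_guest_pressure(struct balloon_model *m, time_t now)
{
    m->last_guest_pressure = now;
    if (m->num_pages > m->cur_size) {
        m->num_pages = m->cur_size;       /* cancel on-going inflation */
    } else if (m->num_pages >= STEP_PAGES) {
        /* Assumption of this sketch: never let the target wrap below 0. */
        m->num_pages -= STEP_PAGES;
    }
}
```

For example, a host event at t=1000 raises the target to 8192 pages. A guest
notification at t=1010, before anything was inflated, resets it to 0. Host
events are then ignored until more than 60 seconds have passed since t=1010.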
FIXMEs/TODOs
============
- The number of pages to inflate/deflate and the memcg path
are hardcoded. Will add command-line options for them
- The default value of 32MB for inflates/deflates is what
worked for me in my very specific test-case. This is
probably not a good default, but I don't know how to define
a good one
- QEMU registers for vmpressure's level "low" notification.
The guest too will notify QEMU on "low" pressure in the
guest. The "low" notification is sent whenever the kernel
has started reclaiming memory. On the guest side this means
that it will only give free memory to the host. On the host
side this means that a host with lots of large freeable
caches will be considered as being in pressure.
There are two ways to solve this:
1. Register for "medium" pressure instead of low. This
solves the problem above but it adds a different
one: medium is sent when the kernel has started to
swap, so it's a bit too late
2. Add a new event to vmpressure which is between
low and medium. The perfect event would be triggered
before waking up kswapd
- It would be nice (required?) to be able to dynamically
enable/disable automatic ballooning. With this patch you
enable it for the lifetime of the VM
- I think manual ballooning should be disabled when
automatic ballooning is enabled, but this is not done
yet
- This patch probably doesn't build on Windows
Testing
=======
Testing is by far the most difficult aspect of this project for
me. It's been hard to find a good way to measure this work. So
take these results with a grain of salt.
This is my test-case: a 2G host runs two VMs (guest A and
guest B), each with 1.3G of memory. When the VMs are fully
booted (but idle) the host has around 1.2G of free memory.
Then the VMs do the following:
1. Guest A runs ebizzy five times in a row, with a chunk
size of 1MB and the following number of chunks:
1024, 824, 624, 424, 224. IOW, the memory usage of
this VM is going down. Let's call it "vm-down"
2. Guest B runs ebizzy in a similar manner, but it runs
ebizzy with the following number of chunks:
224, 424, 624, 824, 1024. IOW, the memory usage of
this VM is going up. Let's call it "vm-up"
Also, each ebizzy run takes 60 seconds. And the vm-up one
waits 60 seconds before running ebizzy for the first time.
This gives vm-down time to consume most of the host's memory
and release it.
Here are the results. This is an average of three runs. We
measure host swap I/O, QEMU as a host process and performance
info from the guest. Units:
- swap in/out: number of pages swapped
- Elapsed, user, sys: seconds
- total recs: total number of ebizzy records/s. This is
a sum of all ebizzy runs for a VM
vanilla
=======
Host
----
swap in: 36478.66
swap out: 372551.0
QEMU (as a process in the host)
-------------------------------
         Elapsed  user    sys   CPU%  major f.  minor f.   total recs  swap in  swap out
vm-down: 395.42   309.60  3.72  79    2772.66   120046.66  4692.33     0        0
vm-up:   396.40   310     4.04  79    2053.66   208394.33  4684        0        0
Guest (ebizzy run in the guest)
-------------------------------
         total recs  swap in  swap out
vm-down: 4692.33     0        0
vm-up:   4684        0        0
automatic balloon
=================
Host
----
swap in: 2.66
swap out: 8225.33
QEMU (as a process in the host)
-------------------------------
         Elapsed  user    sys   CPU%  major f.  minor f.   total recs  swap in  swap out
vm-down: 387.95   309.66  3.43  80    106.66    29497.33   4710.66     0        0
vm-up:   388.79   310.98  4.35  81    63.66     110307     4704.33     2.67     822.66
Guest (ebizzy run in the guest)
-------------------------------
         total recs  swap in  swap out
vm-down: 4710.66     0        0
vm-up:   4704.33     2.67     822.66
Some conclusions:
- The number of pages swapped in the host and the number of
QEMU's major faults is hugely reduced by automatic balloon
- Elapsed time is also better for the automatic balloon VMs:
vm-down run time was 1.89% lower and vm-up 1.92% lower
- The records/s is about the same for both, which I guess means
automatic balloon is not regressing this
- vm-up did swap a bit, not sure if this is a problem
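For anyone who wants to re-derive the percentages quoted in the conclusions
from the tables, a small helper is enough. The function name pct_lower() is
made up for this illustration.

```c
#include <assert.h>

/* Percentage by which `value` is lower than `baseline`. */
static double pct_lower(double baseline, double value)
{
    assert(baseline > 0.0);
    return (baseline - value) / baseline * 100.0;
}
```

Fed with the Elapsed columns, pct_lower(395.42, 387.95) comes out at about
1.89 for vm-down and pct_lower(396.40, 388.79) at about 1.92 for vm-up. The
same helper applied to host swap out, pct_lower(372551.0, 8225.33), gives
roughly 97.8.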
Now the code, and I think I deserve a coffee after having written
all this stuff...
Signed-off-by: Luiz Capitulino <lcapitulino@redhat.com>
---
hw/virtio/virtio-balloon.c | 180 +++++++++++++++++++++++++++++++++++++
hw/virtio/virtio-pci.c | 5 ++
hw/virtio/virtio-pci.h | 2 +
include/hw/virtio/virtio-balloon.h | 21 ++++-
4 files changed, 207 insertions(+), 1 deletion(-)
@@ -31,6 +31,139 @@
#include "hw/virtio/virtio-bus.h"
+#define LINUX_MEMCG_DEF_PATH "/sys/fs/cgroup/memory"
+#define AUTO_BALLOON_NR_PAGES ((32 * 1024 * 1024) >> VIRTIO_BALLOON_PFN_SHIFT)
+#define AUTO_BALLOON_PRESSURE_PERIOD 60
+
+void virtio_balloon_set_conf(DeviceState *dev, const VirtIOBalloonConf *bconf)
+{
+ VirtIOBalloon *s = VIRTIO_BALLOON(dev);
+ memcpy(&(s->bconf), bconf, sizeof(struct VirtIOBalloonConf));
+}
+
+static bool auto_balloon_enabled_cmdline(const VirtIOBalloon *s)
+{
+ return s->bconf.auto_balloon_enabled;
+}
+
+static bool guest_in_pressure(const VirtIOBalloon *s)
+{
+ time_t t = s->autob_last_guest_pressure;
+ return difftime(time(NULL), t) <= AUTO_BALLOON_PRESSURE_PERIOD;
+}
+
+static void inflate_guest(VirtIOBalloon *s)
+{
+ if (guest_in_pressure(s)) {
+ return;
+ }
+
+ s->num_pages += AUTO_BALLOON_NR_PAGES;
+ virtio_notify_config(VIRTIO_DEVICE(s));
+}
+
+static void deflate_guest(VirtIOBalloon *s)
+{
+ if (!s->autob_cur_size) {
+ return;
+ }
+
+ s->num_pages -= AUTO_BALLOON_NR_PAGES;
+ virtio_notify_config(VIRTIO_DEVICE(s));
+}
+
+static void virtio_balloon_handle_host_pressure(EventNotifier *ev)
+{
+ VirtIOBalloon *s = container_of(ev, VirtIOBalloon, event);
+
+ if (!event_notifier_test_and_clear(ev)) {
+ fprintf(stderr, "virtio-balloon: failed to drain the notify pipe\n");
+ return;
+ }
+
+ inflate_guest(s);
+}
+
+static void register_vmpressure(int cfd, int efd, int lfd, Error **errp)
+{
+ char *p;
+ ssize_t ret;
+
+ p = g_strdup_printf("%d %d low", efd, lfd);
+ ret = write(cfd, p, strlen(p));
+ if (ret < 0) {
+ error_setg_errno(errp, errno, "failed to write to control fd: %d", cfd);
+ } else {
+ g_assert(ret == strlen(p)); /* XXX: this should be always true, right? */
+ }
+
+ g_free(p);
+}
+
+static int open_file_in_dir(const char *dir_path, const char *file, mode_t mode,
+ Error **errp)
+{
+ char *p;
+ int fd;
+
+ p = g_strjoin("/", dir_path, file, NULL);
+ fd = qemu_open(p, mode);
+ if (fd < 0) {
+ error_setg_errno(errp, errno, "can't open '%s'", p);
+ }
+
+ g_free(p);
+ return fd;
+}
+
+static void automatic_balloon_init(VirtIOBalloon *s, const char *memcg_path,
+ Error **errp)
+{
+ Error *local_err = NULL;
+ int ret;
+
+ if (!memcg_path) {
+ memcg_path = LINUX_MEMCG_DEF_PATH;
+ }
+
+ s->lfd = open_file_in_dir(memcg_path, "memory.pressure_level", O_RDONLY,
+ &local_err);
+ if (local_err) {
+ goto out;
+ }
+
+ s->cfd = open_file_in_dir(memcg_path, "cgroup.event_control", O_WRONLY,
+ &local_err);
+ if (local_err) {
+ close(s->lfd);
+ goto out;
+ }
+
+ ret = event_notifier_init(&s->event, false);
+ if (ret < 0) {
+ error_setg_errno(&local_err, -ret, "failed to create event notifier");
+ goto out_err;
+ }
+
+ s->autob_last_guest_pressure = time(NULL) - (AUTO_BALLOON_PRESSURE_PERIOD+1);
+ event_notifier_set_handler(&s->event, virtio_balloon_handle_host_pressure);
+
+ register_vmpressure(s->cfd, event_notifier_get_fd(&s->event), s->lfd,
+ &local_err);
+ if (local_err) {
+ event_notifier_cleanup(&s->event);
+ goto out_err;
+ }
+
+ return;
+
+out_err:
+ close(s->lfd);
+ close(s->cfd);
+out:
+ error_propagate(errp, local_err);
+}
+
static void balloon_page(void *addr, int deflate)
{
#if defined(__linux__)
@@ -178,6 +311,34 @@ static void balloon_stats_set_poll_interval(Object *obj, struct Visitor *v,
balloon_stats_change_timer(s, 0);
}
+static void virtio_balloon_handle_msg(VirtIODevice *vdev, VirtQueue *vq)
+{
+ VirtIOBalloon *dev = VIRTIO_BALLOON(vdev);
+ VirtQueueElement elem;
+
+ while (virtqueue_pop(vq, &elem)) {
+ size_t offset = 0;
+ uint32_t msg;
+
+ while (iov_to_buf(elem.out_sg, elem.out_num, offset, &msg, 4) == 4) {
+ offset += 4;
+ msg = ldl_p(&msg);
+
+ if (msg == VIRTIO_BALLOON_MSG_PRESSURE) {
+ dev->autob_last_guest_pressure = time(NULL);
+ if (dev->num_pages > dev->autob_cur_size) {
+ /* cancel on-going inflation */
+ dev->num_pages = dev->autob_cur_size;
+ } else {
+ deflate_guest(dev);
+ }
+ }
+ }
+ virtqueue_push(vq, &elem, offset);
+ virtio_notify(vdev, vq);
+ }
+}
+
static void virtio_balloon_handle_output(VirtIODevice *vdev, VirtQueue *vq)
{
VirtIOBalloon *s = VIRTIO_BALLOON(vdev);
@@ -206,6 +367,12 @@ static void virtio_balloon_handle_output(VirtIODevice *vdev, VirtQueue *vq)
balloon_page(memory_region_get_ram_ptr(section.mr) + addr,
!!(vq == s->dvq));
memory_region_unref(section.mr);
+
+ if (vq == s->ivq) {
+ s->autob_cur_size++;
+ } else {
+ s->autob_cur_size--;
+ }
}
virtqueue_push(vq, &elem, offset);
@@ -283,6 +450,8 @@ static void virtio_balloon_set_config(VirtIODevice *vdev,
static uint32_t virtio_balloon_get_features(VirtIODevice *vdev, uint32_t f)
{
f |= (1 << VIRTIO_BALLOON_F_STATS_VQ);
+ f |= (1 << VIRTIO_BALLOON_F_MESSAGE_VQ);
+
return f;
}
@@ -341,10 +510,20 @@ static void virtio_balloon_device_realize(DeviceState *dev, Error **errp)
{
VirtIODevice *vdev = VIRTIO_DEVICE(dev);
VirtIOBalloon *s = VIRTIO_BALLOON(dev);
+ Error *local_err = NULL;
int ret;
virtio_init(vdev, "virtio-balloon", VIRTIO_ID_BALLOON, 8);
+ if (auto_balloon_enabled_cmdline(s)) {
+ automatic_balloon_init(s, NULL /* default root memcg path */, &local_err);
+ if (local_err) {
+ virtio_cleanup(VIRTIO_DEVICE(s));
+ error_propagate(errp, local_err);
+ return;
+ }
+ }
+
ret = qemu_add_balloon_handler(virtio_balloon_to_target,
virtio_balloon_stat, s);
@@ -357,6 +536,7 @@ static void virtio_balloon_device_realize(DeviceState *dev, Error **errp)
s->ivq = virtio_add_queue(vdev, 128, virtio_balloon_handle_output);
s->dvq = virtio_add_queue(vdev, 128, virtio_balloon_handle_output);
s->svq = virtio_add_queue(vdev, 128, virtio_balloon_receive_stats);
+ s->mvq = virtio_add_queue(vdev, 128, virtio_balloon_handle_msg);
register_savevm(dev, "virtio-balloon", -1, 1,
virtio_balloon_save, virtio_balloon_load, s);
@@ -1276,6 +1276,9 @@ static void balloon_pci_stats_set_poll_interval(Object *obj, struct Visitor *v,
static Property virtio_balloon_pci_properties[] = {
DEFINE_VIRTIO_COMMON_FEATURES(VirtIOPCIProxy, host_features),
DEFINE_PROP_HEX32("class", VirtIOPCIProxy, class_code, 0),
+#ifdef __linux__
+ DEFINE_PROP_BIT("automatic", VirtIOBalloonPCI, bconf.auto_balloon_enabled, 0, false),
+#endif
DEFINE_PROP_END_OF_LIST(),
};
@@ -1289,6 +1292,8 @@ static int virtio_balloon_pci_init(VirtIOPCIProxy *vpci_dev)
vpci_dev->class_code = PCI_CLASS_OTHERS;
}
+ virtio_balloon_set_conf(vdev, &(dev->bconf));
+
qdev_set_parent_bus(vdev, BUS(&vpci_dev->bus));
if (qdev_init(vdev) < 0) {
return -1;
@@ -144,6 +144,7 @@ struct VirtIOBlkPCI {
struct VirtIOBalloonPCI {
VirtIOPCIProxy parent_obj;
VirtIOBalloon vdev;
+ VirtIOBalloonConf bconf;
};
/*
@@ -156,6 +157,7 @@ struct VirtIOBalloonPCI {
struct VirtIOSerialPCI {
VirtIOPCIProxy parent_obj;
VirtIOSerial vdev;
+ VirtIOBalloonConf bconf;
};
/*
@@ -30,10 +30,19 @@
/* The feature bitmap for virtio balloon */
#define VIRTIO_BALLOON_F_MUST_TELL_HOST 0 /* Tell before reclaiming pages */
#define VIRTIO_BALLOON_F_STATS_VQ 1 /* Memory stats virtqueue */
+#define VIRTIO_BALLOON_F_MESSAGE_VQ 2 /* Message virtqueue */
+
+/* Messages supported by the message virtqueue */
+#define VIRTIO_BALLOON_MSG_PRESSURE 1
/* Size of a PFN in the balloon interface. */
#define VIRTIO_BALLOON_PFN_SHIFT 12
+typedef struct VirtIOBalloonConf
+{
+ uint32_t auto_balloon_enabled;
+} VirtIOBalloonConf;
+
struct virtio_balloon_config
{
/* Number of pages host wants Guest to give up. */
@@ -58,7 +67,7 @@ typedef struct VirtIOBalloonStat {
typedef struct VirtIOBalloon {
VirtIODevice parent_obj;
- VirtQueue *ivq, *dvq, *svq;
+ VirtQueue *ivq, *dvq, *svq, *mvq;
uint32_t num_pages;
uint32_t actual;
uint64_t stats[VIRTIO_BALLOON_S_NR];
@@ -67,6 +76,16 @@ typedef struct VirtIOBalloon {
QEMUTimer *stats_timer;
int64_t stats_last_update;
int64_t stats_poll_interval;
+
+ /* automatic ballooning */
+ int cfd;
+ int lfd;
+ EventNotifier event;
+ uint32_t autob_cur_size;
+ time_t autob_last_guest_pressure;
+ VirtIOBalloonConf bconf;
} VirtIOBalloon;
+void virtio_balloon_set_conf(DeviceState *dev, const VirtIOBalloonConf *bconf);
+
#endif