
[v21,2/5,RESEND] virtio-balloon: VIRTIO_BALLOON_F_SG

Message ID 1515501687-7874-1-git-send-email-wei.w.wang@intel.com (mailing list archive)
State New, archived

Commit Message

Wang, Wei W Jan. 9, 2018, 12:41 p.m. UTC
Add a new feature, VIRTIO_BALLOON_F_SG, which enables transferring balloon
(i.e. inflated/deflated) pages to the host using scatter-gather lists.

The previous virtio-balloon implementation is not very efficient, because
balloon pages are transferred to the host one array at a time. Here is the
breakdown of the time, in percentage, spent on each step of the balloon
inflating process (inflating 7GB of an 8GB idle guest):

1) allocating pages (6.5%)
2) sending PFNs to host (68.3%)
3) address translation (6.1%)
4) madvise (19%)

It takes about 4126ms for the inflating process to complete. The above
profiling shows that the bottlenecks are steps 2) and 4) (roughly 2818ms
and 784ms of the 4126ms, respectively).

This patch optimizes step 2) by transferring pages to the host in sgs. An sg
describes a chunk of physically contiguous guest pages. With this
mechanism, step 4) can also be optimized by doing address translation and
madvise() in chunks rather than page by page.
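
As an illustration only (a sketch, not part of the patch; the helper name is
hypothetical), a physically contiguous run of pages can be described by a
single sg entry instead of one PFN array slot per page:

----------------------------------------
#include <linux/scatterlist.h>
#include <linux/mm.h>

/*
 * Sketch: describe nr_pages physically contiguous pages starting at
 * pfn with one sg entry.  sg lengths are 32-bit, so callers must cap
 * nr_pages << PAGE_SHIFT at round_down(UINT_MAX, PAGE_SIZE), as the
 * patch below does in tell_host_sgs().
 */
static void chunk_to_sg(struct scatterlist *sg, unsigned long pfn,
			unsigned int nr_pages)
{
	sg_init_table(sg, 1);
	sg_set_page(sg, pfn_to_page(pfn), nr_pages << PAGE_SHIFT, 0);
}
----------------------------------------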

With this new feature, the above ballooning process takes ~460ms, an
improvement of ~89%.

TODO:
- optimize step 1) by allocating/freeing a chunk of pages instead of a
single page each time.
- sort the internal balloon page queue.
- enable OOM to free inflated pages maintained in the local temporary
  list.

Signed-off-by: Wei Wang <wei.w.wang@intel.com>
Signed-off-by: Liang Li <liang.z.li@intel.com>
Suggested-by: Michael S. Tsirkin <mst@redhat.com>
Cc: Tetsuo Handa <penguin-kernel@I-love.SAKURA.ne.jp>
---
 drivers/virtio/virtio_balloon.c     | 233 +++++++++++++++++++++++++++++++++---
 include/uapi/linux/virtio_balloon.h |   1 +
 2 files changed, 216 insertions(+), 18 deletions(-)

RESEND ChangeLog:
- fill_balloon(): move xb_set_page() before balloon_page_enqueue()

Comments

Tetsuo Handa Jan. 9, 2018, 2:42 p.m. UTC | #1
Wei Wang wrote:
> - enable OOM to free inflated pages maintained in the local temporary
>   list.

I do want to see it before applying this patch.

Please carefully check how the xbitmap implementation works, and you will
find that you are adding a lot of redundant operations, along with a bug.
Wang, Wei W Jan. 10, 2018, 10:26 a.m. UTC | #2
On 01/09/2018 10:42 PM, Tetsuo Handa wrote:
> Wei Wang wrote:
>> - enable OOM to free inflated pages maintained in the local temporary
>>    list.
> I do want to see it before applying this patch.


Fine, then what do you think of the method I shared in reply to your post here:
https://patchwork.kernel.org/patch/10140731/

Michael, could we merge patch 3-5 first?


>
> Please carefully check how the xbitmap implementation works, and you will
> find that you are adding a lot of redundant operations, along with a bug.

This version mainly added some test cases, and it passes the test run
without any issue. I would appreciate it if your comments could be more
specific; that would make the discussion more effective. For example, I
deliberately added "xb_find_set(xb1, 2, ULONG_MAX - 3)" for the overflow
test. I'm not sure if that is the "bug" you referred to, but I'm glad to
hear your thoughts.
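
For concreteness, here is a minimal sketch of that overflow-style check,
using the xb_* calls the same way the patch below uses them (the test
function name and exact assertions are assumptions of this sketch, not the
actual test code):

----------------------------------------
#include <linux/xbitmap.h>

/*
 * Hypothetical check: a bit set near ULONG_MAX must be found without
 * the search wrapping past its end boundary.
 */
static int xbitmap_check_overflow_sketch(struct xb *xb1)
{
	unsigned long bit = 2;

	if (xb_preload(GFP_KERNEL) < 0)
		return -ENOMEM;
	WARN_ON(xb_set_bit(xb1, ULONG_MAX - 4));
	xb_preload_end();

	/* Search [2, ULONG_MAX - 3]; the bit at ULONG_MAX - 4 is inside. */
	WARN_ON(!xb_find_set(xb1, ULONG_MAX - 3, &bit));
	WARN_ON(bit != ULONG_MAX - 4);

	xb_zero(xb1, 0, ULONG_MAX);
	return 0;
}
----------------------------------------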

I agree that some tests may be repeated to some degree, since we test the
implementation from different aspects. For example,
xbitmap_check_bit_range() may have already exercised xb_zero(), while we
specifically have another xbitmap_check_zero_bits() which may test
something that has already been covered when checking bit ranges. But I
think testing twice is better than an omission.
Also, I kept the "Regualr test1: node=NULL" case. Though the new
implementation doesn't explicitly use "node" as before, it is still a radix
tree internally, and that node=NULL case looks special to me, so it seems
worth covering in the test.

You are also welcome to send a patch to remove the redundant ones if you
think that's an issue. Thanks.

Best,
Wei
Tetsuo Handa Jan. 11, 2018, 11:06 a.m. UTC | #3
Wei Wang wrote:
> Michael, could we merge patch 3-5 first?

No! I'm repeatedly asking you to propose only VIRTIO_BALLOON_F_SG changes.
Please don't ignore me.



Patch 4 depends on patch 2. Thus, back to patch 2.

Your patch switches between tell_host_sgs() and tell_host() based on
VIRTIO_BALLOON_F_SG:

----------------------------------------
+	if (vb->num_pfns) {
+		if (use_sg)
+			tell_host_sgs(vb, vb->inflate_vq, pfn_min, pfn_max);
+		else
+			tell_host(vb, vb->inflate_vq);
+	}
----------------------------------------

tell_host() uses the sg_init_one()/virtqueue_add_outbuf() sequence for
telling the host:

----------------------------------------
static void tell_host(struct virtio_balloon *vb, struct virtqueue *vq)
{
	struct scatterlist sg;
	unsigned int len;

	sg_init_one(&sg, vb->pfns, sizeof(vb->pfns[0]) * vb->num_pfns);

	/* We should always be able to add one buffer to an empty queue. */
	virtqueue_add_outbuf(vq, &sg, 1, vb, GFP_KERNEL);
	virtqueue_kick(vq);

	/* When host has read buffer, this completes via balloon_ack */
	wait_event(vb->acked, virtqueue_get_buf(vq, &len));

}
----------------------------------------

while add_one_sg(), called from batch_balloon_page_sg() from tell_host_sgs(),
uses the sg_set_page()/virtqueue_add_inbuf() sequence for telling the host.
Why does the direction become opposite (inbuf versus outbuf)?

----------------------------------------
+static void kick_and_wait(struct virtqueue *vq, wait_queue_head_t wq_head)
+{
+	unsigned int len;
+
+	virtqueue_kick(vq);
+	wait_event(wq_head, virtqueue_get_buf(vq, &len));
+}
+
+static void add_one_sg(struct virtqueue *vq, unsigned long pfn, uint32_t len)
+{
+	struct scatterlist sg;
+	unsigned int unused;
+	int err;
+
+	sg_init_table(&sg, 1);
+	sg_set_page(&sg, pfn_to_page(pfn), len, 0);
+
+	/* Detach all the used buffers from the vq */
+	while (virtqueue_get_buf(vq, &unused))
+		;
+
+	err = virtqueue_add_inbuf(vq, &sg, 1, vq, GFP_KERNEL);
+	/*
+	 * This is expected to never fail: there is always at least 1 entry
+	 * available on the vq, because when the vq is full the worker thread
+	 * that adds the sg will be put into sleep until at least 1 entry is
+	 * available to use.
+	 */
+	BUG_ON(err);
+}
+
+static void batch_balloon_page_sg(struct virtio_balloon *vb,
+				  struct virtqueue *vq,
+				  unsigned long pfn,
+				  uint32_t len)
+{
+	add_one_sg(vq, pfn, len);
+
+	/* Batch till the vq is full */
+	if (!vq->num_free)
+		kick_and_wait(vq, vb->acked);
+}
+
+/*
+ * Send balloon pages in sgs to host. The balloon pages are recorded in the
+ * page xbitmap. Each bit in the bitmap corresponds to a page of PAGE_SIZE.
+ * The page xbitmap is searched for continuous "1" bits, which correspond
+ * to continuous pages, to chunk into sgs.
+ *
+ * @page_xb_start and @page_xb_end form the range of bits in the xbitmap that
+ * need to be searched.
+ */
+static void tell_host_sgs(struct virtio_balloon *vb,
+			  struct virtqueue *vq,
+			  unsigned long page_xb_start,
+			  unsigned long page_xb_end)
+{
+	unsigned long pfn_start, pfn_end;
+	uint32_t max_len = round_down(UINT_MAX, PAGE_SIZE);
+	uint64_t len;
+
+	pfn_start = page_xb_start;
+	while (pfn_start < page_xb_end) {
+		if (!xb_find_set(&vb->page_xb, page_xb_end, &pfn_start))
+			break;
+		pfn_end = pfn_start + 1;
+		if (!xb_find_zero(&vb->page_xb, page_xb_end, &pfn_end))
+			pfn_end = page_xb_end + 1;
+		len = (pfn_end - pfn_start) << PAGE_SHIFT;
+		while (len > max_len) {
+			batch_balloon_page_sg(vb, vq, pfn_start, max_len);
+			pfn_start += max_len >> PAGE_SHIFT;
+			len -= max_len;
+		}
+		batch_balloon_page_sg(vb, vq, pfn_start, (uint32_t)len);
+		pfn_start = pfn_end + 1;
+	}
+
+	/*
+	 * The last few sgs may not reach the batch size, but need a kick to
+	 * notify the device to handle them.
+	 */
+	if (vq->num_free != virtqueue_get_vring_size(vq))
+		kick_and_wait(vq, vb->acked);
+
+	xb_zero(&vb->page_xb, page_xb_start, page_xb_end);
+}
----------------------------------------

Where does the inefficiency of the !VIRTIO_BALLOON_F_SG path come from?
Does it come from the limitation that tell_host() has to call wait_event()
for every 256 pages (VIRTIO_BALLOON_ARRAY_PFNS_MAX entries)?

Is the value 256 required by the communication protocol? If not, why can't we use
larger values? If yes, is the sg_init_one()/virtqueue_add_outbuf()/virtqueue_kick()/wait_event()
sequence required by the communication protocol?

If not, why can't we call the sg_init_one()/virtqueue_add_outbuf() sequence N times
and call kick_and_wait() only when vq->num_free == 0 or when the last page is reached?

Even if the sg_init_one()/virtqueue_add_outbuf()/virtqueue_kick()/wait_event()
sequence is required by the communication protocol, we could consider introducing
a (sg_init_one()/virtqueue_add_outbuf()) * N + kick_and_wait() sequence (sketched
below) instead of the VIRTIO_BALLOON_F_SG flag. Then we wouldn't need patch 1,
which has to worry about memory allocation failure cases.
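
A minimal sketch of that batched sequence (a sketch only; kick_and_wait() is
from the patch above, while the separate-arrays detail is an assumption of
the sketch, since each queued buffer must stay untouched until the host
consumes it):

----------------------------------------
/*
 * Sketch: queue one PFN array per call and defer the kick until the
 * vq is full or the last batch has been queued.  Unlike tell_host(),
 * this must not reuse a single vb->pfns array across calls, so the
 * caller passes in the array for each batch.
 */
static void tell_host_one_batch(struct virtio_balloon *vb,
				struct virtqueue *vq,
				__virtio32 *pfns, unsigned int num_pfns,
				bool last)
{
	struct scatterlist sg;

	sg_init_one(&sg, pfns, sizeof(pfns[0]) * num_pfns);

	/* Assumed to succeed while vq->num_free > 0, as in the patch. */
	virtqueue_add_outbuf(vq, &sg, 1, vb, GFP_KERNEL);

	if (!vq->num_free || last)
		kick_and_wait(vq, vb->acked);
}
----------------------------------------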

> The previous virtio-balloon implementation is not very efficient, because
> balloon pages are transferred to the host one array at a time. Here is the
> breakdown of the time, in percentage, spent on each step
> of the balloon inflating process (inflating 7GB of an 8GB idle guest):
> 
> 1) allocating pages (6.5%)
> 2) sending PFNs to host (68.3%)
> 3) address translation (6.1%)
> 4) madvise (19%)
> 
> It takes about 4126ms for the inflating process to complete. The above
> profiling shows that the bottlenecks are steps 2) and 4) (roughly 2818ms
> and 784ms of the 4126ms, respectively).
> 
> This patch optimizes step 2) by transferring pages to the host in sgs. An sg
> describes a chunk of physically contiguous guest pages. With this
> mechanism, step 4) can also be optimized by doing address translation and
> madvise() in chunks rather than page by page.

Who does 4)? The host's kernel code? The host's hypervisor code (i.e. QEMU)?
The guest's kernel code? The guest's userspace code? If it is not the kernel
code, sorting PFNs in userspace would be very efficient.



Now, proceeding to patch 4.

Your patch calls add_one_sg() multiple times based on the assumption that

----------------------------------------
+	/*
+	 * This is expected to never fail: there is always at least 1 entry
+	 * available on the vq, because when the vq is full the worker thread
+	 * that adds the sg will be put into sleep until at least 1 entry is
+	 * available to use.
+	 */
+	BUG_ON(err);
----------------------------------------

holds. But why does that assumption hold true?
This add_one_sg() is called from batch_free_page_sg(), which is used as the callback of

----------------------------------------
+	walk_free_mem_block(vb, 0, &virtio_balloon_send_free_pages);
----------------------------------------

call. But if you look at patch 3, the callback function is called with irqs disabled.
That is, the comment

----------------------------------------
+ * The callback itself must not sleep or perform any operations which would
+ * require any memory allocations directly (not even GFP_NOWAIT/GFP_ATOMIC)
+ * or via any lock dependency. It is generally advisable to implement
+ * the callback as simple as possible and defer any heavy lifting to a
+ * different context.
----------------------------------------

in patch 3 is ignored in patch 4. This add_one_sg() assumes that there is always
at least 1 entry available on the vq. But if the vq is full, batch_free_page_sg()
cannot sleep, because walk_free_mem_block() disabled irqs before calling
batch_free_page_sg().

----------------------------------------
+static void batch_free_page_sg(struct virtqueue *vq,
+			       unsigned long pfn,
+			       uint32_t len)
+{
+	add_one_sg(vq, pfn, len);
+
+	/* Batch till the vq is full */
+	if (!vq->num_free)
+		virtqueue_kick(vq);
+}
----------------------------------------

Thus, the assumption patch 4 depends on cannot hold true. All the callback
can do is copy a snapshot of the free memory blocks into some buffer; you
need to defer sending that buffer to the host, as sketched below.
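
A minimal sketch of that deferral, assuming the callback signature used with
walk_free_mem_block() in patch 3 and a preallocated buffer (the struct, the
vb fields, and all names here are assumptions of the sketch):

----------------------------------------
/* Sketch: record free blocks in irq-disabled context, send later. */
struct free_block {
	unsigned long pfn;
	unsigned long nr_pages;
};

static int record_free_page_block(void *opaque, unsigned long pfn,
				  unsigned long nr_pages)
{
	struct virtio_balloon *vb = opaque;

	/*
	 * No allocation, no sleeping, no vq operations here: just copy
	 * into a buffer preallocated before the walk.
	 */
	if (vb->free_block_cnt < vb->free_block_cap) {
		vb->free_blocks[vb->free_block_cnt].pfn = pfn;
		vb->free_blocks[vb->free_block_cnt].nr_pages = nr_pages;
		vb->free_block_cnt++;
	}
	return 0;
}

/*
 * Then, from process context (e.g. a workqueue item), iterate over
 * vb->free_blocks and send the blocks with the usual add/kick/wait
 * sequence, where sleeping on a full vq is allowed.
 */
----------------------------------------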



Now, I suspect we need to add a VIRTIO_BALLOON_F_FREE_PAGE_VQ flag. I want to see
the patch for the hypervisor side which makes use of the VIRTIO_BALLOON_F_FREE_PAGE_VQ
flag, because its usage becomes tricky. Between the time the guest kernel obtains a
snapshot of the free memory blocks and the time the hypervisor is told that some pages
are currently free, those pages can come into use. That is, I don't think

  The second feature enables the optimization of the 1st round memory
  transfer - the hypervisor can skip the transfer of guest free pages in the
  1st round.

is accurate. The hypervisor is allowed to mark pages reported as "currently
unused" by the guest kernel as "write-protected" before starting the 1st round.
Then, the hypervisor copies all pages except the write-protected ones as
the 1st round, and the 2nd and later rounds stay the same. That is,
VIRTIO_BALLOON_F_FREE_PAGE_VQ requires the hypervisor to do a 0th round as
preparation. Thus, I want to see the patch for the hypervisor side.

Now, what if all free pages in the guest kernel were reserved as ballooned pages?
There would be no free pages for the VIRTIO_BALLOON_F_FREE_PAGE_VQ flag to help with.
The hypervisor would have to copy all pages, because every page is either currently
in use or currently in a balloon. After ballooning to an appropriate size, there will
be little free memory in the guest kernel, and the hypervisor already knows which
pages are in the balloon. Thus, the hypervisor can skip copying the content of the
pages in the balloon without using the VIRTIO_BALLOON_F_FREE_PAGE_VQ flag.

Then, why can't we do "inflate the balloon up to a reasonable level (e.g. no need to
wait for reclaim and no need to deflate)" instead of "find all the free pages as of a
specific moment"? That is, couldn't the code for VIRTIO_BALLOON_F_DEFLATE_ON_OOM be
reused instead of VIRTIO_BALLOON_F_FREE_PAGE_VQ?



> You are also welcome to send a patch to remove the redundant one if you 
> think that's an issue. Thanks.

Hint 1: Check the possible -EXXX values the xbitmap API can return.

Hint 2: Prepare for the worst case, where ballooning can allocate no memory for the SG.

Note that I can't write a patch: I can't test it, since I don't have the patch
for the hypervisor side.
Wang, Wei W Jan. 12, 2018, 9:14 a.m. UTC | #4
On 01/11/2018 07:06 PM, Tetsuo Handa wrote:
> Wei Wang wrote:
>> Michael, could we merge patch 3-5 first?
> No! I'm repeatedly asking you to propose only VIRTIO_BALLOON_F_SG changes.
> Please don't ignore me.
>
>
>
> Patch 4 depends on patch 2. Thus, back to patch 2.

There is no strict dependency per se. I plan to split the two features
into 2 series, and post patches 3-5 first, along with the corresponding
hypervisor code.
After that's done, I'll get back to the discussion of patch 2.




> Now, proceeding to patch 4.
>
> Your patch calls add_one_sg() multiple times based on the assumption that
>
> ----------------------------------------
> +	/*
> +	 * This is expected to never fail: there is always at least 1 entry
> +	 * available on the vq, because when the vq is full the worker thread
> +	 * that adds the sg will be put into sleep until at least 1 entry is
> +	 * available to use.
> +	 */

This will be clearer in the new version, which will not be bundled with
patch 2.


>
> Now, I suspect we need to add a VIRTIO_BALLOON_F_FREE_PAGE_VQ flag. I want to see
> the patch for the hypervisor side which makes use of the VIRTIO_BALLOON_F_FREE_PAGE_VQ
> flag, because its usage becomes tricky. Between the time the guest kernel obtains a
> snapshot of the free memory blocks and the time the hypervisor is told that some pages
> are currently free, those pages can come into use. That is, I don't think
>
>    The second feature enables the optimization of the 1st round memory
>    transfer - the hypervisor can skip the transfer of guest free pages in the
>    1st round.
>
> is accurate. The hypervisor is allowed to mark pages reported as "currently
> unused" by the guest kernel as "write-protected" before starting the 1st round.
> Then, the hypervisor copies all pages except the write-protected ones as
> the 1st round, and the 2nd and later rounds stay the same. That is,
> VIRTIO_BALLOON_F_FREE_PAGE_VQ requires the hypervisor to do a 0th round as
> preparation. Thus, I want to see the patch for the hypervisor side.
>
> Now, what if all free pages in the guest kernel were reserved as ballooned pages?
> There would be no free pages for the VIRTIO_BALLOON_F_FREE_PAGE_VQ flag to help with.
> The hypervisor would have to copy all pages, because every page is either currently
> in use or currently in a balloon. After ballooning to an appropriate size, there will
> be little free memory in the guest kernel, and the hypervisor already knows which
> pages are in the balloon. Thus, the hypervisor can skip copying the content of the
> pages in the balloon without using the VIRTIO_BALLOON_F_FREE_PAGE_VQ flag.
>
> Then, why can't we do "inflate the balloon up to a reasonable level (e.g. no need to
> wait for reclaim and no need to deflate)" instead of "find all the free pages as of a
> specific moment"? That is, couldn't the code for VIRTIO_BALLOON_F_DEFLATE_ON_OOM be
> reused instead of VIRTIO_BALLOON_F_FREE_PAGE_VQ?
>

I think you misunderstood the work; it's not easy to explain everything
from the beginning here. I wish to review patch 4 (I'll send out a new,
independent version) with Michael if possible.
I'll discuss patch 2 with you later. Thanks.

Best,
Wei

Patch

diff --git a/drivers/virtio/virtio_balloon.c b/drivers/virtio/virtio_balloon.c
index a1fb52c..d0b8ea0 100644
--- a/drivers/virtio/virtio_balloon.c
+++ b/drivers/virtio/virtio_balloon.c
@@ -32,6 +32,8 @@ 
 #include <linux/mm.h>
 #include <linux/mount.h>
 #include <linux/magic.h>
+#include <linux/xbitmap.h>
+#include <asm/page.h>
 
 /*
  * Balloon device works in 4K page units.  So each page is pointed to by
@@ -79,6 +81,9 @@  struct virtio_balloon {
 	/* Synchronize access/update to this struct virtio_balloon elements */
 	struct mutex balloon_lock;
 
+	/* The xbitmap used to record balloon pages */
+	struct xb page_xb;
+
 	/* The array of pfns we tell the Host about. */
 	unsigned int num_pfns;
 	__virtio32 pfns[VIRTIO_BALLOON_ARRAY_PFNS_MAX];
@@ -141,15 +146,128 @@  static void set_page_pfns(struct virtio_balloon *vb,
 					  page_to_balloon_pfn(page) + i);
 }
 
+static void kick_and_wait(struct virtqueue *vq, wait_queue_head_t wq_head)
+{
+	unsigned int len;
+
+	virtqueue_kick(vq);
+	wait_event(wq_head, virtqueue_get_buf(vq, &len));
+}
+
+static void add_one_sg(struct virtqueue *vq, unsigned long pfn, uint32_t len)
+{
+	struct scatterlist sg;
+	unsigned int unused;
+	int err;
+
+	sg_init_table(&sg, 1);
+	sg_set_page(&sg, pfn_to_page(pfn), len, 0);
+
+	/* Detach all the used buffers from the vq */
+	while (virtqueue_get_buf(vq, &unused))
+		;
+
+	err = virtqueue_add_inbuf(vq, &sg, 1, vq, GFP_KERNEL);
+	/*
+	 * This is expected to never fail: there is always at least 1 entry
+	 * available on the vq, because when the vq is full the worker thread
+	 * that adds the sg will be put into sleep until at least 1 entry is
+	 * available to use.
+	 */
+	BUG_ON(err);
+}
+
+static void batch_balloon_page_sg(struct virtio_balloon *vb,
+				  struct virtqueue *vq,
+				  unsigned long pfn,
+				  uint32_t len)
+{
+	add_one_sg(vq, pfn, len);
+
+	/* Batch till the vq is full */
+	if (!vq->num_free)
+		kick_and_wait(vq, vb->acked);
+}
+
+/*
+ * Send balloon pages in sgs to host. The balloon pages are recorded in the
+ * page xbitmap. Each bit in the bitmap corresponds to a page of PAGE_SIZE.
+ * The page xbitmap is searched for continuous "1" bits, which correspond
+ * to continuous pages, to chunk into sgs.
+ *
+ * @page_xb_start and @page_xb_end form the range of bits in the xbitmap that
+ * need to be searched.
+ */
+static void tell_host_sgs(struct virtio_balloon *vb,
+			  struct virtqueue *vq,
+			  unsigned long page_xb_start,
+			  unsigned long page_xb_end)
+{
+	unsigned long pfn_start, pfn_end;
+	uint32_t max_len = round_down(UINT_MAX, PAGE_SIZE);
+	uint64_t len;
+
+	pfn_start = page_xb_start;
+	while (pfn_start < page_xb_end) {
+		if (!xb_find_set(&vb->page_xb, page_xb_end, &pfn_start))
+			break;
+		pfn_end = pfn_start + 1;
+		if (!xb_find_zero(&vb->page_xb, page_xb_end, &pfn_end))
+			pfn_end = page_xb_end + 1;
+		len = (pfn_end - pfn_start) << PAGE_SHIFT;
+		while (len > max_len) {
+			batch_balloon_page_sg(vb, vq, pfn_start, max_len);
+			pfn_start += max_len >> PAGE_SHIFT;
+			len -= max_len;
+		}
+		batch_balloon_page_sg(vb, vq, pfn_start, (uint32_t)len);
+		pfn_start = pfn_end + 1;
+	}
+
+	/*
+	 * The last few sgs may not reach the batch size, but need a kick to
+	 * notify the device to handle them.
+	 */
+	if (vq->num_free != virtqueue_get_vring_size(vq))
+		kick_and_wait(vq, vb->acked);
+
+	xb_zero(&vb->page_xb, page_xb_start, page_xb_end);
+}
+
+static inline int xb_set_page(struct virtio_balloon *vb,
+			       struct page *page,
+			       unsigned long *pfn_min,
+			       unsigned long *pfn_max)
+{
+	unsigned long pfn = page_to_pfn(page);
+	int ret;
+
+	*pfn_min = min(pfn, *pfn_min);
+	*pfn_max = max(pfn, *pfn_max);
+
+	do {
+		if (xb_preload(GFP_NOWAIT | __GFP_NOWARN) < 0)
+			return -ENOMEM;
+
+		ret = xb_set_bit(&vb->page_xb, pfn);
+		xb_preload_end();
+	} while (unlikely(ret == -EAGAIN));
+
+	return ret;
+}
+
 static unsigned fill_balloon(struct virtio_balloon *vb, size_t num)
 {
 	unsigned num_allocated_pages;
 	unsigned num_pfns;
 	struct page *page;
 	LIST_HEAD(pages);
+	bool use_sg = virtio_has_feature(vb->vdev, VIRTIO_BALLOON_F_SG);
+	unsigned long pfn_max = 0, pfn_min = ULONG_MAX;
 
 	/* We can only do one array worth at a time. */
-	num = min(num, ARRAY_SIZE(vb->pfns));
+	if (!use_sg)
+		num = min(num, ARRAY_SIZE(vb->pfns));
 
 	for (num_pfns = 0; num_pfns < num;
 	     num_pfns += VIRTIO_BALLOON_PAGES_PER_PAGE) {
@@ -172,9 +290,16 @@  static unsigned fill_balloon(struct virtio_balloon *vb, size_t num)
 	vb->num_pfns = 0;
 
 	while ((page = balloon_page_pop(&pages))) {
+		if (use_sg) {
+			if (xb_set_page(vb, page, &pfn_min, &pfn_max) < 0) {
+				__free_page(page);
+				continue;
+			}
+		} else {
+			set_page_pfns(vb, vb->pfns + vb->num_pfns, page);
+		}
 		balloon_page_enqueue(&vb->vb_dev_info, page);
 
-		set_page_pfns(vb, vb->pfns + vb->num_pfns, page);
 		vb->num_pages += VIRTIO_BALLOON_PAGES_PER_PAGE;
 		if (!virtio_has_feature(vb->vdev,
 					VIRTIO_BALLOON_F_DEFLATE_ON_OOM))
@@ -184,8 +309,12 @@  static unsigned fill_balloon(struct virtio_balloon *vb, size_t num)
 
 	num_allocated_pages = vb->num_pfns;
 	/* Did we get any? */
-	if (vb->num_pfns != 0)
-		tell_host(vb, vb->inflate_vq);
+	if (vb->num_pfns) {
+		if (use_sg)
+			tell_host_sgs(vb, vb->inflate_vq, pfn_min, pfn_max);
+		else
+			tell_host(vb, vb->inflate_vq);
+	}
 	mutex_unlock(&vb->balloon_lock);
 
 	return num_allocated_pages;
@@ -211,9 +340,12 @@  static unsigned leak_balloon(struct virtio_balloon *vb, size_t num)
 	struct page *page;
 	struct balloon_dev_info *vb_dev_info = &vb->vb_dev_info;
 	LIST_HEAD(pages);
+	bool use_sg = virtio_has_feature(vb->vdev, VIRTIO_BALLOON_F_SG);
+	unsigned long pfn_max = 0, pfn_min = ULONG_MAX;
 
-	/* We can only do one array worth at a time. */
-	num = min(num, ARRAY_SIZE(vb->pfns));
+	/* Traditionally, we can only do one array worth at a time. */
+	if (!use_sg)
+		num = min(num, ARRAY_SIZE(vb->pfns));
 
 	mutex_lock(&vb->balloon_lock);
 	/* We can't release more pages than taken */
@@ -223,7 +355,14 @@  static unsigned leak_balloon(struct virtio_balloon *vb, size_t num)
 		page = balloon_page_dequeue(vb_dev_info);
 		if (!page)
 			break;
-		set_page_pfns(vb, vb->pfns + vb->num_pfns, page);
+		if (use_sg) {
+			if (xb_set_page(vb, page, &pfn_min, &pfn_max) < 0) {
+				balloon_page_enqueue(&vb->vb_dev_info, page);
+				break;
+			}
+		} else {
+			set_page_pfns(vb, vb->pfns + vb->num_pfns, page);
+		}
 		list_add(&page->lru, &pages);
 		vb->num_pages -= VIRTIO_BALLOON_PAGES_PER_PAGE;
 	}
@@ -234,13 +373,55 @@  static unsigned leak_balloon(struct virtio_balloon *vb, size_t num)
 	 * virtio_has_feature(vdev, VIRTIO_BALLOON_F_MUST_TELL_HOST);
 	 * is true, we *have* to do it in this order
 	 */
-	if (vb->num_pfns != 0)
-		tell_host(vb, vb->deflate_vq);
+	if (vb->num_pfns) {
+		if (use_sg)
+			tell_host_sgs(vb, vb->deflate_vq, pfn_min, pfn_max);
+		else
+			tell_host(vb, vb->deflate_vq);
+	}
 	release_pages_balloon(vb, &pages);
 	mutex_unlock(&vb->balloon_lock);
 	return num_freed_pages;
 }
 
+/*
+ * The regular leak_balloon() with VIRTIO_BALLOON_F_SG needs memory allocation
+ * for xbitmap, which is not suitable for the oom case. This function does not
+ * use xbitmap to chunk pages, so it can be used by oom notifier to deflate
+ * pages when VIRTIO_BALLOON_F_SG is negotiated.
+ */
+static unsigned int leak_balloon_sg_oom(struct virtio_balloon *vb)
+{
+	unsigned int n;
+	struct page *page;
+	struct balloon_dev_info *vb_dev_info = &vb->vb_dev_info;
+	struct virtqueue *vq = vb->deflate_vq;
+	LIST_HEAD(pages);
+
+	mutex_lock(&vb->balloon_lock);
+	for (n = 0; n < oom_pages; n++) {
+		page = balloon_page_dequeue(vb_dev_info);
+		if (!page)
+			break;
+
+		list_add(&page->lru, &pages);
+		vb->num_pages -= VIRTIO_BALLOON_PAGES_PER_PAGE;
+		batch_balloon_page_sg(vb, vb->deflate_vq, page_to_pfn(page),
+				      PAGE_SIZE);
+		release_pages_balloon(vb, &pages);
+	}
+
+	/*
+	 * The last few sgs may not reach the batch size, but need a kick to
+	 * notify the device to handle them.
+	 */
+	if (vq->num_free != virtqueue_get_vring_size(vq))
+		kick_and_wait(vq, vb->acked);
+	mutex_unlock(&vb->balloon_lock);
+
+	return n;
+}
+
 static inline void update_stat(struct virtio_balloon *vb, int idx,
 			       u16 tag, u64 val)
 {
@@ -380,7 +561,10 @@  static int virtballoon_oom_notify(struct notifier_block *self,
 		return NOTIFY_OK;
 
 	freed = parm;
-	num_freed_pages = leak_balloon(vb, oom_pages);
+	if (virtio_has_feature(vb->vdev, VIRTIO_BALLOON_F_SG))
+		num_freed_pages = leak_balloon_sg_oom(vb);
+	else
+		num_freed_pages = leak_balloon(vb, oom_pages);
 	update_balloon_size(vb);
 	*freed += num_freed_pages;
 
@@ -477,6 +661,7 @@  static int virtballoon_migratepage(struct balloon_dev_info *vb_dev_info,
 {
 	struct virtio_balloon *vb = container_of(vb_dev_info,
 			struct virtio_balloon, vb_dev_info);
+	bool use_sg = virtio_has_feature(vb->vdev, VIRTIO_BALLOON_F_SG);
 	unsigned long flags;
 
 	/*
@@ -498,16 +683,24 @@  static int virtballoon_migratepage(struct balloon_dev_info *vb_dev_info,
 	vb_dev_info->isolated_pages--;
 	__count_vm_event(BALLOON_MIGRATE);
 	spin_unlock_irqrestore(&vb_dev_info->pages_lock, flags);
-	vb->num_pfns = VIRTIO_BALLOON_PAGES_PER_PAGE;
-	set_page_pfns(vb, vb->pfns, newpage);
-	tell_host(vb, vb->inflate_vq);
-
+	if (use_sg) {
+		add_one_sg(vb->inflate_vq, page_to_pfn(newpage), PAGE_SIZE);
+		kick_and_wait(vb->inflate_vq, vb->acked);
+	} else {
+		vb->num_pfns = VIRTIO_BALLOON_PAGES_PER_PAGE;
+		set_page_pfns(vb, vb->pfns, newpage);
+		tell_host(vb, vb->inflate_vq);
+	}
 	/* balloon's page migration 2nd step -- deflate "page" */
 	balloon_page_delete(page);
-	vb->num_pfns = VIRTIO_BALLOON_PAGES_PER_PAGE;
-	set_page_pfns(vb, vb->pfns, page);
-	tell_host(vb, vb->deflate_vq);
-
+	if (use_sg) {
+		add_one_sg(vb->deflate_vq, page_to_pfn(page), PAGE_SIZE);
+		kick_and_wait(vb->deflate_vq, vb->acked);
+	} else {
+		vb->num_pfns = VIRTIO_BALLOON_PAGES_PER_PAGE;
+		set_page_pfns(vb, vb->pfns, page);
+		tell_host(vb, vb->deflate_vq);
+	}
 	mutex_unlock(&vb->balloon_lock);
 
 	put_page(page); /* balloon reference */
@@ -566,6 +759,9 @@  static int virtballoon_probe(struct virtio_device *vdev)
 	if (err)
 		goto out_free_vb;
 
+	if (virtio_has_feature(vdev, VIRTIO_BALLOON_F_SG))
+		xb_init(&vb->page_xb);
+
 	vb->nb.notifier_call = virtballoon_oom_notify;
 	vb->nb.priority = VIRTBALLOON_OOM_NOTIFY_PRIORITY;
 	err = register_oom_notifier(&vb->nb);
@@ -682,6 +878,7 @@  static unsigned int features[] = {
 	VIRTIO_BALLOON_F_MUST_TELL_HOST,
 	VIRTIO_BALLOON_F_STATS_VQ,
 	VIRTIO_BALLOON_F_DEFLATE_ON_OOM,
+	VIRTIO_BALLOON_F_SG,
 };
 
 static struct virtio_driver virtio_balloon_driver = {
diff --git a/include/uapi/linux/virtio_balloon.h b/include/uapi/linux/virtio_balloon.h
index 343d7dd..37780a7 100644
--- a/include/uapi/linux/virtio_balloon.h
+++ b/include/uapi/linux/virtio_balloon.h
@@ -34,6 +34,7 @@ 
 #define VIRTIO_BALLOON_F_MUST_TELL_HOST	0 /* Tell before reclaiming pages */
 #define VIRTIO_BALLOON_F_STATS_VQ	1 /* Memory Stats virtqueue */
 #define VIRTIO_BALLOON_F_DEFLATE_ON_OOM	2 /* Deflate balloon on OOM */
+#define VIRTIO_BALLOON_F_SG		3 /* Use sg instead of PFN lists */
 
 /* Size of a PFN in the balloon interface. */
 #define VIRTIO_BALLOON_PFN_SHIFT 12