From patchwork Wed Jul 10 19:51:57 2019 Content-Type: text/plain; charset="utf-8" MIME-Version: 1.0 Content-Transfer-Encoding: 7bit X-Patchwork-Submitter: Nitesh Narayan Lal X-Patchwork-Id: 11038869 Return-Path: Received: from mail.wl.linuxfoundation.org (pdx-wl-mail.web.codeaurora.org [172.30.200.125]) by pdx-korg-patchwork-2.web.codeaurora.org (Postfix) with ESMTP id 853C31395 for ; Wed, 10 Jul 2019 19:52:34 +0000 (UTC) Received: from mail.wl.linuxfoundation.org (localhost [127.0.0.1]) by mail.wl.linuxfoundation.org (Postfix) with ESMTP id 75F2B28765 for ; Wed, 10 Jul 2019 19:52:34 +0000 (UTC) Received: by mail.wl.linuxfoundation.org (Postfix, from userid 486) id 6A08C289DA; Wed, 10 Jul 2019 19:52:34 +0000 (UTC) X-Spam-Checker-Version: SpamAssassin 3.3.1 (2010-03-16) on pdx-wl-mail.web.codeaurora.org X-Spam-Level: X-Spam-Status: No, score=-7.9 required=2.0 tests=BAYES_00,MAILING_LIST_MULTI, RCVD_IN_DNSWL_HI autolearn=unavailable version=3.3.1 Received: from vger.kernel.org (vger.kernel.org [209.132.180.67]) by mail.wl.linuxfoundation.org (Postfix) with ESMTP id 8041C289CB for ; Wed, 10 Jul 2019 19:52:33 +0000 (UTC) Received: (majordomo@vger.kernel.org) by vger.kernel.org via listexpand id S1728163AbfGJTw3 (ORCPT ); Wed, 10 Jul 2019 15:52:29 -0400 Received: from mx1.redhat.com ([209.132.183.28]:33552 "EHLO mx1.redhat.com" rhost-flags-OK-OK-OK-OK) by vger.kernel.org with ESMTP id S1725911AbfGJTw2 (ORCPT ); Wed, 10 Jul 2019 15:52:28 -0400 Received: from smtp.corp.redhat.com (int-mx08.intmail.prod.int.phx2.redhat.com [10.5.11.23]) (using TLSv1.2 with cipher AECDH-AES256-SHA (256/256 bits)) (No client certificate requested) by mx1.redhat.com (Postfix) with ESMTPS id 3500430C0DFE; Wed, 10 Jul 2019 19:52:27 +0000 (UTC) Received: from virtlab512.virt.lab.eng.bos.redhat.com (virtlab512.virt.lab.eng.bos.redhat.com [10.19.152.206]) by smtp.corp.redhat.com (Postfix) with ESMTP id 686A219C70; Wed, 10 Jul 2019 19:52:25 +0000 (UTC) From: Nitesh Narayan Lal To: kvm@vger.kernel.org, linux-kernel@vger.kernel.org, linux-mm@kvack.org, pbonzini@redhat.com, lcapitulino@redhat.com, pagupta@redhat.com, wei.w.wang@intel.com, yang.zhang.wz@gmail.com, riel@surriel.com, david@redhat.com, mst@redhat.com, dodgen@google.com, konrad.wilk@oracle.com, dhildenb@redhat.com, aarcange@redhat.com, alexander.duyck@gmail.com, john.starks@microsoft.com, dave.hansen@intel.com, mhocko@suse.com Subject: [RFC][Patch v11 1/2] mm: page_hinting: core infrastructure Date: Wed, 10 Jul 2019 15:51:57 -0400 Message-Id: <20190710195158.19640-2-nitesh@redhat.com> In-Reply-To: <20190710195158.19640-1-nitesh@redhat.com> References: <20190710195158.19640-1-nitesh@redhat.com> MIME-Version: 1.0 X-Scanned-By: MIMEDefang 2.84 on 10.5.11.23 X-Greylist: Sender IP whitelisted, not delayed by milter-greylist-4.5.16 (mx1.redhat.com [10.5.110.45]); Wed, 10 Jul 2019 19:52:27 +0000 (UTC) Sender: kvm-owner@vger.kernel.org Precedence: bulk List-ID: X-Mailing-List: kvm@vger.kernel.org X-Virus-Scanned: ClamAV using ClamSMTP This patch introduces the core infrastructure for free page hinting in virtual environments. It enables the kernel to track the free pages which can be reported to its hypervisor so that the hypervisor could free and reuse that memory as per its requirement. While the pages are getting processed in the hypervisor (e.g., via MADV_FREE), the guest must not use them, otherwise, data loss would be possible. To avoid such a situation, these pages are temporarily removed from the buddy. The amount of pages removed temporarily from the buddy is governed by the backend(virtio-balloon in our case). To efficiently identify free pages that can to be hinted to the hypervisor, bitmaps in a coarse granularity are used. Only fairly big chunks are reported to the hypervisor - especially, to not break up THP in the hypervisor - "MAX_ORDER - 2" on x86, and to save space. The bits in the bitmap are an indication whether a page *might* be free, not a guarantee. A new hook after buddy merging sets the bits. Bitmaps are stored per zone, protected by the zone lock. A workqueue asynchronously processes the bitmaps, trying to isolate and report pages that are still free. The backend (virtio-balloon) is responsible for reporting these batched pages to the host synchronously. Once reporting/ freeing is complete, isolated pages are returned back to the buddy. There are still various things to look into (e.g., memory hotplug, more efficient locking, possible races when disabling). Signed-off-by: Nitesh Narayan Lal --- include/linux/page_hinting.h | 45 +++++++ mm/Kconfig | 6 + mm/Makefile | 1 + mm/page_alloc.c | 18 +-- mm/page_hinting.c | 250 +++++++++++++++++++++++++++++++++++ 5 files changed, 312 insertions(+), 8 deletions(-) create mode 100644 include/linux/page_hinting.h create mode 100644 mm/page_hinting.c diff --git a/include/linux/page_hinting.h b/include/linux/page_hinting.h new file mode 100644 index 000000000000..4900feb796f9 --- /dev/null +++ b/include/linux/page_hinting.h @@ -0,0 +1,45 @@ +/* SPDX-License-Identifier: GPL-2.0 */ +#ifndef _LINUX_PAGE_HINTING_H +#define _LINUX_PAGE_HINTING_H + +/* + * Minimum page order required for a page to be hinted to the host. + */ +#define PAGE_HINTING_MIN_ORDER (MAX_ORDER - 2) + +/* + * struct page_hinting_config: holds the information supplied by the balloon + * device to page hinting. + * @hint_pages: Callback which reports the isolated pages + * synchornously to the host. + * @max_pages: Maxmimum pages that are going to be hinted to the host + * at a time of granularity >= PAGE_HINTING_MIN_ORDER. + */ +struct page_hinting_config { + void (*hint_pages)(struct list_head *list); + int max_pages; +}; + +extern int __isolate_free_page(struct page *page, unsigned int order); +extern void __free_one_page(struct page *page, unsigned long pfn, + struct zone *zone, unsigned int order, + int migratetype, bool hint); +#ifdef CONFIG_PAGE_HINTING +void page_hinting_enqueue(struct page *page, int order); +int page_hinting_enable(const struct page_hinting_config *conf); +void page_hinting_disable(void); +#else +static inline void page_hinting_enqueue(struct page *page, int order) +{ +} + +static inline int page_hinting_enable(const struct page_hinting_config *conf) +{ + return -EOPNOTSUPP; +} + +static inline void page_hinting_disable(void) +{ +} +#endif +#endif /* _LINUX_PAGE_HINTING_H */ diff --git a/mm/Kconfig b/mm/Kconfig index f0c76ba47695..e97fab429d9b 100644 --- a/mm/Kconfig +++ b/mm/Kconfig @@ -765,4 +765,10 @@ config GUP_BENCHMARK config ARCH_HAS_PTE_SPECIAL bool +# PAGE_HINTING will allow the guest to report the free pages to the +# host in fixed chunks as soon as the threshold is reached. +config PAGE_HINTING + bool + def_bool n + depends on X86_64 endmenu diff --git a/mm/Makefile b/mm/Makefile index ac5e5ba78874..73be49177656 100644 --- a/mm/Makefile +++ b/mm/Makefile @@ -94,6 +94,7 @@ obj-$(CONFIG_Z3FOLD) += z3fold.o obj-$(CONFIG_GENERIC_EARLY_IOREMAP) += early_ioremap.o obj-$(CONFIG_CMA) += cma.o obj-$(CONFIG_MEMORY_BALLOON) += balloon_compaction.o +obj-$(CONFIG_PAGE_HINTING) += page_hinting.o obj-$(CONFIG_PAGE_EXTENSION) += page_ext.o obj-$(CONFIG_CMA_DEBUGFS) += cma_debug.o obj-$(CONFIG_USERFAULTFD) += userfaultfd.o diff --git a/mm/page_alloc.c b/mm/page_alloc.c index d66bc8abe0af..8a44338bd04e 100644 --- a/mm/page_alloc.c +++ b/mm/page_alloc.c @@ -69,6 +69,7 @@ #include #include #include +#include #include #include @@ -874,10 +875,10 @@ compaction_capture(struct capture_control *capc, struct page *page, * -- nyc */ -static inline void __free_one_page(struct page *page, +inline void __free_one_page(struct page *page, unsigned long pfn, struct zone *zone, unsigned int order, - int migratetype) + int migratetype, bool hint) { unsigned long combined_pfn; unsigned long uninitialized_var(buddy_pfn); @@ -980,7 +981,8 @@ static inline void __free_one_page(struct page *page, migratetype); else add_to_free_area(page, &zone->free_area[order], migratetype); - + if (hint) + page_hinting_enqueue(page, order); } /* @@ -1263,7 +1265,7 @@ static void free_pcppages_bulk(struct zone *zone, int count, if (unlikely(isolated_pageblocks)) mt = get_pageblock_migratetype(page); - __free_one_page(page, page_to_pfn(page), zone, 0, mt); + __free_one_page(page, page_to_pfn(page), zone, 0, mt, true); trace_mm_page_pcpu_drain(page, 0, mt); } spin_unlock(&zone->lock); @@ -1272,14 +1274,14 @@ static void free_pcppages_bulk(struct zone *zone, int count, static void free_one_page(struct zone *zone, struct page *page, unsigned long pfn, unsigned int order, - int migratetype) + int migratetype, bool hint) { spin_lock(&zone->lock); if (unlikely(has_isolate_pageblock(zone) || is_migrate_isolate(migratetype))) { migratetype = get_pfnblock_migratetype(page, pfn); } - __free_one_page(page, pfn, zone, order, migratetype); + __free_one_page(page, pfn, zone, order, migratetype, hint); spin_unlock(&zone->lock); } @@ -1369,7 +1371,7 @@ static void __free_pages_ok(struct page *page, unsigned int order) migratetype = get_pfnblock_migratetype(page, pfn); local_irq_save(flags); __count_vm_events(PGFREE, 1 << order); - free_one_page(page_zone(page), page, pfn, order, migratetype); + free_one_page(page_zone(page), page, pfn, order, migratetype, true); local_irq_restore(flags); } @@ -2969,7 +2971,7 @@ static void free_unref_page_commit(struct page *page, unsigned long pfn) */ if (migratetype >= MIGRATE_PCPTYPES) { if (unlikely(is_migrate_isolate(migratetype))) { - free_one_page(zone, page, pfn, 0, migratetype); + free_one_page(zone, page, pfn, 0, migratetype, true); return; } migratetype = MIGRATE_MOVABLE; diff --git a/mm/page_hinting.c b/mm/page_hinting.c new file mode 100644 index 000000000000..0bfa09f8c3ed --- /dev/null +++ b/mm/page_hinting.c @@ -0,0 +1,250 @@ +// SPDX-License-Identifier: GPL-2.0 +/* + * Page hinting core infrastructure to enable a VM to report free pages to its + * hypervisor. + * + * Copyright Red Hat, Inc. 2019 + * + * Author(s): Nitesh Narayan Lal + */ + +#include +#include +#include +#include + +/* + * struct zone_free_area: For a single zone across NUMA nodes, it holds the + * bitmap pointer to track the free pages and other required parameters + * used to recover these pages by scanning the bitmap. + * @bitmap: Pointer to the bitmap in PAGE_HINTING_MIN_ORDER + * granularity. + * @base_pfn: Starting PFN value for the zone whose bitmap is stored. + * @end_pfn: Indicates the last PFN value for the zone. + * @free_pages: Tracks the number of free pages of granularity + * PAGE_HINTING_MIN_ORDER. + * @nbits: Indicates the total size of the bitmap in bits allocated + * at the time of initialization. + */ +struct zone_free_area { + unsigned long *bitmap; + unsigned long base_pfn; + unsigned long end_pfn; + atomic_t free_pages; + unsigned long nbits; +} free_area[MAX_NR_ZONES]; + +static void init_hinting_wq(struct work_struct *work); +static DEFINE_MUTEX(page_hinting_init); +const struct page_hinting_config *page_hitning_conf; +struct work_struct hinting_work; +atomic_t page_hinting_active; + +void free_area_cleanup(int nr_zones) +{ + int zone_idx; + + for (zone_idx = 0; zone_idx < nr_zones; zone_idx++) { + bitmap_free(free_area[zone_idx].bitmap); + free_area[zone_idx].base_pfn = 0; + free_area[zone_idx].end_pfn = 0; + free_area[zone_idx].nbits = 0; + atomic_set(&free_area[zone_idx].free_pages, 0); + } +} + +int page_hinting_enable(const struct page_hinting_config *conf) +{ + unsigned long bitmap_size = 0; + int zone_idx = 0, ret = -EBUSY; + struct zone *zone; + + mutex_lock(&page_hinting_init); + if (!page_hitning_conf) { + for_each_populated_zone(zone) { + zone_idx = zone_idx(zone); +#ifdef CONFIG_ZONE_DEVICE + if (zone_idx == ZONE_DEVICE) + continue; +#endif + spin_lock(&zone->lock); + if (free_area[zone_idx].base_pfn) { + free_area[zone_idx].base_pfn = + min(free_area[zone_idx].base_pfn, + zone->zone_start_pfn); + free_area[zone_idx].end_pfn = + max(free_area[zone_idx].end_pfn, + zone->zone_start_pfn + + zone->spanned_pages); + } else { + free_area[zone_idx].base_pfn = + zone->zone_start_pfn; + free_area[zone_idx].end_pfn = + zone->zone_start_pfn + + zone->spanned_pages; + } + spin_unlock(&zone->lock); + } + + for (zone_idx = 0; zone_idx < MAX_NR_ZONES; zone_idx++) { + unsigned long pages = free_area[zone_idx].end_pfn - + free_area[zone_idx].base_pfn; + bitmap_size = (pages >> PAGE_HINTING_MIN_ORDER) + 1; + if (!bitmap_size) + continue; + free_area[zone_idx].bitmap = bitmap_zalloc(bitmap_size, + GFP_KERNEL); + if (!free_area[zone_idx].bitmap) { + free_area_cleanup(zone_idx); + mutex_unlock(&page_hinting_init); + return -ENOMEM; + } + free_area[zone_idx].nbits = bitmap_size; + } + page_hitning_conf = conf; + INIT_WORK(&hinting_work, init_hinting_wq); + ret = 0; + } + mutex_unlock(&page_hinting_init); + return ret; +} +EXPORT_SYMBOL_GPL(page_hinting_enable); + +void page_hinting_disable(void) +{ + cancel_work_sync(&hinting_work); + page_hitning_conf = NULL; + free_area_cleanup(MAX_NR_ZONES); +} +EXPORT_SYMBOL_GPL(page_hinting_disable); + +static unsigned long pfn_to_bit(struct page *page, int zone_idx) +{ + unsigned long bitnr; + + bitnr = (page_to_pfn(page) - free_area[zone_idx].base_pfn) + >> PAGE_HINTING_MIN_ORDER; + return bitnr; +} + +static void release_buddy_pages(struct list_head *pages) +{ + int mt = 0, zone_idx, order; + struct page *page, *next; + unsigned long bitnr; + struct zone *zone; + + list_for_each_entry_safe(page, next, pages, lru) { + zone_idx = page_zonenum(page); + zone = page_zone(page); + bitnr = pfn_to_bit(page, zone_idx); + spin_lock(&zone->lock); + list_del(&page->lru); + order = page_private(page); + set_page_private(page, 0); + mt = get_pageblock_migratetype(page); + __free_one_page(page, page_to_pfn(page), zone, + order, mt, false); + spin_unlock(&zone->lock); + } +} + +static void bm_set_pfn(struct page *page) +{ + struct zone *zone = page_zone(page); + int zone_idx = page_zonenum(page); + unsigned long bitnr = 0; + + lockdep_assert_held(&zone->lock); + bitnr = pfn_to_bit(page, zone_idx); + /* + * TODO: fix possible underflows. + */ + if (free_area[zone_idx].bitmap && + bitnr < free_area[zone_idx].nbits && + !test_and_set_bit(bitnr, free_area[zone_idx].bitmap)) + atomic_inc(&free_area[zone_idx].free_pages); +} + +static void scan_zone_free_area(int zone_idx, int free_pages) +{ + int ret = 0, order, isolated_cnt = 0; + unsigned long set_bit, start = 0; + LIST_HEAD(isolated_pages); + struct page *page; + struct zone *zone; + + for (;;) { + ret = 0; + set_bit = find_next_bit(free_area[zone_idx].bitmap, + free_area[zone_idx].nbits, start); + if (set_bit >= free_area[zone_idx].nbits) + break; + page = pfn_to_online_page((set_bit << PAGE_HINTING_MIN_ORDER) + + free_area[zone_idx].base_pfn); + if (!page) + continue; + zone = page_zone(page); + spin_lock(&zone->lock); + + if (PageBuddy(page) && page_private(page) >= + PAGE_HINTING_MIN_ORDER) { + order = page_private(page); + ret = __isolate_free_page(page, order); + } + clear_bit(set_bit, free_area[zone_idx].bitmap); + atomic_dec(&free_area[zone_idx].free_pages); + spin_unlock(&zone->lock); + if (ret) { + /* + * restoring page order to use it while releasing + * the pages back to the buddy. + */ + set_page_private(page, order); + list_add_tail(&page->lru, &isolated_pages); + isolated_cnt++; + if (isolated_cnt == page_hitning_conf->max_pages) { + page_hitning_conf->hint_pages(&isolated_pages); + release_buddy_pages(&isolated_pages); + isolated_cnt = 0; + } + } + start = set_bit + 1; + } + if (isolated_cnt) { + page_hitning_conf->hint_pages(&isolated_pages); + release_buddy_pages(&isolated_pages); + } +} + +static void init_hinting_wq(struct work_struct *work) +{ + int zone_idx, free_pages; + + atomic_set(&page_hinting_active, 1); + for (zone_idx = 0; zone_idx < MAX_NR_ZONES; zone_idx++) { + free_pages = atomic_read(&free_area[zone_idx].free_pages); + if (free_pages >= page_hitning_conf->max_pages) + scan_zone_free_area(zone_idx, free_pages); + } + atomic_set(&page_hinting_active, 0); +} + +void page_hinting_enqueue(struct page *page, int order) +{ + int zone_idx; + + if (!page_hitning_conf || order < PAGE_HINTING_MIN_ORDER) + return; + + bm_set_pfn(page); + if (atomic_read(&page_hinting_active)) + return; + zone_idx = zone_idx(page_zone(page)); + if (atomic_read(&free_area[zone_idx].free_pages) >= + page_hitning_conf->max_pages) { + int cpu = smp_processor_id(); + + queue_work_on(cpu, system_wq, &hinting_work); + } +} From patchwork Wed Jul 10 19:51:58 2019 Content-Type: text/plain; charset="utf-8" MIME-Version: 1.0 Content-Transfer-Encoding: 7bit X-Patchwork-Submitter: Nitesh Narayan Lal X-Patchwork-Id: 11038871 Return-Path: Received: from mail.wl.linuxfoundation.org (pdx-wl-mail.web.codeaurora.org [172.30.200.125]) by pdx-korg-patchwork-2.web.codeaurora.org (Postfix) with ESMTP id 7A544138D for ; Wed, 10 Jul 2019 19:52:35 +0000 (UTC) Received: from mail.wl.linuxfoundation.org (localhost [127.0.0.1]) by mail.wl.linuxfoundation.org (Postfix) with ESMTP id 6E2EC289CB for ; Wed, 10 Jul 2019 19:52:35 +0000 (UTC) Received: by mail.wl.linuxfoundation.org (Postfix, from userid 486) id 626EB289DA; Wed, 10 Jul 2019 19:52:35 +0000 (UTC) X-Spam-Checker-Version: SpamAssassin 3.3.1 (2010-03-16) on pdx-wl-mail.web.codeaurora.org X-Spam-Level: X-Spam-Status: No, score=-7.9 required=2.0 tests=BAYES_00,MAILING_LIST_MULTI, RCVD_IN_DNSWL_HI autolearn=ham version=3.3.1 Received: from vger.kernel.org (vger.kernel.org [209.132.180.67]) by mail.wl.linuxfoundation.org (Postfix) with ESMTP id B408B28765 for ; Wed, 10 Jul 2019 19:52:34 +0000 (UTC) Received: (majordomo@vger.kernel.org) by vger.kernel.org via listexpand id S1728217AbfGJTwd (ORCPT ); Wed, 10 Jul 2019 15:52:33 -0400 Received: from mx1.redhat.com ([209.132.183.28]:45260 "EHLO mx1.redhat.com" rhost-flags-OK-OK-OK-OK) by vger.kernel.org with ESMTP id S1728138AbfGJTw3 (ORCPT ); Wed, 10 Jul 2019 15:52:29 -0400 Received: from smtp.corp.redhat.com (int-mx08.intmail.prod.int.phx2.redhat.com [10.5.11.23]) (using TLSv1.2 with cipher AECDH-AES256-SHA (256/256 bits)) (No client certificate requested) by mx1.redhat.com (Postfix) with ESMTPS id 1FA2C307D850; Wed, 10 Jul 2019 19:52:29 +0000 (UTC) Received: from virtlab512.virt.lab.eng.bos.redhat.com (virtlab512.virt.lab.eng.bos.redhat.com [10.19.152.206]) by smtp.corp.redhat.com (Postfix) with ESMTP id 54F7119C69; Wed, 10 Jul 2019 19:52:27 +0000 (UTC) From: Nitesh Narayan Lal To: kvm@vger.kernel.org, linux-kernel@vger.kernel.org, linux-mm@kvack.org, pbonzini@redhat.com, lcapitulino@redhat.com, pagupta@redhat.com, wei.w.wang@intel.com, yang.zhang.wz@gmail.com, riel@surriel.com, david@redhat.com, mst@redhat.com, dodgen@google.com, konrad.wilk@oracle.com, dhildenb@redhat.com, aarcange@redhat.com, alexander.duyck@gmail.com, john.starks@microsoft.com, dave.hansen@intel.com, mhocko@suse.com Subject: [RFC][Patch v11 2/2] virtio-balloon: page_hinting: reporting to the host Date: Wed, 10 Jul 2019 15:51:58 -0400 Message-Id: <20190710195158.19640-3-nitesh@redhat.com> In-Reply-To: <20190710195158.19640-1-nitesh@redhat.com> References: <20190710195158.19640-1-nitesh@redhat.com> MIME-Version: 1.0 X-Scanned-By: MIMEDefang 2.84 on 10.5.11.23 X-Greylist: Sender IP whitelisted, not delayed by milter-greylist-4.5.16 (mx1.redhat.com [10.5.110.48]); Wed, 10 Jul 2019 19:52:29 +0000 (UTC) Sender: kvm-owner@vger.kernel.org Precedence: bulk List-ID: X-Mailing-List: kvm@vger.kernel.org X-Virus-Scanned: ClamAV using ClamSMTP Enables the kernel to negotiate VIRTIO_BALLOON_F_HINTING feature with the host. If it is available and page_hinting_flag is set to true, page_hinting is enabled and its callbacks are configured along with the max_pages count which indicates the maximum number of pages that can be isolated and hinted at a time. Currently, only free pages of order >= (MAX_ORDER - 2) are reported. To prevent any false OOM max_pages count is set to 16. By default page_hinting feature is enabled and gets loaded as soon as the virtio-balloon driver is loaded. However, it could be disabled by writing the page_hinting_flag which is a virtio-balloon parameter. Signed-off-by: Nitesh Narayan Lal --- drivers/virtio/Kconfig | 1 + drivers/virtio/virtio_balloon.c | 91 ++++++++++++++++++++++++++++- include/uapi/linux/virtio_balloon.h | 11 ++++ 3 files changed, 102 insertions(+), 1 deletion(-) diff --git a/drivers/virtio/Kconfig b/drivers/virtio/Kconfig index 023fc3bc01c6..dcc0cb4269a5 100644 --- a/drivers/virtio/Kconfig +++ b/drivers/virtio/Kconfig @@ -47,6 +47,7 @@ config VIRTIO_BALLOON tristate "Virtio balloon driver" depends on VIRTIO select MEMORY_BALLOON + select PAGE_HINTING ---help--- This driver supports increasing and decreasing the amount of memory within a KVM guest. diff --git a/drivers/virtio/virtio_balloon.c b/drivers/virtio/virtio_balloon.c index 44339fc87cc7..1fb0eb0b2c20 100644 --- a/drivers/virtio/virtio_balloon.c +++ b/drivers/virtio/virtio_balloon.c @@ -18,6 +18,7 @@ #include #include #include +#include /* * Balloon device works in 4K page units. So each page is pointed to by @@ -35,6 +36,12 @@ /* The size of a free page block in bytes */ #define VIRTIO_BALLOON_FREE_PAGE_SIZE \ (1 << (VIRTIO_BALLOON_FREE_PAGE_ORDER + PAGE_SHIFT)) +/* Number of isolated pages to be reported to the host at a time. + * TODO: + * 1. Set it via host. + * 2. Find an optimal value for this. + */ +#define PAGE_HINTING_MAX_PAGES 16 #ifdef CONFIG_BALLOON_COMPACTION static struct vfsmount *balloon_mnt; @@ -45,6 +52,7 @@ enum virtio_balloon_vq { VIRTIO_BALLOON_VQ_DEFLATE, VIRTIO_BALLOON_VQ_STATS, VIRTIO_BALLOON_VQ_FREE_PAGE, + VIRTIO_BALLOON_VQ_HINTING, VIRTIO_BALLOON_VQ_MAX }; @@ -54,7 +62,8 @@ enum virtio_balloon_config_read { struct virtio_balloon { struct virtio_device *vdev; - struct virtqueue *inflate_vq, *deflate_vq, *stats_vq, *free_page_vq; + struct virtqueue *inflate_vq, *deflate_vq, *stats_vq, *free_page_vq, + *hinting_vq; /* Balloon's own wq for cpu-intensive work items */ struct workqueue_struct *balloon_wq; @@ -112,6 +121,9 @@ struct virtio_balloon { /* To register a shrinker to shrink memory upon memory pressure */ struct shrinker shrinker; + + /* Array object pointing at the isolated pages ready for hinting */ + struct isolated_memory isolated_pages[PAGE_HINTING_MAX_PAGES]; }; static struct virtio_device_id id_table[] = { @@ -119,6 +131,66 @@ static struct virtio_device_id id_table[] = { { 0 }, }; +static struct page_hinting_config page_hinting_conf; +bool page_hinting_flag = true; +struct virtio_balloon *hvb; +module_param(page_hinting_flag, bool, 0444); +MODULE_PARM_DESC(page_hinting_flag, "Enable page hinting"); + +static int page_hinting_report(void) +{ + struct virtqueue *vq = hvb->hinting_vq; + struct scatterlist sg; + int err = 0, unused; + + mutex_lock(&hvb->balloon_lock); + sg_init_one(&sg, hvb->isolated_pages, sizeof(hvb->isolated_pages[0]) * + PAGE_HINTING_MAX_PAGES); + err = virtqueue_add_outbuf(vq, &sg, 1, hvb, GFP_KERNEL); + if (!err) + virtqueue_kick(hvb->hinting_vq); + wait_event(hvb->acked, virtqueue_get_buf(vq, &unused)); + mutex_unlock(&hvb->balloon_lock); + return err; +} + +void hint_pages(struct list_head *pages) +{ + struct device *dev = &hvb->vdev->dev; + struct page *page, *next; + int idx = 0, order, err; + unsigned long pfn; + + list_for_each_entry_safe(page, next, pages, lru) { + pfn = page_to_pfn(page); + order = page_private(page); + hvb->isolated_pages[idx].phys_addr = pfn << PAGE_SHIFT; + hvb->isolated_pages[idx].size = (1 << order) * PAGE_SIZE; + idx++; + } + err = page_hinting_report(); + if (err < 0) + dev_err(dev, "Failed to hint pages, err = %d\n", err); +} + +static void page_hinting_init(struct virtio_balloon *vb) +{ + struct device *dev = &vb->vdev->dev; + int err; + + page_hinting_conf.hint_pages = hint_pages; + page_hinting_conf.max_pages = PAGE_HINTING_MAX_PAGES; + err = page_hinting_enable(&page_hinting_conf); + if (err < 0) { + dev_err(dev, "Failed to enable page-hinting, err = %d\n", err); + page_hinting_flag = false; + page_hinting_conf.hint_pages = NULL; + page_hinting_conf.max_pages = 0; + return; + } + hvb = vb; +} + static u32 page_to_balloon_pfn(struct page *page) { unsigned long pfn = page_to_pfn(page); @@ -475,6 +547,7 @@ static int init_vqs(struct virtio_balloon *vb) names[VIRTIO_BALLOON_VQ_DEFLATE] = "deflate"; names[VIRTIO_BALLOON_VQ_STATS] = NULL; names[VIRTIO_BALLOON_VQ_FREE_PAGE] = NULL; + names[VIRTIO_BALLOON_VQ_HINTING] = NULL; if (virtio_has_feature(vb->vdev, VIRTIO_BALLOON_F_STATS_VQ)) { names[VIRTIO_BALLOON_VQ_STATS] = "stats"; @@ -486,11 +559,18 @@ static int init_vqs(struct virtio_balloon *vb) callbacks[VIRTIO_BALLOON_VQ_FREE_PAGE] = NULL; } + if (virtio_has_feature(vb->vdev, VIRTIO_BALLOON_F_HINTING)) { + names[VIRTIO_BALLOON_VQ_HINTING] = "hinting_vq"; + callbacks[VIRTIO_BALLOON_VQ_HINTING] = balloon_ack; + } err = vb->vdev->config->find_vqs(vb->vdev, VIRTIO_BALLOON_VQ_MAX, vqs, callbacks, names, NULL, NULL); if (err) return err; + if (virtio_has_feature(vb->vdev, VIRTIO_BALLOON_F_HINTING)) + vb->hinting_vq = vqs[VIRTIO_BALLOON_VQ_HINTING]; + vb->inflate_vq = vqs[VIRTIO_BALLOON_VQ_INFLATE]; vb->deflate_vq = vqs[VIRTIO_BALLOON_VQ_DEFLATE]; if (virtio_has_feature(vb->vdev, VIRTIO_BALLOON_F_STATS_VQ)) { @@ -929,6 +1009,9 @@ static int virtballoon_probe(struct virtio_device *vdev) if (err) goto out_del_balloon_wq; } + if (virtio_has_feature(vb->vdev, VIRTIO_BALLOON_F_HINTING) && + page_hinting_flag) + page_hinting_init(vb); virtio_device_ready(vdev); if (towards_target(vb)) @@ -976,6 +1059,10 @@ static void virtballoon_remove(struct virtio_device *vdev) destroy_workqueue(vb->balloon_wq); } + if (!page_hinting_flag) { + hvb = NULL; + page_hinting_disable(); + } remove_common(vb); #ifdef CONFIG_BALLOON_COMPACTION if (vb->vb_dev_info.inode) @@ -1030,8 +1117,10 @@ static unsigned int features[] = { VIRTIO_BALLOON_F_MUST_TELL_HOST, VIRTIO_BALLOON_F_STATS_VQ, VIRTIO_BALLOON_F_DEFLATE_ON_OOM, + VIRTIO_BALLOON_F_HINTING, VIRTIO_BALLOON_F_FREE_PAGE_HINT, VIRTIO_BALLOON_F_PAGE_POISON, + VIRTIO_BALLOON_F_HINTING, }; static struct virtio_driver virtio_balloon_driver = { diff --git a/include/uapi/linux/virtio_balloon.h b/include/uapi/linux/virtio_balloon.h index a1966cd7b677..29eed0ec83d3 100644 --- a/include/uapi/linux/virtio_balloon.h +++ b/include/uapi/linux/virtio_balloon.h @@ -36,6 +36,8 @@ #define VIRTIO_BALLOON_F_DEFLATE_ON_OOM 2 /* Deflate balloon on OOM */ #define VIRTIO_BALLOON_F_FREE_PAGE_HINT 3 /* VQ to report free pages */ #define VIRTIO_BALLOON_F_PAGE_POISON 4 /* Guest is using page poisoning */ +/* TODO: Find a better name to avoid any confusion with FREE_PAGE_HINT */ +#define VIRTIO_BALLOON_F_HINTING 5 /* Page hinting virtqueue */ /* Size of a PFN in the balloon interface. */ #define VIRTIO_BALLOON_PFN_SHIFT 12 @@ -108,4 +110,13 @@ struct virtio_balloon_stat { __virtio64 val; } __attribute__((packed)); +/* + * struct isolated_memory- holds the pages which will be reported to the host. + * @phys_add: physical address associated with a page. + * @size: total size of memory to be reported. + */ +struct isolated_memory { + __virtio64 phys_addr; + __virtio64 size; +}; #endif /* _LINUX_VIRTIO_BALLOON_H */