
[RFC,v10,1/2] mm: page_hinting: core infrastructure

Message ID 20190603170306.49099-2-nitesh@redhat.com (mailing list archive)
State New, archived
Series mm: Support for page hinting

Commit Message

Nitesh Narayan Lal June 3, 2019, 5:03 p.m. UTC
This patch introduces the core infrastructure for free page hinting in
virtual environments. It enables the kernel to track the free pages that
can be reported to its hypervisor, so that the hypervisor can free and
reuse that memory as needed.

While the pages are being processed by the hypervisor (e.g.,
via MADV_FREE), the guest must not use them; otherwise, data loss
would be possible. To avoid such a situation, these pages are
temporarily removed from the buddy. The number of pages temporarily
removed from the buddy is governed by the backend (virtio-balloon
in our case).

To efficiently identify free pages that can be hinted to the
hypervisor, coarse-grained bitmaps are used. Only fairly big chunks -
"MAX_ORDER - 2" on x86 - are reported to the hypervisor, both to avoid
breaking up THP in the hypervisor and to save space. A set bit in the
bitmap indicates that a page *might* be free; it is not a guarantee.
A new hook after buddy merging sets the bits.

Bitmaps are stored per zone, protected by the zone lock. A workqueue
asynchronously processes the bitmaps, trying to isolate and report pages
that are still free. The backend (virtio-balloon) is responsible for
reporting these batched pages to the host synchronously. Once reporting/
freeing is complete, isolated pages are returned to the buddy.

There are still various things to look into (e.g., memory hotplug, more
efficient locking, possible races when disabling).

Signed-off-by: Nitesh Narayan Lal <nitesh@redhat.com>
---
 drivers/virtio/Kconfig       |   1 +
 include/linux/page_hinting.h |  46 +++++++
 mm/Kconfig                   |   6 +
 mm/Makefile                  |   2 +
 mm/page_alloc.c              |  17 +--
 mm/page_hinting.c            | 236 +++++++++++++++++++++++++++++++++++
 6 files changed, 301 insertions(+), 7 deletions(-)
 create mode 100644 include/linux/page_hinting.h
 create mode 100644 mm/page_hinting.c

Comments

Alexander Duyck June 3, 2019, 7:04 p.m. UTC | #1
On Mon, Jun 3, 2019 at 10:04 AM Nitesh Narayan Lal <nitesh@redhat.com> wrote:
>
> This patch introduces the core infrastructure for free page hinting in
> virtual environments. It enables the kernel to track the free pages which
> can be reported to its hypervisor so that the hypervisor could
> free and reuse that memory as per its requirement.
>
> While the pages are getting processed in the hypervisor (e.g.,
> via MADV_FREE), the guest must not use them, otherwise, data loss
> would be possible. To avoid such a situation, these pages are
> temporarily removed from the buddy. The amount of pages removed
> temporarily from the buddy is governed by the backend(virtio-balloon
> in our case).
>
> To efficiently identify free pages that can to be hinted to the
> hypervisor, bitmaps in a coarse granularity are used. Only fairly big
> chunks are reported to the hypervisor - especially, to not break up THP
> in the hypervisor - "MAX_ORDER - 2" on x86, and to save space. The bits
> in the bitmap are an indication whether a page *might* be free, not a
> guarantee. A new hook after buddy merging sets the bits.
>
> Bitmaps are stored per zone, protected by the zone lock. A workqueue
> asynchronously processes the bitmaps, trying to isolate and report pages
> that are still free. The backend (virtio-balloon) is responsible for
> reporting these batched pages to the host synchronously. Once reporting/
> freeing is complete, isolated pages are returned back to the buddy.
>
> There are still various things to look into (e.g., memory hotplug, more
> efficient locking, possible races when disabling).
>
> Signed-off-by: Nitesh Narayan Lal <nitesh@redhat.com>

So one thing I had thought about, which I don't believe has been
addressed in your solution, is a means to guarantee forward progress.
If you have a noisy thread that is allocating and freeing some block of
memory repeatedly, you will be stuck processing that and cannot get to
the other work. Specifically, if you have a zone where somebody is just
cycling the number of pages needed to fill your hinting queue, how do
you get around it and get to the data that is actually cold instead of
getting stuck processing the noise?

Do you have any idea what the hit rate would be on a system that is on
the more active side? From what I can tell you are still effectively
just doing a linear search of memory, but you have the bitmap hints to
tell you what has not been freed recently; however, you still don't
know that the pages you have bitmap hints for are actually free until
you check them.

> ---
>  drivers/virtio/Kconfig       |   1 +
>  include/linux/page_hinting.h |  46 +++++++
>  mm/Kconfig                   |   6 +
>  mm/Makefile                  |   2 +
>  mm/page_alloc.c              |  17 +--
>  mm/page_hinting.c            | 236 +++++++++++++++++++++++++++++++++++
>  6 files changed, 301 insertions(+), 7 deletions(-)
>  create mode 100644 include/linux/page_hinting.h
>  create mode 100644 mm/page_hinting.c
>
> diff --git a/drivers/virtio/Kconfig b/drivers/virtio/Kconfig
> index 35897649c24f..5a96b7a2ed1e 100644
> --- a/drivers/virtio/Kconfig
> +++ b/drivers/virtio/Kconfig
> @@ -46,6 +46,7 @@ config VIRTIO_BALLOON
>         tristate "Virtio balloon driver"
>         depends on VIRTIO
>         select MEMORY_BALLOON
> +       select PAGE_HINTING
>         ---help---
>          This driver supports increasing and decreasing the amount
>          of memory within a KVM guest.
> diff --git a/include/linux/page_hinting.h b/include/linux/page_hinting.h
> new file mode 100644
> index 000000000000..e65188fe1e6b
> --- /dev/null
> +++ b/include/linux/page_hinting.h
> @@ -0,0 +1,46 @@
> +/* SPDX-License-Identifier: GPL-2.0 */
> +#ifndef _LINUX_PAGE_HINTING_H
> +#define _LINUX_PAGE_HINTING_H
> +
> +/*
> + * Minimum page order required for a page to be hinted to the host.
> + */
> +#define PAGE_HINTING_MIN_ORDER         (MAX_ORDER - 2)
> +
> +/*
> + * struct page_hinting_cb: holds the callbacks to store, report and cleanup
> + * isolated pages.
> + * @prepare:           Callback responsible for allocating an array to hold
> + *                     the isolated pages.
> + * @hint_pages:                Callback which reports the isolated pages synchornously
> + *                     to the host.
> + * @cleanup:           Callback to free the the array used for reporting the
> + *                     isolated pages.
> + * @max_pages:         Maxmimum pages that are going to be hinted to the host
> + *                     at a time of granularity >= PAGE_HINTING_MIN_ORDER.
> + */
> +struct page_hinting_cb {
> +       int (*prepare)(void);
> +       void (*hint_pages)(struct list_head *list);
> +       void (*cleanup)(void);
> +       int max_pages;
> +};
> +
> +#ifdef CONFIG_PAGE_HINTING
> +void page_hinting_enqueue(struct page *page, int order);
> +void page_hinting_enable(const struct page_hinting_cb *cb);
> +void page_hinting_disable(void);
> +#else
> +static inline void page_hinting_enqueue(struct page *page, int order)
> +{
> +}
> +
> +static inline void page_hinting_enable(struct page_hinting_cb *cb)
> +{
> +}
> +
> +static inline void page_hinting_disable(void)
> +{
> +}
> +#endif
> +#endif /* _LINUX_PAGE_HINTING_H */
> diff --git a/mm/Kconfig b/mm/Kconfig
> index ee8d1f311858..177d858de758 100644
> --- a/mm/Kconfig
> +++ b/mm/Kconfig
> @@ -764,4 +764,10 @@ config GUP_BENCHMARK
>  config ARCH_HAS_PTE_SPECIAL
>         bool
>
> +# PAGE_HINTING will allow the guest to report the free pages to the
> +# host in regular interval of time.
> +config PAGE_HINTING
> +       bool
> +       def_bool n
> +       depends on X86_64
>  endmenu
> diff --git a/mm/Makefile b/mm/Makefile
> index ac5e5ba78874..bec456dfee34 100644
> --- a/mm/Makefile
> +++ b/mm/Makefile
> @@ -41,6 +41,7 @@ obj-y                 := filemap.o mempool.o oom_kill.o fadvise.o \
>                            interval_tree.o list_lru.o workingset.o \
>                            debug.o $(mmu-y)
>
> +
>  # Give 'page_alloc' its own module-parameter namespace
>  page-alloc-y := page_alloc.o
>  page-alloc-$(CONFIG_SHUFFLE_PAGE_ALLOCATOR) += shuffle.o
> @@ -94,6 +95,7 @@ obj-$(CONFIG_Z3FOLD)  += z3fold.o
>  obj-$(CONFIG_GENERIC_EARLY_IOREMAP) += early_ioremap.o
>  obj-$(CONFIG_CMA)      += cma.o
>  obj-$(CONFIG_MEMORY_BALLOON) += balloon_compaction.o
> +obj-$(CONFIG_PAGE_HINTING) += page_hinting.o
>  obj-$(CONFIG_PAGE_EXTENSION) += page_ext.o
>  obj-$(CONFIG_CMA_DEBUGFS) += cma_debug.o
>  obj-$(CONFIG_USERFAULTFD) += userfaultfd.o
> diff --git a/mm/page_alloc.c b/mm/page_alloc.c
> index 3b13d3914176..d12f69e0e402 100644
> --- a/mm/page_alloc.c
> +++ b/mm/page_alloc.c
> @@ -68,6 +68,7 @@
>  #include <linux/lockdep.h>
>  #include <linux/nmi.h>
>  #include <linux/psi.h>
> +#include <linux/page_hinting.h>
>
>  #include <asm/sections.h>
>  #include <asm/tlbflush.h>
> @@ -873,10 +874,10 @@ compaction_capture(struct capture_control *capc, struct page *page,
>   * -- nyc
>   */
>
> -static inline void __free_one_page(struct page *page,
> +inline void __free_one_page(struct page *page,
>                 unsigned long pfn,
>                 struct zone *zone, unsigned int order,
> -               int migratetype)
> +               int migratetype, bool hint)
>  {
>         unsigned long combined_pfn;
>         unsigned long uninitialized_var(buddy_pfn);
> @@ -951,6 +952,8 @@ static inline void __free_one_page(struct page *page,
>  done_merging:
>         set_page_order(page, order);
>
> +       if (hint)
> +               page_hinting_enqueue(page, order);

This is probably a bit early to be dealing with the hint. You should
look at moving this down to a spot somewhere after the page has been
added to the free list. It may not cause any issues with the current
order setup, but moving it after the addition to the free list will
make it so that you know the page is in there when you call this
function.
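
For illustration, the placement being suggested would look roughly like
this in the tail of __free_one_page() (a sketch only; the surrounding
code is abbreviated and the free-list insertion helper may be named
differently depending on the exact base tree):

done_merging:
	set_page_order(page, order);

	/* ... existing check for a free buddy of the next-highest order ... */

	add_to_free_area(page, &zone->free_area[order], migratetype);

	/* hint only after the page is actually on the free list */
	if (hint)
		page_hinting_enqueue(page, order);
}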

>         /*
>          * If this is not the largest possible page, check if the buddy
>          * of the next-highest order is free. If it is, it's possible
> @@ -1262,7 +1265,7 @@ static void free_pcppages_bulk(struct zone *zone, int count,
>                 if (unlikely(isolated_pageblocks))
>                         mt = get_pageblock_migratetype(page);
>
> -               __free_one_page(page, page_to_pfn(page), zone, 0, mt);
> +               __free_one_page(page, page_to_pfn(page), zone, 0, mt, true);
>                 trace_mm_page_pcpu_drain(page, 0, mt);
>         }
>         spin_unlock(&zone->lock);
> @@ -1271,14 +1274,14 @@ static void free_pcppages_bulk(struct zone *zone, int count,
>  static void free_one_page(struct zone *zone,
>                                 struct page *page, unsigned long pfn,
>                                 unsigned int order,
> -                               int migratetype)
> +                               int migratetype, bool hint)
>  {
>         spin_lock(&zone->lock);
>         if (unlikely(has_isolate_pageblock(zone) ||
>                 is_migrate_isolate(migratetype))) {
>                 migratetype = get_pfnblock_migratetype(page, pfn);
>         }
> -       __free_one_page(page, pfn, zone, order, migratetype);
> +       __free_one_page(page, pfn, zone, order, migratetype, hint);
>         spin_unlock(&zone->lock);
>  }
>
> @@ -1368,7 +1371,7 @@ static void __free_pages_ok(struct page *page, unsigned int order)
>         migratetype = get_pfnblock_migratetype(page, pfn);
>         local_irq_save(flags);
>         __count_vm_events(PGFREE, 1 << order);
> -       free_one_page(page_zone(page), page, pfn, order, migratetype);
> +       free_one_page(page_zone(page), page, pfn, order, migratetype, true);
>         local_irq_restore(flags);
>  }
>
> @@ -2968,7 +2971,7 @@ static void free_unref_page_commit(struct page *page, unsigned long pfn)
>          */
>         if (migratetype >= MIGRATE_PCPTYPES) {
>                 if (unlikely(is_migrate_isolate(migratetype))) {
> -                       free_one_page(zone, page, pfn, 0, migratetype);
> +                       free_one_page(zone, page, pfn, 0, migratetype, true);
>                         return;
>                 }
>                 migratetype = MIGRATE_MOVABLE;

So it looks like you are using a parameter to identify if the page is
a hinted page or not. I guess this works but it seems like it is a bit
intrusive as you are adding an argument to specify that this is a
specific page type.

> diff --git a/mm/page_hinting.c b/mm/page_hinting.c
> new file mode 100644
> index 000000000000..7341c6462de2
> --- /dev/null
> +++ b/mm/page_hinting.c
> @@ -0,0 +1,236 @@
> +// SPDX-License-Identifier: GPL-2.0
> +/*
> + * Page hinting support to enable a VM to report the freed pages back
> + * to the host.
> + *
> + * Copyright Red Hat, Inc. 2019
> + *
> + * Author(s): Nitesh Narayan Lal <nitesh@redhat.com>
> + */
> +
> +#include <linux/mm.h>
> +#include <linux/slab.h>
> +#include <linux/page_hinting.h>
> +#include <linux/kvm_host.h>
> +
> +/*
> + * struct hinting_bitmap: holds the bitmap pointer which tracks the freed PFNs
> + * and other required parameters which could help in retrieving the original
> + * PFN value using the bitmap.
> + * @bitmap:            Pointer to the bitmap of free PFN.
> + * @base_pfn:          Starting PFN value for the zone whose bitmap is stored.
> + * @free_pages:                Tracks the number of free pages of granularity
> + *                     PAGE_HINTING_MIN_ORDER.
> + * @nbits:             Indicates the total size of the bitmap in bits allocated
> + *                     at the time of initialization.
> + */
> +struct hinting_bitmap {
> +       unsigned long *bitmap;
> +       unsigned long base_pfn;
> +       atomic_t free_pages;
> +       unsigned long nbits;
> +} bm_zone[MAX_NR_ZONES];
> +

This ignores NUMA, doesn't it? Shouldn't you have support for other NUMA nodes?

> +static void init_hinting_wq(struct work_struct *work);
> +extern int __isolate_free_page(struct page *page, unsigned int order);
> +extern void __free_one_page(struct page *page, unsigned long pfn,
> +                           struct zone *zone, unsigned int order,
> +                           int migratetype, bool hint);
> +const struct page_hinting_cb *hcb;
> +struct work_struct hinting_work;
> +
> +static unsigned long find_bitmap_size(struct zone *zone)
> +{
> +       unsigned long nbits = ALIGN(zone->spanned_pages,
> +                           PAGE_HINTING_MIN_ORDER);
> +
> +       nbits = nbits >> PAGE_HINTING_MIN_ORDER;
> +       return nbits;
> +}
> +

This doesn't look right to me. You are trying to do something like a
DIV_ROUND_UP here, right? If so shouldn't you be aligning to 1 <<
PAGE_HINTING_MIN_ORDER, instead of just PAGE_HINTING_MIN_ORDER?
Another option would be to just do DIV_ROUND_UP with the 1 <<
PAGE_HINTING_MIN_ORDER value.
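
As a concrete (illustrative) version of that suggestion, the helper
could be written with DIV_ROUND_UP so that a partially covered chunk at
the end of the zone still gets a bit:

static unsigned long find_bitmap_size(struct zone *zone)
{
	/* one bit per (1 << PAGE_HINTING_MIN_ORDER) pages, rounded up */
	return DIV_ROUND_UP(zone->spanned_pages,
			    1UL << PAGE_HINTING_MIN_ORDER);
}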

> +void page_hinting_enable(const struct page_hinting_cb *callback)
> +{
> +       struct zone *zone;
> +       int idx = 0;
> +       unsigned long bitmap_size = 0;
> +
> +       for_each_populated_zone(zone) {

The index for this doesn't match up to the index you used to define
bm_zone. for_each_populated_zone will go through each zone in each
pgdat. Right now you can only handle one pgdat.
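
One illustrative way to address this (the names below are hypothetical,
not part of the patch) is to index the bitmaps by node as well as zone,
so that zones from different pgdats do not collide:

struct hinting_bitmap bm_zone[MAX_NUMNODES][MAX_NR_ZONES];

static struct hinting_bitmap *page_to_hbm(struct page *page)
{
	return &bm_zone[page_to_nid(page)][page_zonenum(page)];
}

The enable/disable loops would then derive the index from
zone_to_nid(zone) and zone_idx(zone) instead of a running counter.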

> +               spin_lock(&zone->lock);
> +               bitmap_size = find_bitmap_size(zone);
> +               bm_zone[idx].bitmap = bitmap_zalloc(bitmap_size, GFP_KERNEL);
> +               if (!bm_zone[idx].bitmap)
> +                       return;
> +               bm_zone[idx].nbits = bitmap_size;
> +               bm_zone[idx].base_pfn = zone->zone_start_pfn;
> +               spin_unlock(&zone->lock);
> +               idx++;
> +       }
> +       hcb = callback;
> +       INIT_WORK(&hinting_work, init_hinting_wq);
> +}
> +EXPORT_SYMBOL_GPL(page_hinting_enable);
> +
> +void page_hinting_disable(void)
> +{
> +       struct zone *zone;
> +       int idx = 0;
> +
> +       cancel_work_sync(&hinting_work);
> +       hcb = NULL;
> +       for_each_populated_zone(zone) {
> +               spin_lock(&zone->lock);
> +               bitmap_free(bm_zone[idx].bitmap);
> +               bm_zone[idx].base_pfn = 0;
> +               bm_zone[idx].nbits = 0;
> +               atomic_set(&bm_zone[idx].free_pages, 0);
> +               spin_unlock(&zone->lock);
> +               idx++;
> +       }
> +}
> +EXPORT_SYMBOL_GPL(page_hinting_disable);
> +
> +static unsigned long pfn_to_bit(struct page *page, int zonenum)
> +{
> +       unsigned long bitnr;
> +
> +       bitnr = (page_to_pfn(page) - bm_zone[zonenum].base_pfn)
> +                        >> PAGE_HINTING_MIN_ORDER;
> +       return bitnr;
> +}
> +
> +static void release_buddy_pages(struct list_head *pages)
> +{
> +       int mt = 0, zonenum, order;
> +       struct page *page, *next;
> +       struct zone *zone;
> +       unsigned long bitnr;
> +
> +       list_for_each_entry_safe(page, next, pages, lru) {
> +               zonenum = page_zonenum(page);
> +               zone = page_zone(page);
> +               bitnr = pfn_to_bit(page, zonenum);
> +               spin_lock(&zone->lock);
> +               list_del(&page->lru);
> +               order = page_private(page);
> +               set_page_private(page, 0);
> +               mt = get_pageblock_migratetype(page);
> +               __free_one_page(page, page_to_pfn(page), zone,
> +                               order, mt, false);
> +               spin_unlock(&zone->lock);
> +       }
> +}
> +
> +static void bm_set_pfn(struct page *page)
> +{
> +       unsigned long bitnr = 0;
> +       int zonenum = page_zonenum(page);
> +       struct zone *zone = page_zone(page);
> +
> +       lockdep_assert_held(&zone->lock);
> +       bitnr = pfn_to_bit(page, zonenum);
> +       if (bm_zone[zonenum].bitmap &&
> +           bitnr < bm_zone[zonenum].nbits &&
> +           !test_and_set_bit(bitnr, bm_zone[zonenum].bitmap))
> +               atomic_inc(&bm_zone[zonenum].free_pages);
> +}
> +
> +static void scan_hinting_bitmap(int zonenum, int free_pages)
> +{
> +       unsigned long set_bit, start = 0;
> +       struct page *page;
> +       struct zone *zone;
> +       int scanned_pages = 0, ret = 0, order, isolated_cnt = 0;
> +       LIST_HEAD(isolated_pages);
> +
> +       ret = hcb->prepare();
> +       if (ret < 0)
> +               return;
> +       for (;;) {
> +               ret = 0;
> +               set_bit = find_next_bit(bm_zone[zonenum].bitmap,
> +                                       bm_zone[zonenum].nbits, start);
> +               if (set_bit >= bm_zone[zonenum].nbits)
> +                       break;
> +               page = pfn_to_online_page((set_bit << PAGE_HINTING_MIN_ORDER) +
> +                               bm_zone[zonenum].base_pfn);
> +               if (!page)
> +                       continue;
> +               zone = page_zone(page);
> +               spin_lock(&zone->lock);
> +
> +               if (PageBuddy(page) && page_private(page) >=
> +                   PAGE_HINTING_MIN_ORDER) {
> +                       order = page_private(page);
> +                       ret = __isolate_free_page(page, order);
> +               }
> +               clear_bit(set_bit, bm_zone[zonenum].bitmap);
> +               spin_unlock(&zone->lock);
> +               if (ret) {
> +                       /*
> +                        * restoring page order to use it while releasing
> +                        * the pages back to the buddy.
> +                        */
> +                       set_page_private(page, order);
> +                       list_add_tail(&page->lru, &isolated_pages);
> +                       isolated_cnt++;
> +                       if (isolated_cnt == hcb->max_pages) {
> +                               hcb->hint_pages(&isolated_pages);
> +                               release_buddy_pages(&isolated_pages);
> +                               isolated_cnt = 0;
> +                       }
> +               }
> +               start = set_bit + 1;
> +               scanned_pages++;
> +       }
> +       if (isolated_cnt) {
> +               hcb->hint_pages(&isolated_pages);
> +               release_buddy_pages(&isolated_pages);
> +       }
> +       hcb->cleanup();
> +       if (scanned_pages > free_pages)
> +               atomic_sub((scanned_pages - free_pages),
> +                          &bm_zone[zonenum].free_pages);
> +}
> +
> +static bool check_hinting_threshold(void)
> +{
> +       int zonenum = 0;
> +
> +       for (; zonenum < MAX_NR_ZONES; zonenum++) {
> +               if (atomic_read(&bm_zone[zonenum].free_pages) >=
> +                               hcb->max_pages)
> +                       return true;
> +       }
> +       return false;
> +}
> +
> +static void init_hinting_wq(struct work_struct *work)
> +{
> +       int zonenum = 0, free_pages = 0;
> +
> +       for (; zonenum < MAX_NR_ZONES; zonenum++) {
> +               free_pages = atomic_read(&bm_zone[zonenum].free_pages);
> +               if (free_pages >= hcb->max_pages) {
> +                       /* Find a better way to synchronize per zone
> +                        * free_pages.
> +                        */
> +                       atomic_sub(free_pages,
> +                                  &bm_zone[zonenum].free_pages);
> +                       scan_hinting_bitmap(zonenum, free_pages);
> +               }
> +       }
> +}
> +
> +void page_hinting_enqueue(struct page *page, int order)
> +{
> +       if (hcb && order >= PAGE_HINTING_MIN_ORDER)
> +               bm_set_pfn(page);
> +       else
> +               return;

You could probably flip the logic and save yourself an "else" by just
doing something like:
if (!hcb || order < PAGE_HINTING_MIN_ORDER)
        return;

I think it would also make this more readable.
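
Applied to the whole function, that suggestion would look roughly like
this (illustrative only, keeping the rest of the logic unchanged):

void page_hinting_enqueue(struct page *page, int order)
{
	if (!hcb || order < PAGE_HINTING_MIN_ORDER)
		return;

	bm_set_pfn(page);

	if (check_hinting_threshold())
		queue_work_on(smp_processor_id(), system_wq, &hinting_work);
}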

> +
> +       if (check_hinting_threshold()) {
> +               int cpu = smp_processor_id();
> +
> +               queue_work_on(cpu, system_wq, &hinting_work);
> +       }
> +}
> --
> 2.21.0
>
David Hildenbrand June 3, 2019, 7:57 p.m. UTC | #2
On 03.06.19 19:03, Nitesh Narayan Lal wrote:
> This patch introduces the core infrastructure for free page hinting in
> virtual environments. It enables the kernel to track the free pages which
> can be reported to its hypervisor so that the hypervisor could
> free and reuse that memory as per its requirement.
> 
> While the pages are getting processed in the hypervisor (e.g.,
> via MADV_FREE), the guest must not use them, otherwise, data loss
> would be possible. To avoid such a situation, these pages are
> temporarily removed from the buddy. The amount of pages removed
> temporarily from the buddy is governed by the backend(virtio-balloon
> in our case).
> 
> To efficiently identify free pages that can to be hinted to the
> hypervisor, bitmaps in a coarse granularity are used. Only fairly big
> chunks are reported to the hypervisor - especially, to not break up THP
> in the hypervisor - "MAX_ORDER - 2" on x86, and to save space. The bits
> in the bitmap are an indication whether a page *might* be free, not a
> guarantee. A new hook after buddy merging sets the bits.
> 
> Bitmaps are stored per zone, protected by the zone lock. A workqueue
> asynchronously processes the bitmaps, trying to isolate and report pages
> that are still free. The backend (virtio-balloon) is responsible for
> reporting these batched pages to the host synchronously. Once reporting/
> freeing is complete, isolated pages are returned back to the buddy.
> 
> There are still various things to look into (e.g., memory hotplug, more
> efficient locking, possible races when disabling).
> 
> Signed-off-by: Nitesh Narayan Lal <nitesh@redhat.com>
> ---
>  drivers/virtio/Kconfig       |   1 +
>  include/linux/page_hinting.h |  46 +++++++
>  mm/Kconfig                   |   6 +
>  mm/Makefile                  |   2 +
>  mm/page_alloc.c              |  17 +--
>  mm/page_hinting.c            | 236 +++++++++++++++++++++++++++++++++++
>  6 files changed, 301 insertions(+), 7 deletions(-)
>  create mode 100644 include/linux/page_hinting.h
>  create mode 100644 mm/page_hinting.c
> 
> diff --git a/drivers/virtio/Kconfig b/drivers/virtio/Kconfig
> index 35897649c24f..5a96b7a2ed1e 100644
> --- a/drivers/virtio/Kconfig
> +++ b/drivers/virtio/Kconfig
> @@ -46,6 +46,7 @@ config VIRTIO_BALLOON
>  	tristate "Virtio balloon driver"
>  	depends on VIRTIO
>  	select MEMORY_BALLOON
> +	select PAGE_HINTING
>  	---help---
>  	 This driver supports increasing and decreasing the amount
>  	 of memory within a KVM guest.
> diff --git a/include/linux/page_hinting.h b/include/linux/page_hinting.h
> new file mode 100644
> index 000000000000..e65188fe1e6b
> --- /dev/null
> +++ b/include/linux/page_hinting.h
> @@ -0,0 +1,46 @@
> +/* SPDX-License-Identifier: GPL-2.0 */
> +#ifndef _LINUX_PAGE_HINTING_H
> +#define _LINUX_PAGE_HINTING_H
> +
> +/*
> + * Minimum page order required for a page to be hinted to the host.
> + */
> +#define PAGE_HINTING_MIN_ORDER		(MAX_ORDER - 2)
> +
> +/*
> + * struct page_hinting_cb: holds the callbacks to store, report and cleanup
> + * isolated pages.
> + * @prepare:		Callback responsible for allocating an array to hold
> + *			the isolated pages.
> + * @hint_pages:		Callback which reports the isolated pages synchornously
> + *			to the host.
> + * @cleanup:		Callback to free the the array used for reporting the
> + *			isolated pages.
> + * @max_pages:		Maxmimum pages that are going to be hinted to the host
> + *			at a time of granularity >= PAGE_HINTING_MIN_ORDER.
> + */
> +struct page_hinting_cb {
> +	int (*prepare)(void);
> +	void (*hint_pages)(struct list_head *list);
> +	void (*cleanup)(void);
> +	int max_pages;

If we allocate the array in virtio-balloon differently (e.g. similar to
bulk inflation/deflation of pfn's right now), we can most probably get
rid of prepare() and cleanup(), simplifying the code further.
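
A minimal sketch of what that simplification might look like (purely
illustrative, assuming the driver preallocates its own array sized for
max_pages):

struct page_hinting_cb {
	void (*hint_pages)(struct list_head *list);
	int max_pages;
};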

> +};
> +
> +#ifdef CONFIG_PAGE_HINTING
> +void page_hinting_enqueue(struct page *page, int order);
> +void page_hinting_enable(const struct page_hinting_cb *cb);
> +void page_hinting_disable(void);
> +#else
> +static inline void page_hinting_enqueue(struct page *page, int order)
> +{
> +}
> +
> +static inline void page_hinting_enable(struct page_hinting_cb *cb)
> +{
> +}
> +
> +static inline void page_hinting_disable(void)
> +{
> +}
> +#endif
> +#endif /* _LINUX_PAGE_HINTING_H */
> diff --git a/mm/Kconfig b/mm/Kconfig
> index ee8d1f311858..177d858de758 100644
> --- a/mm/Kconfig
> +++ b/mm/Kconfig
> @@ -764,4 +764,10 @@ config GUP_BENCHMARK
>  config ARCH_HAS_PTE_SPECIAL
>  	bool
>  
> +# PAGE_HINTING will allow the guest to report the free pages to the
> +# host in regular interval of time.
> +config PAGE_HINTING
> +       bool
> +       def_bool n
> +       depends on X86_64
>  endmenu
> diff --git a/mm/Makefile b/mm/Makefile
> index ac5e5ba78874..bec456dfee34 100644
> --- a/mm/Makefile
> +++ b/mm/Makefile
> @@ -41,6 +41,7 @@ obj-y			:= filemap.o mempool.o oom_kill.o fadvise.o \
>  			   interval_tree.o list_lru.o workingset.o \
>  			   debug.o $(mmu-y)
>  
> +
>  # Give 'page_alloc' its own module-parameter namespace
>  page-alloc-y := page_alloc.o
>  page-alloc-$(CONFIG_SHUFFLE_PAGE_ALLOCATOR) += shuffle.o
> @@ -94,6 +95,7 @@ obj-$(CONFIG_Z3FOLD)	+= z3fold.o
>  obj-$(CONFIG_GENERIC_EARLY_IOREMAP) += early_ioremap.o
>  obj-$(CONFIG_CMA)	+= cma.o
>  obj-$(CONFIG_MEMORY_BALLOON) += balloon_compaction.o
> +obj-$(CONFIG_PAGE_HINTING) += page_hinting.o
>  obj-$(CONFIG_PAGE_EXTENSION) += page_ext.o
>  obj-$(CONFIG_CMA_DEBUGFS) += cma_debug.o
>  obj-$(CONFIG_USERFAULTFD) += userfaultfd.o
> diff --git a/mm/page_alloc.c b/mm/page_alloc.c
> index 3b13d3914176..d12f69e0e402 100644
> --- a/mm/page_alloc.c
> +++ b/mm/page_alloc.c
> @@ -68,6 +68,7 @@
>  #include <linux/lockdep.h>
>  #include <linux/nmi.h>
>  #include <linux/psi.h>
> +#include <linux/page_hinting.h>
>  
>  #include <asm/sections.h>
>  #include <asm/tlbflush.h>
> @@ -873,10 +874,10 @@ compaction_capture(struct capture_control *capc, struct page *page,
>   * -- nyc
>   */
>  
> -static inline void __free_one_page(struct page *page,
> +inline void __free_one_page(struct page *page,
>  		unsigned long pfn,
>  		struct zone *zone, unsigned int order,
> -		int migratetype)
> +		int migratetype, bool hint)
>  {
>  	unsigned long combined_pfn;
>  	unsigned long uninitialized_var(buddy_pfn);
> @@ -951,6 +952,8 @@ static inline void __free_one_page(struct page *page,
>  done_merging:
>  	set_page_order(page, order);
>  
> +	if (hint)
> +		page_hinting_enqueue(page, order);
>  	/*
>  	 * If this is not the largest possible page, check if the buddy
>  	 * of the next-highest order is free. If it is, it's possible
> @@ -1262,7 +1265,7 @@ static void free_pcppages_bulk(struct zone *zone, int count,
>  		if (unlikely(isolated_pageblocks))
>  			mt = get_pageblock_migratetype(page);
>  
> -		__free_one_page(page, page_to_pfn(page), zone, 0, mt);
> +		__free_one_page(page, page_to_pfn(page), zone, 0, mt, true);
>  		trace_mm_page_pcpu_drain(page, 0, mt);
>  	}
>  	spin_unlock(&zone->lock);
> @@ -1271,14 +1274,14 @@ static void free_pcppages_bulk(struct zone *zone, int count,
>  static void free_one_page(struct zone *zone,
>  				struct page *page, unsigned long pfn,
>  				unsigned int order,
> -				int migratetype)
> +				int migratetype, bool hint)
>  {
>  	spin_lock(&zone->lock);
>  	if (unlikely(has_isolate_pageblock(zone) ||
>  		is_migrate_isolate(migratetype))) {
>  		migratetype = get_pfnblock_migratetype(page, pfn);
>  	}
> -	__free_one_page(page, pfn, zone, order, migratetype);
> +	__free_one_page(page, pfn, zone, order, migratetype, hint);
>  	spin_unlock(&zone->lock);
>  }
>  
> @@ -1368,7 +1371,7 @@ static void __free_pages_ok(struct page *page, unsigned int order)
>  	migratetype = get_pfnblock_migratetype(page, pfn);
>  	local_irq_save(flags);
>  	__count_vm_events(PGFREE, 1 << order);
> -	free_one_page(page_zone(page), page, pfn, order, migratetype);
> +	free_one_page(page_zone(page), page, pfn, order, migratetype, true);
>  	local_irq_restore(flags);
>  }
>  
> @@ -2968,7 +2971,7 @@ static void free_unref_page_commit(struct page *page, unsigned long pfn)
>  	 */
>  	if (migratetype >= MIGRATE_PCPTYPES) {
>  		if (unlikely(is_migrate_isolate(migratetype))) {
> -			free_one_page(zone, page, pfn, 0, migratetype);
> +			free_one_page(zone, page, pfn, 0, migratetype, true);
>  			return;
>  		}
>  		migratetype = MIGRATE_MOVABLE;
> diff --git a/mm/page_hinting.c b/mm/page_hinting.c
> new file mode 100644
> index 000000000000..7341c6462de2
> --- /dev/null
> +++ b/mm/page_hinting.c
> @@ -0,0 +1,236 @@
> +// SPDX-License-Identifier: GPL-2.0
> +/*
> + * Page hinting support to enable a VM to report the freed pages back
> + * to the host.
> + *
> + * Copyright Red Hat, Inc. 2019
> + *
> + * Author(s): Nitesh Narayan Lal <nitesh@redhat.com>
> + */
> +
> +#include <linux/mm.h>
> +#include <linux/slab.h>
> +#include <linux/page_hinting.h>
> +#include <linux/kvm_host.h>
> +
> +/*
> + * struct hinting_bitmap: holds the bitmap pointer which tracks the freed PFNs
> + * and other required parameters which could help in retrieving the original
> + * PFN value using the bitmap.
> + * @bitmap:		Pointer to the bitmap of free PFN.
> + * @base_pfn:		Starting PFN value for the zone whose bitmap is stored.
> + * @free_pages:		Tracks the number of free pages of granularity
> + *			PAGE_HINTING_MIN_ORDER.
> + * @nbits:		Indicates the total size of the bitmap in bits allocated
> + *			at the time of initialization.
> + */
> +struct hinting_bitmap {
> +	unsigned long *bitmap;
> +	unsigned long base_pfn;
> +	atomic_t free_pages;
> +	unsigned long nbits;
> +} bm_zone[MAX_NR_ZONES];
> +
> +static void init_hinting_wq(struct work_struct *work);
> +extern int __isolate_free_page(struct page *page, unsigned int order);
> +extern void __free_one_page(struct page *page, unsigned long pfn,
> +			    struct zone *zone, unsigned int order,
> +			    int migratetype, bool hint);
> +const struct page_hinting_cb *hcb;
> +struct work_struct hinting_work;
> +
> +static unsigned long find_bitmap_size(struct zone *zone)
> +{
> +	unsigned long nbits = ALIGN(zone->spanned_pages,
> +			    PAGE_HINTING_MIN_ORDER);
> +
> +	nbits = nbits >> PAGE_HINTING_MIN_ORDER;
> +	return nbits;

I think we can simplify this to

return (zone->spanned_pages >> PAGE_HINTING_MIN_ORDER) + 1;

> +}
> +
> +void page_hinting_enable(const struct page_hinting_cb *callback)
> +{
> +	struct zone *zone;
> +	int idx = 0;
> +	unsigned long bitmap_size = 0;

You should probably protect enabling via a mutex and return -EINVAL or
similar if we already have a callback set (if we ever have different
drivers). But this has very little priority :)
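
A rough sketch of such a guard (illustrative only; the function would
then need an int return instead of void, and the allocation path is
abbreviated):

static DEFINE_MUTEX(hinting_mutex);

int page_hinting_enable(const struct page_hinting_cb *callback)
{
	int ret = -EINVAL;

	mutex_lock(&hinting_mutex);
	if (!hcb) {
		/* ... allocate the per-zone bitmaps, INIT_WORK() ... */
		hcb = callback;
		ret = 0;
	}
	mutex_unlock(&hinting_mutex);
	return ret;
}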

> +
> +	for_each_populated_zone(zone) {
> +		spin_lock(&zone->lock);
> +		bitmap_size = find_bitmap_size(zone);
> +		bm_zone[idx].bitmap = bitmap_zalloc(bitmap_size, GFP_KERNEL);
> +		if (!bm_zone[idx].bitmap)
> +			return;
> +		bm_zone[idx].nbits = bitmap_size;
> +		bm_zone[idx].base_pfn = zone->zone_start_pfn;
> +		spin_unlock(&zone->lock);
> +		idx++;
> +	}
> +	hcb = callback;
> +	INIT_WORK(&hinting_work, init_hinting_wq);

There are also possible races when enabling that you will have to take
care of at some point.

> +}
> +EXPORT_SYMBOL_GPL(page_hinting_enable);
> +
> +void page_hinting_disable(void)
> +{
> +	struct zone *zone;
> +	int idx = 0;
> +
> +	cancel_work_sync(&hinting_work);
> +	hcb = NULL;
> +	for_each_populated_zone(zone) {
> +		spin_lock(&zone->lock);
> +		bitmap_free(bm_zone[idx].bitmap);
> +		bm_zone[idx].base_pfn = 0;
> +		bm_zone[idx].nbits = 0;
> +		atomic_set(&bm_zone[idx].free_pages, 0);
> +		spin_unlock(&zone->lock);
> +		idx++;
> +	}
> +}
> +EXPORT_SYMBOL_GPL(page_hinting_disable);
> +
> +static unsigned long pfn_to_bit(struct page *page, int zonenum)
> +{
> +	unsigned long bitnr;
> +
> +	bitnr = (page_to_pfn(page) - bm_zone[zonenum].base_pfn)
> +			 >> PAGE_HINTING_MIN_ORDER;
> +	return bitnr;
> +}
> +
> +static void release_buddy_pages(struct list_head *pages)

maybe "release_isolated_pages", not sure.

> +{
> +	int mt = 0, zonenum, order;
> +	struct page *page, *next;
> +	struct zone *zone;
> +	unsigned long bitnr;
> +
> +	list_for_each_entry_safe(page, next, pages, lru) {
> +		zonenum = page_zonenum(page);
> +		zone = page_zone(page);
> +		bitnr = pfn_to_bit(page, zonenum);
> +		spin_lock(&zone->lock);
> +		list_del(&page->lru);
> +		order = page_private(page);
> +		set_page_private(page, 0);
> +		mt = get_pageblock_migratetype(page);
> +		__free_one_page(page, page_to_pfn(page), zone,
> +				order, mt, false);
> +		spin_unlock(&zone->lock);
> +	}
> +}
> +
> +static void bm_set_pfn(struct page *page)
> +{
> +	unsigned long bitnr = 0;
> +	int zonenum = page_zonenum(page);
> +	struct zone *zone = page_zone(page);
> +
> +	lockdep_assert_held(&zone->lock);
> +	bitnr = pfn_to_bit(page, zonenum);
> +	if (bm_zone[zonenum].bitmap &&
> +	    bitnr < bm_zone[zonenum].nbits &&
> +	    !test_and_set_bit(bitnr, bm_zone[zonenum].bitmap))
> +		atomic_inc(&bm_zone[zonenum].free_pages);
> +}
> +
> +static void scan_hinting_bitmap(int zonenum, int free_pages)
> +{
> +	unsigned long set_bit, start = 0;
> +	struct page *page;
> +	struct zone *zone;
> +	int scanned_pages = 0, ret = 0, order, isolated_cnt = 0;
> +	LIST_HEAD(isolated_pages);
> +
> +	ret = hcb->prepare();
> +	if (ret < 0)
> +		return;
> +	for (;;) {
> +		ret = 0;
> +		set_bit = find_next_bit(bm_zone[zonenum].bitmap,
> +					bm_zone[zonenum].nbits, start);
> +		if (set_bit >= bm_zone[zonenum].nbits)
> +			break;
> +		page = pfn_to_online_page((set_bit << PAGE_HINTING_MIN_ORDER) +
> +				bm_zone[zonenum].base_pfn);
> +		if (!page)
> +			continue;

You are not clearing the bit / decrementing the counter.
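
A minimal fix along those lines could look like this (illustrative
only); note that start also has to be advanced, otherwise find_next_bit()
keeps returning the same offset, and counting the slot as scanned lets
the existing accounting at the end of the function correct free_pages:

		if (!page) {
			clear_bit(set_bit, bm_zone[zonenum].bitmap);
			start = set_bit + 1;
			scanned_pages++;
			continue;
		}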

> +		zone = page_zone(page);
> +		spin_lock(&zone->lock);
> +
> +		if (PageBuddy(page) && page_private(page) >=
> +		    PAGE_HINTING_MIN_ORDER) {
> +			order = page_private(page);
> +			ret = __isolate_free_page(page, order);
> +		}
> +		clear_bit(set_bit, bm_zone[zonenum].bitmap);
> +		spin_unlock(&zone->lock);
> +		if (ret) {
> +			/*
> +			 * restoring page order to use it while releasing
> +			 * the pages back to the buddy.
> +			 */
> +			set_page_private(page, order);
> +			list_add_tail(&page->lru, &isolated_pages);
> +			isolated_cnt++;
> +			if (isolated_cnt == hcb->max_pages) {
> +				hcb->hint_pages(&isolated_pages);
> +				release_buddy_pages(&isolated_pages);
> +				isolated_cnt = 0;
> +			}
> +		}
> +		start = set_bit + 1;
> +		scanned_pages++;
> +	}
> +	if (isolated_cnt) {
> +		hcb->hint_pages(&isolated_pages);
> +		release_buddy_pages(&isolated_pages);
> +	}
> +	hcb->cleanup();
> +	if (scanned_pages > free_pages)
> +		atomic_sub((scanned_pages - free_pages),
> +			   &bm_zone[zonenum].free_pages);

This looks overly complicated. Can't we somehow simply decrement when
clearing a bit?
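
For instance (an illustrative sketch, assuming the bulk atomic_sub() in
init_hinting_wq() and the scanned_pages correction at the end of the
scan are dropped at the same time), the clear inside the loop could
pair the two operations:

		if (test_and_clear_bit(set_bit, bm_zone[zonenum].bitmap))
			atomic_dec(&bm_zone[zonenum].free_pages);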

> +}
> +
> +static bool check_hinting_threshold(void)
> +{
> +	int zonenum = 0;
> +
> +	for (; zonenum < MAX_NR_ZONES; zonenum++) {
> +		if (atomic_read(&bm_zone[zonenum].free_pages) >=
> +				hcb->max_pages)
> +			return true;
> +	}
> +	return false;
> +}
> +
> +static void init_hinting_wq(struct work_struct *work)
> +{
> +	int zonenum = 0, free_pages = 0;
> +
> +	for (; zonenum < MAX_NR_ZONES; zonenum++) {
> +		free_pages = atomic_read(&bm_zone[zonenum].free_pages);
> +		if (free_pages >= hcb->max_pages) {
> +			/* Find a better way to synchronize per zone
> +			 * free_pages.
> +			 */
> +			atomic_sub(free_pages,
> +				   &bm_zone[zonenum].free_pages);

I can't follow yet why we need that information. Wouldn't it be enough
to just track the number of set bits in the bitmap and start hinting
depending on that count? (there are false positives, but do we really care?)

> +			scan_hinting_bitmap(zonenum, free_pages);
> +		}
> +	}
> +}
> +
> +void page_hinting_enqueue(struct page *page, int order)
> +{
> +	if (hcb && order >= PAGE_HINTING_MIN_ORDER)
> +		bm_set_pfn(page);
> +	else
> +		return;
> +
> +	if (check_hinting_threshold()) {
> +		int cpu = smp_processor_id();
> +
> +		queue_work_on(cpu, system_wq, &hinting_work);
> +	}
> +}
>
Nitesh Narayan Lal June 4, 2019, 12:55 p.m. UTC | #3
On 6/3/19 3:04 PM, Alexander Duyck wrote:
> On Mon, Jun 3, 2019 at 10:04 AM Nitesh Narayan Lal <nitesh@redhat.com> wrote:
>> This patch introduces the core infrastructure for free page hinting in
>> virtual environments. It enables the kernel to track the free pages which
>> can be reported to its hypervisor so that the hypervisor could
>> free and reuse that memory as per its requirement.
>>
>> While the pages are getting processed in the hypervisor (e.g.,
>> via MADV_FREE), the guest must not use them, otherwise, data loss
>> would be possible. To avoid such a situation, these pages are
>> temporarily removed from the buddy. The amount of pages removed
>> temporarily from the buddy is governed by the backend(virtio-balloon
>> in our case).
>>
>> To efficiently identify free pages that can to be hinted to the
>> hypervisor, bitmaps in a coarse granularity are used. Only fairly big
>> chunks are reported to the hypervisor - especially, to not break up THP
>> in the hypervisor - "MAX_ORDER - 2" on x86, and to save space. The bits
>> in the bitmap are an indication whether a page *might* be free, not a
>> guarantee. A new hook after buddy merging sets the bits.
>>
>> Bitmaps are stored per zone, protected by the zone lock. A workqueue
>> asynchronously processes the bitmaps, trying to isolate and report pages
>> that are still free. The backend (virtio-balloon) is responsible for
>> reporting these batched pages to the host synchronously. Once reporting/
>> freeing is complete, isolated pages are returned back to the buddy.
>>
>> There are still various things to look into (e.g., memory hotplug, more
>> efficient locking, possible races when disabling).
>>
>> Signed-off-by: Nitesh Narayan Lal <nitesh@redhat.com>
> So one thing I had thought about, that I don't believe that has been
> addressed in your solution, is to determine a means to guarantee
> forward progress. If you have a noisy thread that is allocating and
> freeing some block of memory repeatedly you will be stuck processing
> that and cannot get to the other work. Specifically if you have a zone
> where somebody is just cycling the number of pages needed to fill your
> hinting queue how do you get around it and get to the data that is
> actually code instead of getting stuck processing the noise?

It should not matter. Every time the memory threshold is met, the
entire bitmap is scanned for possible isolation, not just a chunk of
memory. This will guarantee forward progress.

> Do you have any idea what the hit rate would be on a system that is on
> the more active side? From what I can tell you still are effectively
> just doing a linear search of memory, but you have the bitmap hints to
> tell what as not been freed recently, however you still don't know
> that the pages you have bitmap hints for are actually free until you
> check them.
>
>> ---
>>  drivers/virtio/Kconfig       |   1 +
>>  include/linux/page_hinting.h |  46 +++++++
>>  mm/Kconfig                   |   6 +
>>  mm/Makefile                  |   2 +
>>  mm/page_alloc.c              |  17 +--
>>  mm/page_hinting.c            | 236 +++++++++++++++++++++++++++++++++++
>>  6 files changed, 301 insertions(+), 7 deletions(-)
>>  create mode 100644 include/linux/page_hinting.h
>>  create mode 100644 mm/page_hinting.c
>>
>> diff --git a/drivers/virtio/Kconfig b/drivers/virtio/Kconfig
>> index 35897649c24f..5a96b7a2ed1e 100644
>> --- a/drivers/virtio/Kconfig
>> +++ b/drivers/virtio/Kconfig
>> @@ -46,6 +46,7 @@ config VIRTIO_BALLOON
>>         tristate "Virtio balloon driver"
>>         depends on VIRTIO
>>         select MEMORY_BALLOON
>> +       select PAGE_HINTING
>>         ---help---
>>          This driver supports increasing and decreasing the amount
>>          of memory within a KVM guest.
>> diff --git a/include/linux/page_hinting.h b/include/linux/page_hinting.h
>> new file mode 100644
>> index 000000000000..e65188fe1e6b
>> --- /dev/null
>> +++ b/include/linux/page_hinting.h
>> @@ -0,0 +1,46 @@
>> +/* SPDX-License-Identifier: GPL-2.0 */
>> +#ifndef _LINUX_PAGE_HINTING_H
>> +#define _LINUX_PAGE_HINTING_H
>> +
>> +/*
>> + * Minimum page order required for a page to be hinted to the host.
>> + */
>> +#define PAGE_HINTING_MIN_ORDER         (MAX_ORDER - 2)
>> +
>> +/*
>> + * struct page_hinting_cb: holds the callbacks to store, report and cleanup
>> + * isolated pages.
>> + * @prepare:           Callback responsible for allocating an array to hold
>> + *                     the isolated pages.
>> + * @hint_pages:                Callback which reports the isolated pages synchornously
>> + *                     to the host.
>> + * @cleanup:           Callback to free the the array used for reporting the
>> + *                     isolated pages.
>> + * @max_pages:         Maxmimum pages that are going to be hinted to the host
>> + *                     at a time of granularity >= PAGE_HINTING_MIN_ORDER.
>> + */
>> +struct page_hinting_cb {
>> +       int (*prepare)(void);
>> +       void (*hint_pages)(struct list_head *list);
>> +       void (*cleanup)(void);
>> +       int max_pages;
>> +};
>> +
>> +#ifdef CONFIG_PAGE_HINTING
>> +void page_hinting_enqueue(struct page *page, int order);
>> +void page_hinting_enable(const struct page_hinting_cb *cb);
>> +void page_hinting_disable(void);
>> +#else
>> +static inline void page_hinting_enqueue(struct page *page, int order)
>> +{
>> +}
>> +
>> +static inline void page_hinting_enable(struct page_hinting_cb *cb)
>> +{
>> +}
>> +
>> +static inline void page_hinting_disable(void)
>> +{
>> +}
>> +#endif
>> +#endif /* _LINUX_PAGE_HINTING_H */
>> diff --git a/mm/Kconfig b/mm/Kconfig
>> index ee8d1f311858..177d858de758 100644
>> --- a/mm/Kconfig
>> +++ b/mm/Kconfig
>> @@ -764,4 +764,10 @@ config GUP_BENCHMARK
>>  config ARCH_HAS_PTE_SPECIAL
>>         bool
>>
>> +# PAGE_HINTING will allow the guest to report the free pages to the
>> +# host in regular interval of time.
>> +config PAGE_HINTING
>> +       bool
>> +       def_bool n
>> +       depends on X86_64
>>  endmenu
>> diff --git a/mm/Makefile b/mm/Makefile
>> index ac5e5ba78874..bec456dfee34 100644
>> --- a/mm/Makefile
>> +++ b/mm/Makefile
>> @@ -41,6 +41,7 @@ obj-y                 := filemap.o mempool.o oom_kill.o fadvise.o \
>>                            interval_tree.o list_lru.o workingset.o \
>>                            debug.o $(mmu-y)
>>
>> +
>>  # Give 'page_alloc' its own module-parameter namespace
>>  page-alloc-y := page_alloc.o
>>  page-alloc-$(CONFIG_SHUFFLE_PAGE_ALLOCATOR) += shuffle.o
>> @@ -94,6 +95,7 @@ obj-$(CONFIG_Z3FOLD)  += z3fold.o
>>  obj-$(CONFIG_GENERIC_EARLY_IOREMAP) += early_ioremap.o
>>  obj-$(CONFIG_CMA)      += cma.o
>>  obj-$(CONFIG_MEMORY_BALLOON) += balloon_compaction.o
>> +obj-$(CONFIG_PAGE_HINTING) += page_hinting.o
>>  obj-$(CONFIG_PAGE_EXTENSION) += page_ext.o
>>  obj-$(CONFIG_CMA_DEBUGFS) += cma_debug.o
>>  obj-$(CONFIG_USERFAULTFD) += userfaultfd.o
>> diff --git a/mm/page_alloc.c b/mm/page_alloc.c
>> index 3b13d3914176..d12f69e0e402 100644
>> --- a/mm/page_alloc.c
>> +++ b/mm/page_alloc.c
>> @@ -68,6 +68,7 @@
>>  #include <linux/lockdep.h>
>>  #include <linux/nmi.h>
>>  #include <linux/psi.h>
>> +#include <linux/page_hinting.h>
>>
>>  #include <asm/sections.h>
>>  #include <asm/tlbflush.h>
>> @@ -873,10 +874,10 @@ compaction_capture(struct capture_control *capc, struct page *page,
>>   * -- nyc
>>   */
>>
>> -static inline void __free_one_page(struct page *page,
>> +inline void __free_one_page(struct page *page,
>>                 unsigned long pfn,
>>                 struct zone *zone, unsigned int order,
>> -               int migratetype)
>> +               int migratetype, bool hint)
>>  {
>>         unsigned long combined_pfn;
>>         unsigned long uninitialized_var(buddy_pfn);
>> @@ -951,6 +952,8 @@ static inline void __free_one_page(struct page *page,
>>  done_merging:
>>         set_page_order(page, order);
>>
>> +       if (hint)
>> +               page_hinting_enqueue(page, order);
> This is a bit early to probably be dealing with the hint. You should
> probably look at moving this down to a spot somewhere after the page
> has been added to the free list. It may not cause any issues with the
> current order setup, but moving after the addition to the free list
> will make it so that you know it is in there when you call this
> function.
I will take a look at this.
>
>>         /*
>>          * If this is not the largest possible page, check if the buddy
>>          * of the next-highest order is free. If it is, it's possible
>> @@ -1262,7 +1265,7 @@ static void free_pcppages_bulk(struct zone *zone, int count,
>>                 if (unlikely(isolated_pageblocks))
>>                         mt = get_pageblock_migratetype(page);
>>
>> -               __free_one_page(page, page_to_pfn(page), zone, 0, mt);
>> +               __free_one_page(page, page_to_pfn(page), zone, 0, mt, true);
>>                 trace_mm_page_pcpu_drain(page, 0, mt);
>>         }
>>         spin_unlock(&zone->lock);
>> @@ -1271,14 +1274,14 @@ static void free_pcppages_bulk(struct zone *zone, int count,
>>  static void free_one_page(struct zone *zone,
>>                                 struct page *page, unsigned long pfn,
>>                                 unsigned int order,
>> -                               int migratetype)
>> +                               int migratetype, bool hint)
>>  {
>>         spin_lock(&zone->lock);
>>         if (unlikely(has_isolate_pageblock(zone) ||
>>                 is_migrate_isolate(migratetype))) {
>>                 migratetype = get_pfnblock_migratetype(page, pfn);
>>         }
>> -       __free_one_page(page, pfn, zone, order, migratetype);
>> +       __free_one_page(page, pfn, zone, order, migratetype, hint);
>>         spin_unlock(&zone->lock);
>>  }
>>
>> @@ -1368,7 +1371,7 @@ static void __free_pages_ok(struct page *page, unsigned int order)
>>         migratetype = get_pfnblock_migratetype(page, pfn);
>>         local_irq_save(flags);
>>         __count_vm_events(PGFREE, 1 << order);
>> -       free_one_page(page_zone(page), page, pfn, order, migratetype);
>> +       free_one_page(page_zone(page), page, pfn, order, migratetype, true);
>>         local_irq_restore(flags);
>>  }
>>
>> @@ -2968,7 +2971,7 @@ static void free_unref_page_commit(struct page *page, unsigned long pfn)
>>          */
>>         if (migratetype >= MIGRATE_PCPTYPES) {
>>                 if (unlikely(is_migrate_isolate(migratetype))) {
>> -                       free_one_page(zone, page, pfn, 0, migratetype);
>> +                       free_one_page(zone, page, pfn, 0, migratetype, true);
>>                         return;
>>                 }
>>                 migratetype = MIGRATE_MOVABLE;
> So it looks like you are using a parameter to identify if the page is
> a hinted page or not. I guess this works but it seems like it is a bit
> intrusive as you are adding an argument to specify that this is a
> specific page type.
Any suggestions on how we could do this in a less intrusive manner?
>
>> diff --git a/mm/page_hinting.c b/mm/page_hinting.c
>> new file mode 100644
>> index 000000000000..7341c6462de2
>> --- /dev/null
>> +++ b/mm/page_hinting.c
>> @@ -0,0 +1,236 @@
>> +// SPDX-License-Identifier: GPL-2.0
>> +/*
>> + * Page hinting support to enable a VM to report the freed pages back
>> + * to the host.
>> + *
>> + * Copyright Red Hat, Inc. 2019
>> + *
>> + * Author(s): Nitesh Narayan Lal <nitesh@redhat.com>
>> + */
>> +
>> +#include <linux/mm.h>
>> +#include <linux/slab.h>
>> +#include <linux/page_hinting.h>
>> +#include <linux/kvm_host.h>
>> +
>> +/*
>> + * struct hinting_bitmap: holds the bitmap pointer which tracks the freed PFNs
>> + * and other required parameters which could help in retrieving the original
>> + * PFN value using the bitmap.
>> + * @bitmap:            Pointer to the bitmap of free PFN.
>> + * @base_pfn:          Starting PFN value for the zone whose bitmap is stored.
>> + * @free_pages:                Tracks the number of free pages of granularity
>> + *                     PAGE_HINTING_MIN_ORDER.
>> + * @nbits:             Indicates the total size of the bitmap in bits allocated
>> + *                     at the time of initialization.
>> + */
>> +struct hinting_bitmap {
>> +       unsigned long *bitmap;
>> +       unsigned long base_pfn;
>> +       atomic_t free_pages;
>> +       unsigned long nbits;
>> +} bm_zone[MAX_NR_ZONES];
>> +
> This ignores NUMA doesn't it? Shouldn't you have support for other NUMA nodes?
I will have to look into this.
>
>> +static void init_hinting_wq(struct work_struct *work);
>> +extern int __isolate_free_page(struct page *page, unsigned int order);
>> +extern void __free_one_page(struct page *page, unsigned long pfn,
>> +                           struct zone *zone, unsigned int order,
>> +                           int migratetype, bool hint);
>> +const struct page_hinting_cb *hcb;
>> +struct work_struct hinting_work;
>> +
>> +static unsigned long find_bitmap_size(struct zone *zone)
>> +{
>> +       unsigned long nbits = ALIGN(zone->spanned_pages,
>> +                           PAGE_HINTING_MIN_ORDER);
>> +
>> +       nbits = nbits >> PAGE_HINTING_MIN_ORDER;
>> +       return nbits;
>> +}
>> +
> This doesn't look right to me. You are trying to do something like a
> DIV_ROUND_UP here, right? If so shouldn't you be aligning to 1 <<
> PAGE_HINTING_MIN_ORDER, instead of just PAGE_HINTING_MIN_ORDER?
> Another option would be to just do DIV_ROUND_UP with the 1 <<
> PAGE_HINTING_MIN_ORDER value.
I will double check this.
>
>> +void page_hinting_enable(const struct page_hinting_cb *callback)
>> +{
>> +       struct zone *zone;
>> +       int idx = 0;
>> +       unsigned long bitmap_size = 0;
>> +
>> +       for_each_populated_zone(zone) {
> The index for this doesn't match up to the index you used to define
> bm_zone. for_each_populated_zone will go through each zone in each
> pgdat. Right now you can only handle one pgdat.
Not sure if I understood this entirely. Can you please explain this in more detail?
>
>> +               spin_lock(&zone->lock);
>> +               bitmap_size = find_bitmap_size(zone);
>> +               bm_zone[idx].bitmap = bitmap_zalloc(bitmap_size, GFP_KERNEL);
>> +               if (!bm_zone[idx].bitmap)
>> +                       return;
>> +               bm_zone[idx].nbits = bitmap_size;
>> +               bm_zone[idx].base_pfn = zone->zone_start_pfn;
>> +               spin_unlock(&zone->lock);
>> +               idx++;
>> +       }
>> +       hcb = callback;
>> +       INIT_WORK(&hinting_work, init_hinting_wq);
>> +}
>> +EXPORT_SYMBOL_GPL(page_hinting_enable);
>> +
>> +void page_hinting_disable(void)
>> +{
>> +       struct zone *zone;
>> +       int idx = 0;
>> +
>> +       cancel_work_sync(&hinting_work);
>> +       hcb = NULL;
>> +       for_each_populated_zone(zone) {
>> +               spin_lock(&zone->lock);
>> +               bitmap_free(bm_zone[idx].bitmap);
>> +               bm_zone[idx].base_pfn = 0;
>> +               bm_zone[idx].nbits = 0;
>> +               atomic_set(&bm_zone[idx].free_pages, 0);
>> +               spin_unlock(&zone->lock);
>> +               idx++;
>> +       }
>> +}
>> +EXPORT_SYMBOL_GPL(page_hinting_disable);
>> +
>> +static unsigned long pfn_to_bit(struct page *page, int zonenum)
>> +{
>> +       unsigned long bitnr;
>> +
>> +       bitnr = (page_to_pfn(page) - bm_zone[zonenum].base_pfn)
>> +                        >> PAGE_HINTING_MIN_ORDER;
>> +       return bitnr;
>> +}
>> +
>> +static void release_buddy_pages(struct list_head *pages)
>> +{
>> +       int mt = 0, zonenum, order;
>> +       struct page *page, *next;
>> +       struct zone *zone;
>> +       unsigned long bitnr;
>> +
>> +       list_for_each_entry_safe(page, next, pages, lru) {
>> +               zonenum = page_zonenum(page);
>> +               zone = page_zone(page);
>> +               bitnr = pfn_to_bit(page, zonenum);
>> +               spin_lock(&zone->lock);
>> +               list_del(&page->lru);
>> +               order = page_private(page);
>> +               set_page_private(page, 0);
>> +               mt = get_pageblock_migratetype(page);
>> +               __free_one_page(page, page_to_pfn(page), zone,
>> +                               order, mt, false);
>> +               spin_unlock(&zone->lock);
>> +       }
>> +}
>> +
>> +static void bm_set_pfn(struct page *page)
>> +{
>> +       unsigned long bitnr = 0;
>> +       int zonenum = page_zonenum(page);
>> +       struct zone *zone = page_zone(page);
>> +
>> +       lockdep_assert_held(&zone->lock);
>> +       bitnr = pfn_to_bit(page, zonenum);
>> +       if (bm_zone[zonenum].bitmap &&
>> +           bitnr < bm_zone[zonenum].nbits &&
>> +           !test_and_set_bit(bitnr, bm_zone[zonenum].bitmap))
>> +               atomic_inc(&bm_zone[zonenum].free_pages);
>> +}
>> +
>> +static void scan_hinting_bitmap(int zonenum, int free_pages)
>> +{
>> +       unsigned long set_bit, start = 0;
>> +       struct page *page;
>> +       struct zone *zone;
>> +       int scanned_pages = 0, ret = 0, order, isolated_cnt = 0;
>> +       LIST_HEAD(isolated_pages);
>> +
>> +       ret = hcb->prepare();
>> +       if (ret < 0)
>> +               return;
>> +       for (;;) {
>> +               ret = 0;
>> +               set_bit = find_next_bit(bm_zone[zonenum].bitmap,
>> +                                       bm_zone[zonenum].nbits, start);
>> +               if (set_bit >= bm_zone[zonenum].nbits)
>> +                       break;
>> +               page = pfn_to_online_page((set_bit << PAGE_HINTING_MIN_ORDER) +
>> +                               bm_zone[zonenum].base_pfn);
>> +               if (!page)
>> +                       continue;
>> +               zone = page_zone(page);
>> +               spin_lock(&zone->lock);
>> +
>> +               if (PageBuddy(page) && page_private(page) >=
>> +                   PAGE_HINTING_MIN_ORDER) {
>> +                       order = page_private(page);
>> +                       ret = __isolate_free_page(page, order);
>> +               }
>> +               clear_bit(set_bit, bm_zone[zonenum].bitmap);
>> +               spin_unlock(&zone->lock);
>> +               if (ret) {
>> +                       /*
>> +                        * restoring page order to use it while releasing
>> +                        * the pages back to the buddy.
>> +                        */
>> +                       set_page_private(page, order);
>> +                       list_add_tail(&page->lru, &isolated_pages);
>> +                       isolated_cnt++;
>> +                       if (isolated_cnt == hcb->max_pages) {
>> +                               hcb->hint_pages(&isolated_pages);
>> +                               release_buddy_pages(&isolated_pages);
>> +                               isolated_cnt = 0;
>> +                       }
>> +               }
>> +               start = set_bit + 1;
>> +               scanned_pages++;
>> +       }
>> +       if (isolated_cnt) {
>> +               hcb->hint_pages(&isolated_pages);
>> +               release_buddy_pages(&isolated_pages);
>> +       }
>> +       hcb->cleanup();
>> +       if (scanned_pages > free_pages)
>> +               atomic_sub((scanned_pages - free_pages),
>> +                          &bm_zone[zonenum].free_pages);
>> +}
>> +
>> +static bool check_hinting_threshold(void)
>> +{
>> +       int zonenum = 0;
>> +
>> +       for (; zonenum < MAX_NR_ZONES; zonenum++) {
>> +               if (atomic_read(&bm_zone[zonenum].free_pages) >=
>> +                               hcb->max_pages)
>> +                       return true;
>> +       }
>> +       return false;
>> +}
>> +
>> +static void init_hinting_wq(struct work_struct *work)
>> +{
>> +       int zonenum = 0, free_pages = 0;
>> +
>> +       for (; zonenum < MAX_NR_ZONES; zonenum++) {
>> +               free_pages = atomic_read(&bm_zone[zonenum].free_pages);
>> +               if (free_pages >= hcb->max_pages) {
>> +                       /* Find a better way to synchronize per zone
>> +                        * free_pages.
>> +                        */
>> +                       atomic_sub(free_pages,
>> +                                  &bm_zone[zonenum].free_pages);
>> +                       scan_hinting_bitmap(zonenum, free_pages);
>> +               }
>> +       }
>> +}
>> +
>> +void page_hinting_enqueue(struct page *page, int order)
>> +{
>> +       if (hcb && order >= PAGE_HINTING_MIN_ORDER)
>> +               bm_set_pfn(page);
>> +       else
>> +               return;
> You could probably flip the logic and save yourself an "else" by just
> doing something like:
> if (!hcb || order < PAGE_HINTING_MIN_ORDER)
>         return;
>
> I think it would also make this more readable.
>
+1
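For reference, the whole enqueue path with the inverted check would then
look roughly like this (untested sketch, reusing the hcb, hinting_work and
check_hinting_threshold() symbols from this patch):

void page_hinting_enqueue(struct page *page, int order)
{
	/* bail out early instead of the set-then-return construct */
	if (!hcb || order < PAGE_HINTING_MIN_ORDER)
		return;

	bm_set_pfn(page);

	if (check_hinting_threshold())
		queue_work_on(smp_processor_id(), system_wq, &hinting_work);
}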
>> +
>> +       if (check_hinting_threshold()) {
>> +               int cpu = smp_processor_id();
>> +
>> +               queue_work_on(cpu, system_wq, &hinting_work);
>> +       }
>> +}
>> --
>> 2.21.0
>>
Nitesh Narayan Lal June 4, 2019, 1:16 p.m. UTC | #4
On 6/3/19 3:57 PM, David Hildenbrand wrote:
> On 03.06.19 19:03, Nitesh Narayan Lal wrote:
>> This patch introduces the core infrastructure for free page hinting in
>> virtual environments. It enables the kernel to track the free pages which
>> can be reported to its hypervisor so that the hypervisor could
>> free and reuse that memory as per its requirement.
>>
>> While the pages are getting processed in the hypervisor (e.g.,
>> via MADV_FREE), the guest must not use them, otherwise, data loss
>> would be possible. To avoid such a situation, these pages are
>> temporarily removed from the buddy. The amount of pages removed
>> temporarily from the buddy is governed by the backend(virtio-balloon
>> in our case).
>>
>> To efficiently identify free pages that can to be hinted to the
>> hypervisor, bitmaps in a coarse granularity are used. Only fairly big
>> chunks are reported to the hypervisor - especially, to not break up THP
>> in the hypervisor - "MAX_ORDER - 2" on x86, and to save space. The bits
>> in the bitmap are an indication whether a page *might* be free, not a
>> guarantee. A new hook after buddy merging sets the bits.
>>
>> Bitmaps are stored per zone, protected by the zone lock. A workqueue
>> asynchronously processes the bitmaps, trying to isolate and report pages
>> that are still free. The backend (virtio-balloon) is responsible for
>> reporting these batched pages to the host synchronously. Once reporting/
>> freeing is complete, isolated pages are returned back to the buddy.
>>
>> There are still various things to look into (e.g., memory hotplug, more
>> efficient locking, possible races when disabling).
>>
>> Signed-off-by: Nitesh Narayan Lal <nitesh@redhat.com>
>> ---
>>  drivers/virtio/Kconfig       |   1 +
>>  include/linux/page_hinting.h |  46 +++++++
>>  mm/Kconfig                   |   6 +
>>  mm/Makefile                  |   2 +
>>  mm/page_alloc.c              |  17 +--
>>  mm/page_hinting.c            | 236 +++++++++++++++++++++++++++++++++++
>>  6 files changed, 301 insertions(+), 7 deletions(-)
>>  create mode 100644 include/linux/page_hinting.h
>>  create mode 100644 mm/page_hinting.c
>>
>> diff --git a/drivers/virtio/Kconfig b/drivers/virtio/Kconfig
>> index 35897649c24f..5a96b7a2ed1e 100644
>> --- a/drivers/virtio/Kconfig
>> +++ b/drivers/virtio/Kconfig
>> @@ -46,6 +46,7 @@ config VIRTIO_BALLOON
>>  	tristate "Virtio balloon driver"
>>  	depends on VIRTIO
>>  	select MEMORY_BALLOON
>> +	select PAGE_HINTING
>>  	---help---
>>  	 This driver supports increasing and decreasing the amount
>>  	 of memory within a KVM guest.
>> diff --git a/include/linux/page_hinting.h b/include/linux/page_hinting.h
>> new file mode 100644
>> index 000000000000..e65188fe1e6b
>> --- /dev/null
>> +++ b/include/linux/page_hinting.h
>> @@ -0,0 +1,46 @@
>> +/* SPDX-License-Identifier: GPL-2.0 */
>> +#ifndef _LINUX_PAGE_HINTING_H
>> +#define _LINUX_PAGE_HINTING_H
>> +
>> +/*
>> + * Minimum page order required for a page to be hinted to the host.
>> + */
>> +#define PAGE_HINTING_MIN_ORDER		(MAX_ORDER - 2)
>> +
>> +/*
>> + * struct page_hinting_cb: holds the callbacks to store, report and cleanup
>> + * isolated pages.
>> + * @prepare:		Callback responsible for allocating an array to hold
>> + *			the isolated pages.
>> + * @hint_pages:		Callback which reports the isolated pages synchornously
>> + *			to the host.
>> + * @cleanup:		Callback to free the the array used for reporting the
>> + *			isolated pages.
>> + * @max_pages:		Maxmimum pages that are going to be hinted to the host
>> + *			at a time of granularity >= PAGE_HINTING_MIN_ORDER.
>> + */
>> +struct page_hinting_cb {
>> +	int (*prepare)(void);
>> +	void (*hint_pages)(struct list_head *list);
>> +	void (*cleanup)(void);
>> +	int max_pages;
> If we allocate the array in virtio-balloon differently (e.g. similar to
> bulk inflation/deflation of pfn's right now), we can most probably get
> rid of prepare() and cleanup(), simplifying the code further.
Yeap, as Alexander suggested. I will go for static allocation and remove
the prepare() and cleanup() callbacks.
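For instance, the interface could shrink to just the report callback and
the batch size, with virtio-balloon keeping a statically sized scratch
array (a rough sketch; the array name and its size below are only
illustrative, they are not part of this series):

struct page_hinting_cb {
	void (*hint_pages)(struct list_head *list);	/* synchronous report */
	int max_pages;					/* batch size */
};

/* virtio-balloon side, filled from hint_pages() before each report */
#define VIRTIO_BALLOON_MAX_HINT_ENTRIES	16
static __virtio64 hint_pfns[VIRTIO_BALLOON_MAX_HINT_ENTRIES];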
>> +};
>> +
>> +#ifdef CONFIG_PAGE_HINTING
>> +void page_hinting_enqueue(struct page *page, int order);
>> +void page_hinting_enable(const struct page_hinting_cb *cb);
>> +void page_hinting_disable(void);
>> +#else
>> +static inline void page_hinting_enqueue(struct page *page, int order)
>> +{
>> +}
>> +
>> +static inline void page_hinting_enable(struct page_hinting_cb *cb)
>> +{
>> +}
>> +
>> +static inline void page_hinting_disable(void)
>> +{
>> +}
>> +#endif
>> +#endif /* _LINUX_PAGE_HINTING_H */
>> diff --git a/mm/Kconfig b/mm/Kconfig
>> index ee8d1f311858..177d858de758 100644
>> --- a/mm/Kconfig
>> +++ b/mm/Kconfig
>> @@ -764,4 +764,10 @@ config GUP_BENCHMARK
>>  config ARCH_HAS_PTE_SPECIAL
>>  	bool
>>  
>> +# PAGE_HINTING will allow the guest to report the free pages to the
>> +# host in regular interval of time.
>> +config PAGE_HINTING
>> +       bool
>> +       def_bool n
>> +       depends on X86_64
>>  endmenu
>> diff --git a/mm/Makefile b/mm/Makefile
>> index ac5e5ba78874..bec456dfee34 100644
>> --- a/mm/Makefile
>> +++ b/mm/Makefile
>> @@ -41,6 +41,7 @@ obj-y			:= filemap.o mempool.o oom_kill.o fadvise.o \
>>  			   interval_tree.o list_lru.o workingset.o \
>>  			   debug.o $(mmu-y)
>>  
>> +
>>  # Give 'page_alloc' its own module-parameter namespace
>>  page-alloc-y := page_alloc.o
>>  page-alloc-$(CONFIG_SHUFFLE_PAGE_ALLOCATOR) += shuffle.o
>> @@ -94,6 +95,7 @@ obj-$(CONFIG_Z3FOLD)	+= z3fold.o
>>  obj-$(CONFIG_GENERIC_EARLY_IOREMAP) += early_ioremap.o
>>  obj-$(CONFIG_CMA)	+= cma.o
>>  obj-$(CONFIG_MEMORY_BALLOON) += balloon_compaction.o
>> +obj-$(CONFIG_PAGE_HINTING) += page_hinting.o
>>  obj-$(CONFIG_PAGE_EXTENSION) += page_ext.o
>>  obj-$(CONFIG_CMA_DEBUGFS) += cma_debug.o
>>  obj-$(CONFIG_USERFAULTFD) += userfaultfd.o
>> diff --git a/mm/page_alloc.c b/mm/page_alloc.c
>> index 3b13d3914176..d12f69e0e402 100644
>> --- a/mm/page_alloc.c
>> +++ b/mm/page_alloc.c
>> @@ -68,6 +68,7 @@
>>  #include <linux/lockdep.h>
>>  #include <linux/nmi.h>
>>  #include <linux/psi.h>
>> +#include <linux/page_hinting.h>
>>  
>>  #include <asm/sections.h>
>>  #include <asm/tlbflush.h>
>> @@ -873,10 +874,10 @@ compaction_capture(struct capture_control *capc, struct page *page,
>>   * -- nyc
>>   */
>>  
>> -static inline void __free_one_page(struct page *page,
>> +inline void __free_one_page(struct page *page,
>>  		unsigned long pfn,
>>  		struct zone *zone, unsigned int order,
>> -		int migratetype)
>> +		int migratetype, bool hint)
>>  {
>>  	unsigned long combined_pfn;
>>  	unsigned long uninitialized_var(buddy_pfn);
>> @@ -951,6 +952,8 @@ static inline void __free_one_page(struct page *page,
>>  done_merging:
>>  	set_page_order(page, order);
>>  
>> +	if (hint)
>> +		page_hinting_enqueue(page, order);
>>  	/*
>>  	 * If this is not the largest possible page, check if the buddy
>>  	 * of the next-highest order is free. If it is, it's possible
>> @@ -1262,7 +1265,7 @@ static void free_pcppages_bulk(struct zone *zone, int count,
>>  		if (unlikely(isolated_pageblocks))
>>  			mt = get_pageblock_migratetype(page);
>>  
>> -		__free_one_page(page, page_to_pfn(page), zone, 0, mt);
>> +		__free_one_page(page, page_to_pfn(page), zone, 0, mt, true);
>>  		trace_mm_page_pcpu_drain(page, 0, mt);
>>  	}
>>  	spin_unlock(&zone->lock);
>> @@ -1271,14 +1274,14 @@ static void free_pcppages_bulk(struct zone *zone, int count,
>>  static void free_one_page(struct zone *zone,
>>  				struct page *page, unsigned long pfn,
>>  				unsigned int order,
>> -				int migratetype)
>> +				int migratetype, bool hint)
>>  {
>>  	spin_lock(&zone->lock);
>>  	if (unlikely(has_isolate_pageblock(zone) ||
>>  		is_migrate_isolate(migratetype))) {
>>  		migratetype = get_pfnblock_migratetype(page, pfn);
>>  	}
>> -	__free_one_page(page, pfn, zone, order, migratetype);
>> +	__free_one_page(page, pfn, zone, order, migratetype, hint);
>>  	spin_unlock(&zone->lock);
>>  }
>>  
>> @@ -1368,7 +1371,7 @@ static void __free_pages_ok(struct page *page, unsigned int order)
>>  	migratetype = get_pfnblock_migratetype(page, pfn);
>>  	local_irq_save(flags);
>>  	__count_vm_events(PGFREE, 1 << order);
>> -	free_one_page(page_zone(page), page, pfn, order, migratetype);
>> +	free_one_page(page_zone(page), page, pfn, order, migratetype, true);
>>  	local_irq_restore(flags);
>>  }
>>  
>> @@ -2968,7 +2971,7 @@ static void free_unref_page_commit(struct page *page, unsigned long pfn)
>>  	 */
>>  	if (migratetype >= MIGRATE_PCPTYPES) {
>>  		if (unlikely(is_migrate_isolate(migratetype))) {
>> -			free_one_page(zone, page, pfn, 0, migratetype);
>> +			free_one_page(zone, page, pfn, 0, migratetype, true);
>>  			return;
>>  		}
>>  		migratetype = MIGRATE_MOVABLE;
>> diff --git a/mm/page_hinting.c b/mm/page_hinting.c
>> new file mode 100644
>> index 000000000000..7341c6462de2
>> --- /dev/null
>> +++ b/mm/page_hinting.c
>> @@ -0,0 +1,236 @@
>> +// SPDX-License-Identifier: GPL-2.0
>> +/*
>> + * Page hinting support to enable a VM to report the freed pages back
>> + * to the host.
>> + *
>> + * Copyright Red Hat, Inc. 2019
>> + *
>> + * Author(s): Nitesh Narayan Lal <nitesh@redhat.com>
>> + */
>> +
>> +#include <linux/mm.h>
>> +#include <linux/slab.h>
>> +#include <linux/page_hinting.h>
>> +#include <linux/kvm_host.h>
>> +
>> +/*
>> + * struct hinting_bitmap: holds the bitmap pointer which tracks the freed PFNs
>> + * and other required parameters which could help in retrieving the original
>> + * PFN value using the bitmap.
>> + * @bitmap:		Pointer to the bitmap of free PFN.
>> + * @base_pfn:		Starting PFN value for the zone whose bitmap is stored.
>> + * @free_pages:		Tracks the number of free pages of granularity
>> + *			PAGE_HINTING_MIN_ORDER.
>> + * @nbits:		Indicates the total size of the bitmap in bits allocated
>> + *			at the time of initialization.
>> + */
>> +struct hinting_bitmap {
>> +	unsigned long *bitmap;
>> +	unsigned long base_pfn;
>> +	atomic_t free_pages;
>> +	unsigned long nbits;
>> +} bm_zone[MAX_NR_ZONES];
>> +
>> +static void init_hinting_wq(struct work_struct *work);
>> +extern int __isolate_free_page(struct page *page, unsigned int order);
>> +extern void __free_one_page(struct page *page, unsigned long pfn,
>> +			    struct zone *zone, unsigned int order,
>> +			    int migratetype, bool hint);
>> +const struct page_hinting_cb *hcb;
>> +struct work_struct hinting_work;
>> +
>> +static unsigned long find_bitmap_size(struct zone *zone)
>> +{
>> +	unsigned long nbits = ALIGN(zone->spanned_pages,
>> +			    PAGE_HINTING_MIN_ORDER);
>> +
>> +	nbits = nbits >> PAGE_HINTING_MIN_ORDER;
>> +	return nbits;
> I think we can simplify this to
>
> return (zone->spanned_pages >> PAGE_HINTING_MIN_ORDER) + 1;
>
I will check this.
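Something along these lines would round up without always adding an extra
bit when spanned_pages is already aligned (sketch only):

static unsigned long find_bitmap_size(struct zone *zone)
{
	/* one bit per chunk of (1 << PAGE_HINTING_MIN_ORDER) pages, rounded up */
	return DIV_ROUND_UP(zone->spanned_pages,
			    1UL << PAGE_HINTING_MIN_ORDER);
}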
>> +}
>> +
>> +void page_hinting_enable(const struct page_hinting_cb *callback)
>> +{
>> +	struct zone *zone;
>> +	int idx = 0;
>> +	unsigned long bitmap_size = 0;
> You should probably protect enabling via a mutex and return -EINVAL or
> similar if we already have a callback set (if we ever have different
> drivers). But this has very little priority :)
I will have to look into this.
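A minimal sketch of what that could look like, assuming the return type is
changed to int so the backend can notice the failure:

static DEFINE_MUTEX(hinting_mutex);

int page_hinting_enable(const struct page_hinting_cb *callback)
{
	int ret = -EBUSY;	/* or -EINVAL, only one backend at a time */

	mutex_lock(&hinting_mutex);
	if (!hcb) {
		/* ... per-zone bitmap allocation as in the patch ... */
		hcb = callback;
		ret = 0;
	}
	mutex_unlock(&hinting_mutex);
	return ret;
}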
>> +
>> +	for_each_populated_zone(zone) {
>> +		spin_lock(&zone->lock);
>> +		bitmap_size = find_bitmap_size(zone);
>> +		bm_zone[idx].bitmap = bitmap_zalloc(bitmap_size, GFP_KERNEL);
>> +		if (!bm_zone[idx].bitmap)
>> +			return;
>> +		bm_zone[idx].nbits = bitmap_size;
>> +		bm_zone[idx].base_pfn = zone->zone_start_pfn;
>> +		spin_unlock(&zone->lock);
>> +		idx++;
>> +	}
>> +	hcb = callback;
>> +	INIT_WORK(&hinting_work, init_hinting_wq);
> There are also possible races when enabling, you will have to take care
> of at one point.
No page will be enqueued until hcb is set.
I can probably move it below INIT_WORK.
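Roughly (sketch of the tail of page_hinting_enable() only):

	/* ... per-zone bitmap setup loop as in the patch ... */

	INIT_WORK(&hinting_work, init_hinting_wq);
	/*
	 * Publish the callback last: page_hinting_enqueue() returns early
	 * while hcb is still NULL, so the work item and the bitmaps are
	 * fully set up before the first bit can be queued.
	 */
	smp_store_release(&hcb, callback);
}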
>> +}
>> +EXPORT_SYMBOL_GPL(page_hinting_enable);
>> +
>> +void page_hinting_disable(void)
>> +{
>> +	struct zone *zone;
>> +	int idx = 0;
>> +
>> +	cancel_work_sync(&hinting_work);
>> +	hcb = NULL;
>> +	for_each_populated_zone(zone) {
>> +		spin_lock(&zone->lock);
>> +		bitmap_free(bm_zone[idx].bitmap);
>> +		bm_zone[idx].base_pfn = 0;
>> +		bm_zone[idx].nbits = 0;
>> +		atomic_set(&bm_zone[idx].free_pages, 0);
>> +		spin_unlock(&zone->lock);
>> +		idx++;
>> +	}
>> +}
>> +EXPORT_SYMBOL_GPL(page_hinting_disable);
>> +
>> +static unsigned long pfn_to_bit(struct page *page, int zonenum)
>> +{
>> +	unsigned long bitnr;
>> +
>> +	bitnr = (page_to_pfn(page) - bm_zone[zonenum].base_pfn)
>> +			 >> PAGE_HINTING_MIN_ORDER;
>> +	return bitnr;
>> +}
>> +
>> +static void release_buddy_pages(struct list_head *pages)
> maybe "release_isolated_pages", not sure.
>
>> +{
>> +	int mt = 0, zonenum, order;
>> +	struct page *page, *next;
>> +	struct zone *zone;
>> +	unsigned long bitnr;
>> +
>> +	list_for_each_entry_safe(page, next, pages, lru) {
>> +		zonenum = page_zonenum(page);
>> +		zone = page_zone(page);
>> +		bitnr = pfn_to_bit(page, zonenum);
>> +		spin_lock(&zone->lock);
>> +		list_del(&page->lru);
>> +		order = page_private(page);
>> +		set_page_private(page, 0);
>> +		mt = get_pageblock_migratetype(page);
>> +		__free_one_page(page, page_to_pfn(page), zone,
>> +				order, mt, false);
>> +		spin_unlock(&zone->lock);
>> +	}
>> +}
>> +
>> +static void bm_set_pfn(struct page *page)
>> +{
>> +	unsigned long bitnr = 0;
>> +	int zonenum = page_zonenum(page);
>> +	struct zone *zone = page_zone(page);
>> +
>> +	lockdep_assert_held(&zone->lock);
>> +	bitnr = pfn_to_bit(page, zonenum);
>> +	if (bm_zone[zonenum].bitmap &&
>> +	    bitnr < bm_zone[zonenum].nbits &&
>> +	    !test_and_set_bit(bitnr, bm_zone[zonenum].bitmap))
>> +		atomic_inc(&bm_zone[zonenum].free_pages);
>> +}
>> +
>> +static void scan_hinting_bitmap(int zonenum, int free_pages)
>> +{
>> +	unsigned long set_bit, start = 0;
>> +	struct page *page;
>> +	struct zone *zone;
>> +	int scanned_pages = 0, ret = 0, order, isolated_cnt = 0;
>> +	LIST_HEAD(isolated_pages);
>> +
>> +	ret = hcb->prepare();
>> +	if (ret < 0)
>> +		return;
>> +	for (;;) {
>> +		ret = 0;
>> +		set_bit = find_next_bit(bm_zone[zonenum].bitmap,
>> +					bm_zone[zonenum].nbits, start);
>> +		if (set_bit >= bm_zone[zonenum].nbits)
>> +			break;
>> +		page = pfn_to_online_page((set_bit << PAGE_HINTING_MIN_ORDER) +
>> +				bm_zone[zonenum].base_pfn);
>> +		if (!page)
>> +			continue;
> You are not clearing the bit / decrementing the counter.
>
>> +		zone = page_zone(page);
>> +		spin_lock(&zone->lock);
>> +
>> +		if (PageBuddy(page) && page_private(page) >=
>> +		    PAGE_HINTING_MIN_ORDER) {
>> +			order = page_private(page);
>> +			ret = __isolate_free_page(page, order);
>> +		}
>> +		clear_bit(set_bit, bm_zone[zonenum].bitmap);
>> +		spin_unlock(&zone->lock);
>> +		if (ret) {
>> +			/*
>> +			 * restoring page order to use it while releasing
>> +			 * the pages back to the buddy.
>> +			 */
>> +			set_page_private(page, order);
>> +			list_add_tail(&page->lru, &isolated_pages);
>> +			isolated_cnt++;
>> +			if (isolated_cnt == hcb->max_pages) {
>> +				hcb->hint_pages(&isolated_pages);
>> +				release_buddy_pages(&isolated_pages);
>> +				isolated_cnt = 0;
>> +			}
>> +		}
>> +		start = set_bit + 1;
>> +		scanned_pages++;
>> +	}
>> +	if (isolated_cnt) {
>> +		hcb->hint_pages(&isolated_pages);
>> +		release_buddy_pages(&isolated_pages);
>> +	}
>> +	hcb->cleanup();
>> +	if (scanned_pages > free_pages)
>> +		atomic_sub((scanned_pages - free_pages),
>> +			   &bm_zone[zonenum].free_pages);
> This looks overly complicated. Can't we somehow simply decrement when
> clearing a bit?
>
>> +}
>> +
>> +static bool check_hinting_threshold(void)
>> +{
>> +	int zonenum = 0;
>> +
>> +	for (; zonenum < MAX_NR_ZONES; zonenum++) {
>> +		if (atomic_read(&bm_zone[zonenum].free_pages) >=
>> +				hcb->max_pages)
>> +			return true;
>> +	}
>> +	return false;
>> +}
>> +
>> +static void init_hinting_wq(struct work_struct *work)
>> +{
>> +	int zonenum = 0, free_pages = 0;
>> +
>> +	for (; zonenum < MAX_NR_ZONES; zonenum++) {
>> +		free_pages = atomic_read(&bm_zone[zonenum].free_pages);
>> +		if (free_pages >= hcb->max_pages) {
>> +			/* Find a better way to synchronize per zone
>> +			 * free_pages.
>> +			 */
>> +			atomic_sub(free_pages,
>> +				   &bm_zone[zonenum].free_pages);
> I can't follow yet why we need that information. 
We don't want to enqueue multiple jobs just because we are delaying the
decrementing of free_pages.
I agree, I am not convinced by the approach I have right now myself.
I will try to come up with a better way.
> Wouldn't it be enough
> to just track the number of set bits in the bitmap and start hinting
> depending on that count? (there are false positives, but do we really care?)
In an attempt to minimize the false positives, I might have overly
complicated this part.
I will try to simplify this in the next posting.
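One way to keep the counter in sync could be to pair every clear with a
decrement, e.g. (sketch; bm_clear_bit() is a new helper, not in this
patch):

static void bm_clear_bit(int zonenum, unsigned long bitnr)
{
	/* keep free_pages equal to the number of set bits */
	if (test_and_clear_bit(bitnr, bm_zone[zonenum].bitmap))
		atomic_dec(&bm_zone[zonenum].free_pages);
}

scan_hinting_bitmap() would then call this instead of clear_bit()
(including in the !page continue path), and the scanned_pages fix-up at
the end of the scan could go away.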
>

>> +			scan_hinting_bitmap(zonenum, free_pages);
>> +		}
>> +	}
>> +}
>> +
>> +void page_hinting_enqueue(struct page *page, int order)
>> +{
>> +	if (hcb && order >= PAGE_HINTING_MIN_ORDER)
>> +		bm_set_pfn(page);
>> +	else
>> +		return;
>> +
>> +	if (check_hinting_threshold()) {
>> +		int cpu = smp_processor_id();
>> +
>> +		queue_work_on(cpu, system_wq, &hinting_work);
>> +	}
>> +}
>>
>
Alexander Duyck June 4, 2019, 3:14 p.m. UTC | #5
On Tue, Jun 4, 2019 at 5:55 AM Nitesh Narayan Lal <nitesh@redhat.com> wrote:
>
>
> On 6/3/19 3:04 PM, Alexander Duyck wrote:
> > On Mon, Jun 3, 2019 at 10:04 AM Nitesh Narayan Lal <nitesh@redhat.com> wrote:
> >> This patch introduces the core infrastructure for free page hinting in
> >> virtual environments. It enables the kernel to track the free pages which
> >> can be reported to its hypervisor so that the hypervisor could
> >> free and reuse that memory as per its requirement.
> >>
> >> While the pages are getting processed in the hypervisor (e.g.,
> >> via MADV_FREE), the guest must not use them, otherwise, data loss
> >> would be possible. To avoid such a situation, these pages are
> >> temporarily removed from the buddy. The amount of pages removed
> >> temporarily from the buddy is governed by the backend(virtio-balloon
> >> in our case).
> >>
> >> To efficiently identify free pages that can to be hinted to the
> >> hypervisor, bitmaps in a coarse granularity are used. Only fairly big
> >> chunks are reported to the hypervisor - especially, to not break up THP
> >> in the hypervisor - "MAX_ORDER - 2" on x86, and to save space. The bits
> >> in the bitmap are an indication whether a page *might* be free, not a
> >> guarantee. A new hook after buddy merging sets the bits.
> >>
> >> Bitmaps are stored per zone, protected by the zone lock. A workqueue
> >> asynchronously processes the bitmaps, trying to isolate and report pages
> >> that are still free. The backend (virtio-balloon) is responsible for
> >> reporting these batched pages to the host synchronously. Once reporting/
> >> freeing is complete, isolated pages are returned back to the buddy.
> >>
> >> There are still various things to look into (e.g., memory hotplug, more
> >> efficient locking, possible races when disabling).
> >>
> >> Signed-off-by: Nitesh Narayan Lal <nitesh@redhat.com>
> > So one thing I had thought about, that I don't believe that has been
> > addressed in your solution, is to determine a means to guarantee
> > forward progress. If you have a noisy thread that is allocating and
> > freeing some block of memory repeatedly you will be stuck processing
> > that and cannot get to the other work. Specifically if you have a zone
> > where somebody is just cycling the number of pages needed to fill your
> > hinting queue how do you get around it and get to the data that is
> > actually code instead of getting stuck processing the noise?
>
> It should not matter. As every time the memory threshold is met, entire
> bitmap is scanned and not just a chunk of memory for possible isolation.
> This will guarantee forward progress.

So I think there may still be some issues. I see how you go from the
start to the end, but how do you loop back to the start again as pages
are added? The init_hinting_wq doesn't seem to have a way to get back
to the start again if there is still work to do after you have
completed your pass without queue_work_on firing off another thread.

> > Do you have any idea what the hit rate would be on a system that is on
> > the more active side? From what I can tell you still are effectively
> > just doing a linear search of memory, but you have the bitmap hints to
> > tell what has not been freed recently, however you still don't know
> > that the pages you have bitmap hints for are actually free until you
> > check them.
> >
> >> ---
> >>  drivers/virtio/Kconfig       |   1 +
> >>  include/linux/page_hinting.h |  46 +++++++
> >>  mm/Kconfig                   |   6 +
> >>  mm/Makefile                  |   2 +
> >>  mm/page_alloc.c              |  17 +--
> >>  mm/page_hinting.c            | 236 +++++++++++++++++++++++++++++++++++
> >>  6 files changed, 301 insertions(+), 7 deletions(-)
> >>  create mode 100644 include/linux/page_hinting.h
> >>  create mode 100644 mm/page_hinting.c
> >>
> >> diff --git a/drivers/virtio/Kconfig b/drivers/virtio/Kconfig
> >> index 35897649c24f..5a96b7a2ed1e 100644
> >> --- a/drivers/virtio/Kconfig
> >> +++ b/drivers/virtio/Kconfig
> >> @@ -46,6 +46,7 @@ config VIRTIO_BALLOON
> >>         tristate "Virtio balloon driver"
> >>         depends on VIRTIO
> >>         select MEMORY_BALLOON
> >> +       select PAGE_HINTING
> >>         ---help---
> >>          This driver supports increasing and decreasing the amount
> >>          of memory within a KVM guest.
> >> diff --git a/include/linux/page_hinting.h b/include/linux/page_hinting.h
> >> new file mode 100644
> >> index 000000000000..e65188fe1e6b
> >> --- /dev/null
> >> +++ b/include/linux/page_hinting.h
> >> @@ -0,0 +1,46 @@
> >> +/* SPDX-License-Identifier: GPL-2.0 */
> >> +#ifndef _LINUX_PAGE_HINTING_H
> >> +#define _LINUX_PAGE_HINTING_H
> >> +
> >> +/*
> >> + * Minimum page order required for a page to be hinted to the host.
> >> + */
> >> +#define PAGE_HINTING_MIN_ORDER         (MAX_ORDER - 2)
> >> +
> >> +/*
> >> + * struct page_hinting_cb: holds the callbacks to store, report and cleanup
> >> + * isolated pages.
> >> + * @prepare:           Callback responsible for allocating an array to hold
> >> + *                     the isolated pages.
> >> + * @hint_pages:                Callback which reports the isolated pages synchornously
> >> + *                     to the host.
> >> + * @cleanup:           Callback to free the the array used for reporting the
> >> + *                     isolated pages.
> >> + * @max_pages:         Maxmimum pages that are going to be hinted to the host
> >> + *                     at a time of granularity >= PAGE_HINTING_MIN_ORDER.
> >> + */
> >> +struct page_hinting_cb {
> >> +       int (*prepare)(void);
> >> +       void (*hint_pages)(struct list_head *list);
> >> +       void (*cleanup)(void);
> >> +       int max_pages;
> >> +};
> >> +
> >> +#ifdef CONFIG_PAGE_HINTING
> >> +void page_hinting_enqueue(struct page *page, int order);
> >> +void page_hinting_enable(const struct page_hinting_cb *cb);
> >> +void page_hinting_disable(void);
> >> +#else
> >> +static inline void page_hinting_enqueue(struct page *page, int order)
> >> +{
> >> +}
> >> +
> >> +static inline void page_hinting_enable(struct page_hinting_cb *cb)
> >> +{
> >> +}
> >> +
> >> +static inline void page_hinting_disable(void)
> >> +{
> >> +}
> >> +#endif
> >> +#endif /* _LINUX_PAGE_HINTING_H */
> >> diff --git a/mm/Kconfig b/mm/Kconfig
> >> index ee8d1f311858..177d858de758 100644
> >> --- a/mm/Kconfig
> >> +++ b/mm/Kconfig
> >> @@ -764,4 +764,10 @@ config GUP_BENCHMARK
> >>  config ARCH_HAS_PTE_SPECIAL
> >>         bool
> >>
> >> +# PAGE_HINTING will allow the guest to report the free pages to the
> >> +# host in regular interval of time.
> >> +config PAGE_HINTING
> >> +       bool
> >> +       def_bool n
> >> +       depends on X86_64
> >>  endmenu
> >> diff --git a/mm/Makefile b/mm/Makefile
> >> index ac5e5ba78874..bec456dfee34 100644
> >> --- a/mm/Makefile
> >> +++ b/mm/Makefile
> >> @@ -41,6 +41,7 @@ obj-y                 := filemap.o mempool.o oom_kill.o fadvise.o \
> >>                            interval_tree.o list_lru.o workingset.o \
> >>                            debug.o $(mmu-y)
> >>
> >> +
> >>  # Give 'page_alloc' its own module-parameter namespace
> >>  page-alloc-y := page_alloc.o
> >>  page-alloc-$(CONFIG_SHUFFLE_PAGE_ALLOCATOR) += shuffle.o
> >> @@ -94,6 +95,7 @@ obj-$(CONFIG_Z3FOLD)  += z3fold.o
> >>  obj-$(CONFIG_GENERIC_EARLY_IOREMAP) += early_ioremap.o
> >>  obj-$(CONFIG_CMA)      += cma.o
> >>  obj-$(CONFIG_MEMORY_BALLOON) += balloon_compaction.o
> >> +obj-$(CONFIG_PAGE_HINTING) += page_hinting.o
> >>  obj-$(CONFIG_PAGE_EXTENSION) += page_ext.o
> >>  obj-$(CONFIG_CMA_DEBUGFS) += cma_debug.o
> >>  obj-$(CONFIG_USERFAULTFD) += userfaultfd.o
> >> diff --git a/mm/page_alloc.c b/mm/page_alloc.c
> >> index 3b13d3914176..d12f69e0e402 100644
> >> --- a/mm/page_alloc.c
> >> +++ b/mm/page_alloc.c
> >> @@ -68,6 +68,7 @@
> >>  #include <linux/lockdep.h>
> >>  #include <linux/nmi.h>
> >>  #include <linux/psi.h>
> >> +#include <linux/page_hinting.h>
> >>
> >>  #include <asm/sections.h>
> >>  #include <asm/tlbflush.h>
> >> @@ -873,10 +874,10 @@ compaction_capture(struct capture_control *capc, struct page *page,
> >>   * -- nyc
> >>   */
> >>
> >> -static inline void __free_one_page(struct page *page,
> >> +inline void __free_one_page(struct page *page,
> >>                 unsigned long pfn,
> >>                 struct zone *zone, unsigned int order,
> >> -               int migratetype)
> >> +               int migratetype, bool hint)
> >>  {
> >>         unsigned long combined_pfn;
> >>         unsigned long uninitialized_var(buddy_pfn);
> >> @@ -951,6 +952,8 @@ static inline void __free_one_page(struct page *page,
> >>  done_merging:
> >>         set_page_order(page, order);
> >>
> >> +       if (hint)
> >> +               page_hinting_enqueue(page, order);
> > This is a bit early to probably be dealing with the hint. You should
> > probably look at moving this down to a spot somewhere after the page
> > has been added to the free list. It may not cause any issues with the
> > current order setup, but moving after the addition to the free list
> > will make it so that you know it is in there when you call this
> > function.
> I will take a look at this.
> >
> >>         /*
> >>          * If this is not the largest possible page, check if the buddy
> >>          * of the next-highest order is free. If it is, it's possible
> >> @@ -1262,7 +1265,7 @@ static void free_pcppages_bulk(struct zone *zone, int count,
> >>                 if (unlikely(isolated_pageblocks))
> >>                         mt = get_pageblock_migratetype(page);
> >>
> >> -               __free_one_page(page, page_to_pfn(page), zone, 0, mt);
> >> +               __free_one_page(page, page_to_pfn(page), zone, 0, mt, true);
> >>                 trace_mm_page_pcpu_drain(page, 0, mt);
> >>         }
> >>         spin_unlock(&zone->lock);
> >> @@ -1271,14 +1274,14 @@ static void free_pcppages_bulk(struct zone *zone, int count,
> >>  static void free_one_page(struct zone *zone,
> >>                                 struct page *page, unsigned long pfn,
> >>                                 unsigned int order,
> >> -                               int migratetype)
> >> +                               int migratetype, bool hint)
> >>  {
> >>         spin_lock(&zone->lock);
> >>         if (unlikely(has_isolate_pageblock(zone) ||
> >>                 is_migrate_isolate(migratetype))) {
> >>                 migratetype = get_pfnblock_migratetype(page, pfn);
> >>         }
> >> -       __free_one_page(page, pfn, zone, order, migratetype);
> >> +       __free_one_page(page, pfn, zone, order, migratetype, hint);
> >>         spin_unlock(&zone->lock);
> >>  }
> >>
> >> @@ -1368,7 +1371,7 @@ static void __free_pages_ok(struct page *page, unsigned int order)
> >>         migratetype = get_pfnblock_migratetype(page, pfn);
> >>         local_irq_save(flags);
> >>         __count_vm_events(PGFREE, 1 << order);
> >> -       free_one_page(page_zone(page), page, pfn, order, migratetype);
> >> +       free_one_page(page_zone(page), page, pfn, order, migratetype, true);
> >>         local_irq_restore(flags);
> >>  }
> >>
> >> @@ -2968,7 +2971,7 @@ static void free_unref_page_commit(struct page *page, unsigned long pfn)
> >>          */
> >>         if (migratetype >= MIGRATE_PCPTYPES) {
> >>                 if (unlikely(is_migrate_isolate(migratetype))) {
> >> -                       free_one_page(zone, page, pfn, 0, migratetype);
> >> +                       free_one_page(zone, page, pfn, 0, migratetype, true);
> >>                         return;
> >>                 }
> >>                 migratetype = MIGRATE_MOVABLE;
> > So it looks like you are using a parameter to identify if the page is
> > a hinted page or not. I guess this works but it seems like it is a bit
> > intrusive as you are adding an argument to specify that this is a
> > specific page type.
> Any suggestions on how we could do this in a less intrusive manner?

The quick approach would be to add some piece of metadata somewhere in
the page that you could trigger off of. If you could do that then drop
the need for all these extra checks and instead just not perform the
notification on the pages. I really don't think the addition of the
"Treated" flag was all that invasive, at least within the kernel. It
would allow you to avoid all the changes to free_one_page, and
__free_one_page.
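Just to illustrate the idea (PageHinted()/TestClearPageHinted() are
made-up names here and would need a spare page flag or page_type bit;
this is only a sketch, not the exact mechanism from my series):

	/* release path (release_buddy_pages): mark before freeing */
	SetPageHinted(page);	/* hypothetical marker bit */
	__free_one_page(page, page_to_pfn(page), zone, order, mt);

	/* __free_one_page(), after done_merging:, no extra bool argument */
	set_page_order(page, order);
	if (!TestClearPageHinted(page))
		page_hinting_enqueue(page, order);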

> >
> >> diff --git a/mm/page_hinting.c b/mm/page_hinting.c
> >> new file mode 100644
> >> index 000000000000..7341c6462de2
> >> --- /dev/null
> >> +++ b/mm/page_hinting.c
> >> @@ -0,0 +1,236 @@
> >> +// SPDX-License-Identifier: GPL-2.0
> >> +/*
> >> + * Page hinting support to enable a VM to report the freed pages back
> >> + * to the host.
> >> + *
> >> + * Copyright Red Hat, Inc. 2019
> >> + *
> >> + * Author(s): Nitesh Narayan Lal <nitesh@redhat.com>
> >> + */
> >> +
> >> +#include <linux/mm.h>
> >> +#include <linux/slab.h>
> >> +#include <linux/page_hinting.h>
> >> +#include <linux/kvm_host.h>
> >> +
> >> +/*
> >> + * struct hinting_bitmap: holds the bitmap pointer which tracks the freed PFNs
> >> + * and other required parameters which could help in retrieving the original
> >> + * PFN value using the bitmap.
> >> + * @bitmap:            Pointer to the bitmap of free PFN.
> >> + * @base_pfn:          Starting PFN value for the zone whose bitmap is stored.
> >> + * @free_pages:                Tracks the number of free pages of granularity
> >> + *                     PAGE_HINTING_MIN_ORDER.
> >> + * @nbits:             Indicates the total size of the bitmap in bits allocated
> >> + *                     at the time of initialization.
> >> + */
> >> +struct hinting_bitmap {
> >> +       unsigned long *bitmap;
> >> +       unsigned long base_pfn;
> >> +       atomic_t free_pages;
> >> +       unsigned long nbits;
> >> +} bm_zone[MAX_NR_ZONES];
> >> +
> > This ignores NUMA doesn't it? Shouldn't you have support for other NUMA nodes?
> I will have to look into this.

So it doesn't cause a panic, but with 2 NUMA nodes you are only
hinting on half the memory. I was able to build, test, and verify
this. I had resolved it by simply multiplying MAX_NR_ZONES by
MAX_NUMNODES, and splitting my indices between node and zone.
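Roughly along these lines (sketch; bm_index() is a new helper, not in
either series):

static struct hinting_bitmap bm_zone[MAX_NUMNODES * MAX_NR_ZONES];

/* combined node/zone index so every pgdat gets its own bitmap */
static inline int bm_index(struct page *page)
{
	return page_to_nid(page) * MAX_NR_ZONES + page_zonenum(page);
}

pfn_to_bit(), bm_set_pfn() and the threshold/scan loops then iterate over
MAX_NUMNODES * MAX_NR_ZONES entries instead of MAX_NR_ZONES.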

> >
> >> +static void init_hinting_wq(struct work_struct *work);
> >> +extern int __isolate_free_page(struct page *page, unsigned int order);
> >> +extern void __free_one_page(struct page *page, unsigned long pfn,
> >> +                           struct zone *zone, unsigned int order,
> >> +                           int migratetype, bool hint);
> >> +const struct page_hinting_cb *hcb;
> >> +struct work_struct hinting_work;
> >> +
> >> +static unsigned long find_bitmap_size(struct zone *zone)
> >> +{
> >> +       unsigned long nbits = ALIGN(zone->spanned_pages,
> >> +                           PAGE_HINTING_MIN_ORDER);
> >> +
> >> +       nbits = nbits >> PAGE_HINTING_MIN_ORDER;
> >> +       return nbits;
> >> +}
> >> +
> > This doesn't look right to me. You are trying to do something like a
> > DIV_ROUND_UP here, right? If so shouldn't you be aligning to 1 <<
> > PAGE_HINTING_MIN_ORDER, instead of just PAGE_HINTING_MIN_ORDER?
> > Another option would be to just do DIV_ROUND_UP with the 1 <<
> > PAGE_HINTING_MIN_ORDER value.
> I will double check this.
> >
> >> +void page_hinting_enable(const struct page_hinting_cb *callback)
> >> +{
> >> +       struct zone *zone;
> >> +       int idx = 0;
> >> +       unsigned long bitmap_size = 0;
> >> +
> >> +       for_each_populated_zone(zone) {
> > The index for this doesn't match up to the index you used to define
> > bm_zone. for_each_populated_zone will go through each zone in each
> > pgdat. Right now you can only handle one pgdat.
> Not sure if I understood this entirely. Can you please explain more on this?
> >
> >> +               spin_lock(&zone->lock);
> >> +               bitmap_size = find_bitmap_size(zone);
> >> +               bm_zone[idx].bitmap = bitmap_zalloc(bitmap_size, GFP_KERNEL);
> >> +               if (!bm_zone[idx].bitmap)
> >> +                       return;
> >> +               bm_zone[idx].nbits = bitmap_size;
> >> +               bm_zone[idx].base_pfn = zone->zone_start_pfn;
> >> +               spin_unlock(&zone->lock);
> >> +               idx++;
> >> +       }
> >> +       hcb = callback;
> >> +       INIT_WORK(&hinting_work, init_hinting_wq);
> >> +}
> >> +EXPORT_SYMBOL_GPL(page_hinting_enable);
> >> +
> >> +void page_hinting_disable(void)
> >> +{
> >> +       struct zone *zone;
> >> +       int idx = 0;
> >> +
> >> +       cancel_work_sync(&hinting_work);
> >> +       hcb = NULL;
> >> +       for_each_populated_zone(zone) {
> >> +               spin_lock(&zone->lock);
> >> +               bitmap_free(bm_zone[idx].bitmap);
> >> +               bm_zone[idx].base_pfn = 0;
> >> +               bm_zone[idx].nbits = 0;
> >> +               atomic_set(&bm_zone[idx].free_pages, 0);
> >> +               spin_unlock(&zone->lock);
> >> +               idx++;
> >> +       }
> >> +}
> >> +EXPORT_SYMBOL_GPL(page_hinting_disable);
> >> +
> >> +static unsigned long pfn_to_bit(struct page *page, int zonenum)
> >> +{
> >> +       unsigned long bitnr;
> >> +
> >> +       bitnr = (page_to_pfn(page) - bm_zone[zonenum].base_pfn)
> >> +                        >> PAGE_HINTING_MIN_ORDER;
> >> +       return bitnr;
> >> +}
> >> +
> >> +static void release_buddy_pages(struct list_head *pages)
> >> +{
> >> +       int mt = 0, zonenum, order;
> >> +       struct page *page, *next;
> >> +       struct zone *zone;
> >> +       unsigned long bitnr;
> >> +
> >> +       list_for_each_entry_safe(page, next, pages, lru) {
> >> +               zonenum = page_zonenum(page);
> >> +               zone = page_zone(page);
> >> +               bitnr = pfn_to_bit(page, zonenum);
> >> +               spin_lock(&zone->lock);
> >> +               list_del(&page->lru);
> >> +               order = page_private(page);
> >> +               set_page_private(page, 0);
> >> +               mt = get_pageblock_migratetype(page);
> >> +               __free_one_page(page, page_to_pfn(page), zone,
> >> +                               order, mt, false);
> >> +               spin_unlock(&zone->lock);
> >> +       }
> >> +}
> >> +
> >> +static void bm_set_pfn(struct page *page)
> >> +{
> >> +       unsigned long bitnr = 0;
> >> +       int zonenum = page_zonenum(page);
> >> +       struct zone *zone = page_zone(page);
> >> +
> >> +       lockdep_assert_held(&zone->lock);
> >> +       bitnr = pfn_to_bit(page, zonenum);
> >> +       if (bm_zone[zonenum].bitmap &&
> >> +           bitnr < bm_zone[zonenum].nbits &&
> >> +           !test_and_set_bit(bitnr, bm_zone[zonenum].bitmap))
> >> +               atomic_inc(&bm_zone[zonenum].free_pages);
> >> +}
> >> +
> >> +static void scan_hinting_bitmap(int zonenum, int free_pages)
> >> +{
> >> +       unsigned long set_bit, start = 0;
> >> +       struct page *page;
> >> +       struct zone *zone;
> >> +       int scanned_pages = 0, ret = 0, order, isolated_cnt = 0;
> >> +       LIST_HEAD(isolated_pages);
> >> +
> >> +       ret = hcb->prepare();
> >> +       if (ret < 0)
> >> +               return;
> >> +       for (;;) {
> >> +               ret = 0;
> >> +               set_bit = find_next_bit(bm_zone[zonenum].bitmap,
> >> +                                       bm_zone[zonenum].nbits, start);
> >> +               if (set_bit >= bm_zone[zonenum].nbits)
> >> +                       break;
> >> +               page = pfn_to_online_page((set_bit << PAGE_HINTING_MIN_ORDER) +
> >> +                               bm_zone[zonenum].base_pfn);
> >> +               if (!page)
> >> +                       continue;
> >> +               zone = page_zone(page);
> >> +               spin_lock(&zone->lock);
> >> +
> >> +               if (PageBuddy(page) && page_private(page) >=
> >> +                   PAGE_HINTING_MIN_ORDER) {
> >> +                       order = page_private(page);
> >> +                       ret = __isolate_free_page(page, order);
> >> +               }
> >> +               clear_bit(set_bit, bm_zone[zonenum].bitmap);
> >> +               spin_unlock(&zone->lock);
> >> +               if (ret) {
> >> +                       /*
> >> +                        * restoring page order to use it while releasing
> >> +                        * the pages back to the buddy.
> >> +                        */
> >> +                       set_page_private(page, order);
> >> +                       list_add_tail(&page->lru, &isolated_pages);
> >> +                       isolated_cnt++;
> >> +                       if (isolated_cnt == hcb->max_pages) {
> >> +                               hcb->hint_pages(&isolated_pages);
> >> +                               release_buddy_pages(&isolated_pages);
> >> +                               isolated_cnt = 0;
> >> +                       }
> >> +               }
> >> +               start = set_bit + 1;
> >> +               scanned_pages++;
> >> +       }
> >> +       if (isolated_cnt) {
> >> +               hcb->hint_pages(&isolated_pages);
> >> +               release_buddy_pages(&isolated_pages);
> >> +       }
> >> +       hcb->cleanup();
> >> +       if (scanned_pages > free_pages)
> >> +               atomic_sub((scanned_pages - free_pages),
> >> +                          &bm_zone[zonenum].free_pages);
> >> +}
> >> +
> >> +static bool check_hinting_threshold(void)
> >> +{
> >> +       int zonenum = 0;
> >> +
> >> +       for (; zonenum < MAX_NR_ZONES; zonenum++) {
> >> +               if (atomic_read(&bm_zone[zonenum].free_pages) >=
> >> +                               hcb->max_pages)
> >> +                       return true;
> >> +       }
> >> +       return false;
> >> +}
> >> +
> >> +static void init_hinting_wq(struct work_struct *work)
> >> +{
> >> +       int zonenum = 0, free_pages = 0;
> >> +
> >> +       for (; zonenum < MAX_NR_ZONES; zonenum++) {
> >> +               free_pages = atomic_read(&bm_zone[zonenum].free_pages);
> >> +               if (free_pages >= hcb->max_pages) {
> >> +                       /* Find a better way to synchronize per zone
> >> +                        * free_pages.
> >> +                        */
> >> +                       atomic_sub(free_pages,
> >> +                                  &bm_zone[zonenum].free_pages);
> >> +                       scan_hinting_bitmap(zonenum, free_pages);
> >> +               }
> >> +       }
> >> +}
> >> +
> >> +void page_hinting_enqueue(struct page *page, int order)
> >> +{
> >> +       if (hcb && order >= PAGE_HINTING_MIN_ORDER)
> >> +               bm_set_pfn(page);
> >> +       else
> >> +               return;
> > You could probably flip the logic and save yourself an "else" by just
> > doing something like:
> > if (!hcb || order < PAGE_HINTING_MIN_ORDER)
> >         return;
> >
> > I think it would also make this more readable.
> >
> +1
> >> +
> >> +       if (check_hinting_threshold()) {
> >> +               int cpu = smp_processor_id();
> >> +
> >> +               queue_work_on(cpu, system_wq, &hinting_work);
> >> +       }
> >> +}
> >> --
> >> 2.21.0
> >>
> --
> Regards
> Nitesh
>
Nitesh Narayan Lal June 4, 2019, 4:07 p.m. UTC | #6
On 6/4/19 11:14 AM, Alexander Duyck wrote:
> On Tue, Jun 4, 2019 at 5:55 AM Nitesh Narayan Lal <nitesh@redhat.com> wrote:
>>
>> On 6/3/19 3:04 PM, Alexander Duyck wrote:
>>> On Mon, Jun 3, 2019 at 10:04 AM Nitesh Narayan Lal <nitesh@redhat.com> wrote:
>>>> This patch introduces the core infrastructure for free page hinting in
>>>> virtual environments. It enables the kernel to track the free pages which
>>>> can be reported to its hypervisor so that the hypervisor could
>>>> free and reuse that memory as per its requirement.
>>>>
>>>> While the pages are getting processed in the hypervisor (e.g.,
>>>> via MADV_FREE), the guest must not use them, otherwise, data loss
>>>> would be possible. To avoid such a situation, these pages are
>>>> temporarily removed from the buddy. The amount of pages removed
>>>> temporarily from the buddy is governed by the backend(virtio-balloon
>>>> in our case).
>>>>
>>>> To efficiently identify free pages that can to be hinted to the
>>>> hypervisor, bitmaps in a coarse granularity are used. Only fairly big
>>>> chunks are reported to the hypervisor - especially, to not break up THP
>>>> in the hypervisor - "MAX_ORDER - 2" on x86, and to save space. The bits
>>>> in the bitmap are an indication whether a page *might* be free, not a
>>>> guarantee. A new hook after buddy merging sets the bits.
>>>>
>>>> Bitmaps are stored per zone, protected by the zone lock. A workqueue
>>>> asynchronously processes the bitmaps, trying to isolate and report pages
>>>> that are still free. The backend (virtio-balloon) is responsible for
>>>> reporting these batched pages to the host synchronously. Once reporting/
>>>> freeing is complete, isolated pages are returned back to the buddy.
>>>>
>>>> There are still various things to look into (e.g., memory hotplug, more
>>>> efficient locking, possible races when disabling).
>>>>
>>>> Signed-off-by: Nitesh Narayan Lal <nitesh@redhat.com>
>>> So one thing I had thought about, that I don't believe that has been
>>> addressed in your solution, is to determine a means to guarantee
>>> forward progress. If you have a noisy thread that is allocating and
>>> freeing some block of memory repeatedly you will be stuck processing
>>> that and cannot get to the other work. Specifically if you have a zone
>>> where somebody is just cycling the number of pages needed to fill your
>>> hinting queue how do you get around it and get to the data that is
>>> actually code instead of getting stuck processing the noise?
>> It should not matter. As every time the memory threshold is met, entire
>> bitmap is scanned and not just a chunk of memory for possible isolation.
>> This will guarantee forward progress.
> So I think there may still be some issues. I see how you go from the
> start to the end, but how do you loop back to the start again as pages
> are added? The init_hinting_wq doesn't seem to have a way to get back
> to the start again if there is still work to do after you have
> completed your pass without queue_work_on firing off another thread.
>
That will be taken care of as part of a new job, which will be enqueued
as soon as the free memory count for the respective zone reaches the
threshold.
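If that on its own turns out not to be enough, the work function could
also re-arm itself at the end of a pass, along these lines (untested
sketch based on the code in this patch):

static void init_hinting_wq(struct work_struct *work)
{
	int zonenum, free_pages;

	for (zonenum = 0; zonenum < MAX_NR_ZONES; zonenum++) {
		free_pages = atomic_read(&bm_zone[zonenum].free_pages);
		if (free_pages >= hcb->max_pages) {
			atomic_sub(free_pages, &bm_zone[zonenum].free_pages);
			scan_hinting_bitmap(zonenum, free_pages);
		}
	}

	/* bits set while this pass was running: schedule another pass */
	if (check_hinting_threshold())
		queue_work_on(smp_processor_id(), system_wq, &hinting_work);
}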

>>> Do you have any idea what the hit rate would be on a system that is on
>>> the more active side? From what I can tell you still are effectively
>>> just doing a linear search of memory, but you have the bitmap hints to
>>> tell what has not been freed recently, however you still don't know
>>> that the pages you have bitmap hints for are actually free until you
>>> check them.
>>>
>>>> ---
>>>>  drivers/virtio/Kconfig       |   1 +
>>>>  include/linux/page_hinting.h |  46 +++++++
>>>>  mm/Kconfig                   |   6 +
>>>>  mm/Makefile                  |   2 +
>>>>  mm/page_alloc.c              |  17 +--
>>>>  mm/page_hinting.c            | 236 +++++++++++++++++++++++++++++++++++
>>>>  6 files changed, 301 insertions(+), 7 deletions(-)
>>>>  create mode 100644 include/linux/page_hinting.h
>>>>  create mode 100644 mm/page_hinting.c
>>>>
>>>> diff --git a/drivers/virtio/Kconfig b/drivers/virtio/Kconfig
>>>> index 35897649c24f..5a96b7a2ed1e 100644
>>>> --- a/drivers/virtio/Kconfig
>>>> +++ b/drivers/virtio/Kconfig
>>>> @@ -46,6 +46,7 @@ config VIRTIO_BALLOON
>>>>         tristate "Virtio balloon driver"
>>>>         depends on VIRTIO
>>>>         select MEMORY_BALLOON
>>>> +       select PAGE_HINTING
>>>>         ---help---
>>>>          This driver supports increasing and decreasing the amount
>>>>          of memory within a KVM guest.
>>>> diff --git a/include/linux/page_hinting.h b/include/linux/page_hinting.h
>>>> new file mode 100644
>>>> index 000000000000..e65188fe1e6b
>>>> --- /dev/null
>>>> +++ b/include/linux/page_hinting.h
>>>> @@ -0,0 +1,46 @@
>>>> +/* SPDX-License-Identifier: GPL-2.0 */
>>>> +#ifndef _LINUX_PAGE_HINTING_H
>>>> +#define _LINUX_PAGE_HINTING_H
>>>> +
>>>> +/*
>>>> + * Minimum page order required for a page to be hinted to the host.
>>>> + */
>>>> +#define PAGE_HINTING_MIN_ORDER         (MAX_ORDER - 2)
>>>> +
>>>> +/*
>>>> + * struct page_hinting_cb: holds the callbacks to store, report and cleanup
>>>> + * isolated pages.
>>>> + * @prepare:           Callback responsible for allocating an array to hold
>>>> + *                     the isolated pages.
>>>> + * @hint_pages:                Callback which reports the isolated pages synchornously
>>>> + *                     to the host.
>>>> + * @cleanup:           Callback to free the the array used for reporting the
>>>> + *                     isolated pages.
>>>> + * @max_pages:         Maxmimum pages that are going to be hinted to the host
>>>> + *                     at a time of granularity >= PAGE_HINTING_MIN_ORDER.
>>>> + */
>>>> +struct page_hinting_cb {
>>>> +       int (*prepare)(void);
>>>> +       void (*hint_pages)(struct list_head *list);
>>>> +       void (*cleanup)(void);
>>>> +       int max_pages;
>>>> +};
>>>> +
>>>> +#ifdef CONFIG_PAGE_HINTING
>>>> +void page_hinting_enqueue(struct page *page, int order);
>>>> +void page_hinting_enable(const struct page_hinting_cb *cb);
>>>> +void page_hinting_disable(void);
>>>> +#else
>>>> +static inline void page_hinting_enqueue(struct page *page, int order)
>>>> +{
>>>> +}
>>>> +
>>>> +static inline void page_hinting_enable(struct page_hinting_cb *cb)
>>>> +{
>>>> +}
>>>> +
>>>> +static inline void page_hinting_disable(void)
>>>> +{
>>>> +}
>>>> +#endif
>>>> +#endif /* _LINUX_PAGE_HINTING_H */
>>>> diff --git a/mm/Kconfig b/mm/Kconfig
>>>> index ee8d1f311858..177d858de758 100644
>>>> --- a/mm/Kconfig
>>>> +++ b/mm/Kconfig
>>>> @@ -764,4 +764,10 @@ config GUP_BENCHMARK
>>>>  config ARCH_HAS_PTE_SPECIAL
>>>>         bool
>>>>
>>>> +# PAGE_HINTING will allow the guest to report the free pages to the
>>>> +# host in regular interval of time.
>>>> +config PAGE_HINTING
>>>> +       bool
>>>> +       def_bool n
>>>> +       depends on X86_64
>>>>  endmenu
>>>> diff --git a/mm/Makefile b/mm/Makefile
>>>> index ac5e5ba78874..bec456dfee34 100644
>>>> --- a/mm/Makefile
>>>> +++ b/mm/Makefile
>>>> @@ -41,6 +41,7 @@ obj-y                 := filemap.o mempool.o oom_kill.o fadvise.o \
>>>>                            interval_tree.o list_lru.o workingset.o \
>>>>                            debug.o $(mmu-y)
>>>>
>>>> +
>>>>  # Give 'page_alloc' its own module-parameter namespace
>>>>  page-alloc-y := page_alloc.o
>>>>  page-alloc-$(CONFIG_SHUFFLE_PAGE_ALLOCATOR) += shuffle.o
>>>> @@ -94,6 +95,7 @@ obj-$(CONFIG_Z3FOLD)  += z3fold.o
>>>>  obj-$(CONFIG_GENERIC_EARLY_IOREMAP) += early_ioremap.o
>>>>  obj-$(CONFIG_CMA)      += cma.o
>>>>  obj-$(CONFIG_MEMORY_BALLOON) += balloon_compaction.o
>>>> +obj-$(CONFIG_PAGE_HINTING) += page_hinting.o
>>>>  obj-$(CONFIG_PAGE_EXTENSION) += page_ext.o
>>>>  obj-$(CONFIG_CMA_DEBUGFS) += cma_debug.o
>>>>  obj-$(CONFIG_USERFAULTFD) += userfaultfd.o
>>>> diff --git a/mm/page_alloc.c b/mm/page_alloc.c
>>>> index 3b13d3914176..d12f69e0e402 100644
>>>> --- a/mm/page_alloc.c
>>>> +++ b/mm/page_alloc.c
>>>> @@ -68,6 +68,7 @@
>>>>  #include <linux/lockdep.h>
>>>>  #include <linux/nmi.h>
>>>>  #include <linux/psi.h>
>>>> +#include <linux/page_hinting.h>
>>>>
>>>>  #include <asm/sections.h>
>>>>  #include <asm/tlbflush.h>
>>>> @@ -873,10 +874,10 @@ compaction_capture(struct capture_control *capc, struct page *page,
>>>>   * -- nyc
>>>>   */
>>>>
>>>> -static inline void __free_one_page(struct page *page,
>>>> +inline void __free_one_page(struct page *page,
>>>>                 unsigned long pfn,
>>>>                 struct zone *zone, unsigned int order,
>>>> -               int migratetype)
>>>> +               int migratetype, bool hint)
>>>>  {
>>>>         unsigned long combined_pfn;
>>>>         unsigned long uninitialized_var(buddy_pfn);
>>>> @@ -951,6 +952,8 @@ static inline void __free_one_page(struct page *page,
>>>>  done_merging:
>>>>         set_page_order(page, order);
>>>>
>>>> +       if (hint)
>>>> +               page_hinting_enqueue(page, order);
>>> This is a bit early to probably be dealing with the hint. You should
>>> probably look at moving this down to a spot somewhere after the page
>>> has been added to the free list. It may not cause any issues with the
>>> current order setup, but moving after the addition to the free list
>>> will make it so that you know it is in there when you call this
>>> function.
>> I will take a look at this.
>>>>         /*
>>>>          * If this is not the largest possible page, check if the buddy
>>>>          * of the next-highest order is free. If it is, it's possible
>>>> @@ -1262,7 +1265,7 @@ static void free_pcppages_bulk(struct zone *zone, int count,
>>>>                 if (unlikely(isolated_pageblocks))
>>>>                         mt = get_pageblock_migratetype(page);
>>>>
>>>> -               __free_one_page(page, page_to_pfn(page), zone, 0, mt);
>>>> +               __free_one_page(page, page_to_pfn(page), zone, 0, mt, true);
>>>>                 trace_mm_page_pcpu_drain(page, 0, mt);
>>>>         }
>>>>         spin_unlock(&zone->lock);
>>>> @@ -1271,14 +1274,14 @@ static void free_pcppages_bulk(struct zone *zone, int count,
>>>>  static void free_one_page(struct zone *zone,
>>>>                                 struct page *page, unsigned long pfn,
>>>>                                 unsigned int order,
>>>> -                               int migratetype)
>>>> +                               int migratetype, bool hint)
>>>>  {
>>>>         spin_lock(&zone->lock);
>>>>         if (unlikely(has_isolate_pageblock(zone) ||
>>>>                 is_migrate_isolate(migratetype))) {
>>>>                 migratetype = get_pfnblock_migratetype(page, pfn);
>>>>         }
>>>> -       __free_one_page(page, pfn, zone, order, migratetype);
>>>> +       __free_one_page(page, pfn, zone, order, migratetype, hint);
>>>>         spin_unlock(&zone->lock);
>>>>  }
>>>>
>>>> @@ -1368,7 +1371,7 @@ static void __free_pages_ok(struct page *page, unsigned int order)
>>>>         migratetype = get_pfnblock_migratetype(page, pfn);
>>>>         local_irq_save(flags);
>>>>         __count_vm_events(PGFREE, 1 << order);
>>>> -       free_one_page(page_zone(page), page, pfn, order, migratetype);
>>>> +       free_one_page(page_zone(page), page, pfn, order, migratetype, true);
>>>>         local_irq_restore(flags);
>>>>  }
>>>>
>>>> @@ -2968,7 +2971,7 @@ static void free_unref_page_commit(struct page *page, unsigned long pfn)
>>>>          */
>>>>         if (migratetype >= MIGRATE_PCPTYPES) {
>>>>                 if (unlikely(is_migrate_isolate(migratetype))) {
>>>> -                       free_one_page(zone, page, pfn, 0, migratetype);
>>>> +                       free_one_page(zone, page, pfn, 0, migratetype, true);
>>>>                         return;
>>>>                 }
>>>>                 migratetype = MIGRATE_MOVABLE;
>>> So it looks like you are using a parameter to identify if the page is
>>> a hinted page or not. I guess this works but it seems like it is a bit
>>> intrusive as you are adding an argument to specify that this is a
>>> specific page type.
>> Any suggestions on how we could do this in a less intrusive manner?
> The quick approach would be to add some piece of metadata somewhere in
> the page that you could trigger off of. If you could do that then drop
> the need for all these extra checks and instead just not perform the
> notification on the pages. I really don't think the addition of the
> "Treated" flag was all that invasive, at least within the kernel. It
> would allow you to avoid all the changes to free_one_page, and
> __free_one_page.
>
>>>> diff --git a/mm/page_hinting.c b/mm/page_hinting.c
>>>> new file mode 100644
>>>> index 000000000000..7341c6462de2
>>>> --- /dev/null
>>>> +++ b/mm/page_hinting.c
>>>> @@ -0,0 +1,236 @@
>>>> +// SPDX-License-Identifier: GPL-2.0
>>>> +/*
>>>> + * Page hinting support to enable a VM to report the freed pages back
>>>> + * to the host.
>>>> + *
>>>> + * Copyright Red Hat, Inc. 2019
>>>> + *
>>>> + * Author(s): Nitesh Narayan Lal <nitesh@redhat.com>
>>>> + */
>>>> +
>>>> +#include <linux/mm.h>
>>>> +#include <linux/slab.h>
>>>> +#include <linux/page_hinting.h>
>>>> +#include <linux/kvm_host.h>
>>>> +
>>>> +/*
>>>> + * struct hinting_bitmap: holds the bitmap pointer which tracks the freed PFNs
>>>> + * and other required parameters which could help in retrieving the original
>>>> + * PFN value using the bitmap.
>>>> + * @bitmap:            Pointer to the bitmap of free PFN.
>>>> + * @base_pfn:          Starting PFN value for the zone whose bitmap is stored.
>>>> + * @free_pages:                Tracks the number of free pages of granularity
>>>> + *                     PAGE_HINTING_MIN_ORDER.
>>>> + * @nbits:             Indicates the total size of the bitmap in bits allocated
>>>> + *                     at the time of initialization.
>>>> + */
>>>> +struct hinting_bitmap {
>>>> +       unsigned long *bitmap;
>>>> +       unsigned long base_pfn;
>>>> +       atomic_t free_pages;
>>>> +       unsigned long nbits;
>>>> +} bm_zone[MAX_NR_ZONES];
>>>> +
>>> This ignores NUMA doesn't it? Shouldn't you have support for other NUMA nodes?
>> I will have to look into this.
> So it doesn't cause a panic, but with 2 NUMA nodes you are only
> hinting on half the memory. I was able to build, test,  and verify
> this. I had resolved it by simply multiplying MAX_NR_ZONES by
> MAX_NUMNODES, and splitting my indices between node and zone.
I see, Thanks.
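
A rough sketch of the indexing suggested above, assuming the per-zone array is simply widened to one entry per (node, zone) pair (illustrative only, not part of the posted patch):

	static struct hinting_bitmap bm_zone[MAX_NUMNODES * MAX_NR_ZONES];

	static inline unsigned int bm_index(struct zone *zone)
	{
		/* Split the index between the zone's node and its zone number. */
		return zone_to_nid(zone) * MAX_NR_ZONES + zone_idx(zone);
	}

for_each_populated_zone() then covers every pgdat, and pfn_to_bit()/bm_set_pfn() would look up bm_index(page_zone(page)) instead of page_zonenum(page).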
>>>> +static void init_hinting_wq(struct work_struct *work);
>>>> +extern int __isolate_free_page(struct page *page, unsigned int order);
>>>> +extern void __free_one_page(struct page *page, unsigned long pfn,
>>>> +                           struct zone *zone, unsigned int order,
>>>> +                           int migratetype, bool hint);
>>>> +const struct page_hinting_cb *hcb;
>>>> +struct work_struct hinting_work;
>>>> +
>>>> +static unsigned long find_bitmap_size(struct zone *zone)
>>>> +{
>>>> +       unsigned long nbits = ALIGN(zone->spanned_pages,
>>>> +                           PAGE_HINTING_MIN_ORDER);
>>>> +
>>>> +       nbits = nbits >> PAGE_HINTING_MIN_ORDER;
>>>> +       return nbits;
>>>> +}
>>>> +
>>> This doesn't look right to me. You are trying to do something like a
>>> DIV_ROUND_UP here, right? If so shouldn't you be aligning to 1 <<
>>> PAGE_HINTING_MIN_ORDER, instead of just PAGE_HINTING_MIN_ORDER?
>>> Another option would be to just do DIV_ROUND_UP with the 1 <<
>>> PAGE_HINTING_MIN_ORDER value.
>> I will double check this.
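
Something along the lines suggested above (untested, just to illustrate the rounding):

	static unsigned long find_bitmap_size(struct zone *zone)
	{
		/* One bit per chunk of 1 << PAGE_HINTING_MIN_ORDER pages, rounded up. */
		return DIV_ROUND_UP(zone->spanned_pages,
				    1UL << PAGE_HINTING_MIN_ORDER);
	}

Note that ALIGN() also assumes a power-of-two alignment, which PAGE_HINTING_MIN_ORDER (MAX_ORDER - 2) is not guaranteed to be.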
>>>> +void page_hinting_enable(const struct page_hinting_cb *callback)
>>>> +{
>>>> +       struct zone *zone;
>>>> +       int idx = 0;
>>>> +       unsigned long bitmap_size = 0;
>>>> +
>>>> +       for_each_populated_zone(zone) {
>>> The index for this doesn't match up to the index you used to define
>>> bm_zone. for_each_populated_zone will go through each zone in each
>>> pgdat. Right now you can only handle one pgdat.
>> Not sure if I understood this entirely. Can you please explain more on this?
>>>> +               spin_lock(&zone->lock);
>>>> +               bitmap_size = find_bitmap_size(zone);
>>>> +               bm_zone[idx].bitmap = bitmap_zalloc(bitmap_size, GFP_KERNEL);
>>>> +               if (!bm_zone[idx].bitmap)
>>>> +                       return;
>>>> +               bm_zone[idx].nbits = bitmap_size;
>>>> +               bm_zone[idx].base_pfn = zone->zone_start_pfn;
>>>> +               spin_unlock(&zone->lock);
>>>> +               idx++;
>>>> +       }
>>>> +       hcb = callback;
>>>> +       INIT_WORK(&hinting_work, init_hinting_wq);
>>>> +}
>>>> +EXPORT_SYMBOL_GPL(page_hinting_enable);
>>>> +
>>>> +void page_hinting_disable(void)
>>>> +{
>>>> +       struct zone *zone;
>>>> +       int idx = 0;
>>>> +
>>>> +       cancel_work_sync(&hinting_work);
>>>> +       hcb = NULL;
>>>> +       for_each_populated_zone(zone) {
>>>> +               spin_lock(&zone->lock);
>>>> +               bitmap_free(bm_zone[idx].bitmap);
>>>> +               bm_zone[idx].base_pfn = 0;
>>>> +               bm_zone[idx].nbits = 0;
>>>> +               atomic_set(&bm_zone[idx].free_pages, 0);
>>>> +               spin_unlock(&zone->lock);
>>>> +               idx++;
>>>> +       }
>>>> +}
>>>> +EXPORT_SYMBOL_GPL(page_hinting_disable);
>>>> +
>>>> +static unsigned long pfn_to_bit(struct page *page, int zonenum)
>>>> +{
>>>> +       unsigned long bitnr;
>>>> +
>>>> +       bitnr = (page_to_pfn(page) - bm_zone[zonenum].base_pfn)
>>>> +                        >> PAGE_HINTING_MIN_ORDER;
>>>> +       return bitnr;
>>>> +}
>>>> +
>>>> +static void release_buddy_pages(struct list_head *pages)
>>>> +{
>>>> +       int mt = 0, zonenum, order;
>>>> +       struct page *page, *next;
>>>> +       struct zone *zone;
>>>> +       unsigned long bitnr;
>>>> +
>>>> +       list_for_each_entry_safe(page, next, pages, lru) {
>>>> +               zonenum = page_zonenum(page);
>>>> +               zone = page_zone(page);
>>>> +               bitnr = pfn_to_bit(page, zonenum);
>>>> +               spin_lock(&zone->lock);
>>>> +               list_del(&page->lru);
>>>> +               order = page_private(page);
>>>> +               set_page_private(page, 0);
>>>> +               mt = get_pageblock_migratetype(page);
>>>> +               __free_one_page(page, page_to_pfn(page), zone,
>>>> +                               order, mt, false);
>>>> +               spin_unlock(&zone->lock);
>>>> +       }
>>>> +}
>>>> +
>>>> +static void bm_set_pfn(struct page *page)
>>>> +{
>>>> +       unsigned long bitnr = 0;
>>>> +       int zonenum = page_zonenum(page);
>>>> +       struct zone *zone = page_zone(page);
>>>> +
>>>> +       lockdep_assert_held(&zone->lock);
>>>> +       bitnr = pfn_to_bit(page, zonenum);
>>>> +       if (bm_zone[zonenum].bitmap &&
>>>> +           bitnr < bm_zone[zonenum].nbits &&
>>>> +           !test_and_set_bit(bitnr, bm_zone[zonenum].bitmap))
>>>> +               atomic_inc(&bm_zone[zonenum].free_pages);
>>>> +}
>>>> +
>>>> +static void scan_hinting_bitmap(int zonenum, int free_pages)
>>>> +{
>>>> +       unsigned long set_bit, start = 0;
>>>> +       struct page *page;
>>>> +       struct zone *zone;
>>>> +       int scanned_pages = 0, ret = 0, order, isolated_cnt = 0;
>>>> +       LIST_HEAD(isolated_pages);
>>>> +
>>>> +       ret = hcb->prepare();
>>>> +       if (ret < 0)
>>>> +               return;
>>>> +       for (;;) {
>>>> +               ret = 0;
>>>> +               set_bit = find_next_bit(bm_zone[zonenum].bitmap,
>>>> +                                       bm_zone[zonenum].nbits, start);
>>>> +               if (set_bit >= bm_zone[zonenum].nbits)
>>>> +                       break;
>>>> +               page = pfn_to_online_page((set_bit << PAGE_HINTING_MIN_ORDER) +
>>>> +                               bm_zone[zonenum].base_pfn);
>>>> +               if (!page)
>>>> +                       continue;
>>>> +               zone = page_zone(page);
>>>> +               spin_lock(&zone->lock);
>>>> +
>>>> +               if (PageBuddy(page) && page_private(page) >=
>>>> +                   PAGE_HINTING_MIN_ORDER) {
>>>> +                       order = page_private(page);
>>>> +                       ret = __isolate_free_page(page, order);
>>>> +               }
>>>> +               clear_bit(set_bit, bm_zone[zonenum].bitmap);
>>>> +               spin_unlock(&zone->lock);
>>>> +               if (ret) {
>>>> +                       /*
>>>> +                        * restoring page order to use it while releasing
>>>> +                        * the pages back to the buddy.
>>>> +                        */
>>>> +                       set_page_private(page, order);
>>>> +                       list_add_tail(&page->lru, &isolated_pages);
>>>> +                       isolated_cnt++;
>>>> +                       if (isolated_cnt == hcb->max_pages) {
>>>> +                               hcb->hint_pages(&isolated_pages);
>>>> +                               release_buddy_pages(&isolated_pages);
>>>> +                               isolated_cnt = 0;
>>>> +                       }
>>>> +               }
>>>> +               start = set_bit + 1;
>>>> +               scanned_pages++;
>>>> +       }
>>>> +       if (isolated_cnt) {
>>>> +               hcb->hint_pages(&isolated_pages);
>>>> +               release_buddy_pages(&isolated_pages);
>>>> +       }
>>>> +       hcb->cleanup();
>>>> +       if (scanned_pages > free_pages)
>>>> +               atomic_sub((scanned_pages - free_pages),
>>>> +                          &bm_zone[zonenum].free_pages);
>>>> +}
>>>> +
>>>> +static bool check_hinting_threshold(void)
>>>> +{
>>>> +       int zonenum = 0;
>>>> +
>>>> +       for (; zonenum < MAX_NR_ZONES; zonenum++) {
>>>> +               if (atomic_read(&bm_zone[zonenum].free_pages) >=
>>>> +                               hcb->max_pages)
>>>> +                       return true;
>>>> +       }
>>>> +       return false;
>>>> +}
>>>> +
>>>> +static void init_hinting_wq(struct work_struct *work)
>>>> +{
>>>> +       int zonenum = 0, free_pages = 0;
>>>> +
>>>> +       for (; zonenum < MAX_NR_ZONES; zonenum++) {
>>>> +               free_pages = atomic_read(&bm_zone[zonenum].free_pages);
>>>> +               if (free_pages >= hcb->max_pages) {
>>>> +                       /* Find a better way to synchronize per zone
>>>> +                        * free_pages.
>>>> +                        */
>>>> +                       atomic_sub(free_pages,
>>>> +                                  &bm_zone[zonenum].free_pages);
>>>> +                       scan_hinting_bitmap(zonenum, free_pages);
>>>> +               }
>>>> +       }
>>>> +}
>>>> +
>>>> +void page_hinting_enqueue(struct page *page, int order)
>>>> +{
>>>> +       if (hcb && order >= PAGE_HINTING_MIN_ORDER)
>>>> +               bm_set_pfn(page);
>>>> +       else
>>>> +               return;
>>> You could probably flip the logic and save yourself an "else" by just
>>> doing something like:
>>> if (!hcb || order < PAGE_HINTING_MIN_ORDER)
>>>         return;
>>>
>>> I think it would also make this more readable.
>>>
>> +1
>>>> +
>>>> +       if (check_hinting_threshold()) {
>>>> +               int cpu = smp_processor_id();
>>>> +
>>>> +               queue_work_on(cpu, system_wq, &hinting_work);
>>>> +       }
>>>> +}
>>>> --
>>>> 2.21.0
>>>>
>> --
>> Regards
>> Nitesh
>>
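
Folding in the early return suggested for page_hinting_enqueue() above, the hook would read roughly as follows (a sketch restating the review suggestion, not the posted code):

	void page_hinting_enqueue(struct page *page, int order)
	{
		if (!hcb || order < PAGE_HINTING_MIN_ORDER)
			return;

		bm_set_pfn(page);
		if (check_hinting_threshold())
			queue_work_on(smp_processor_id(), system_wq,
				      &hinting_work);
	}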
Alexander Duyck June 4, 2019, 4:25 p.m. UTC | #7
On Tue, Jun 4, 2019 at 9:08 AM Nitesh Narayan Lal <nitesh@redhat.com> wrote:
>
>
> On 6/4/19 11:14 AM, Alexander Duyck wrote:
> > On Tue, Jun 4, 2019 at 5:55 AM Nitesh Narayan Lal <nitesh@redhat.com> wrote:
> >>
> >> On 6/3/19 3:04 PM, Alexander Duyck wrote:
> >>> On Mon, Jun 3, 2019 at 10:04 AM Nitesh Narayan Lal <nitesh@redhat.com> wrote:
> >>>> This patch introduces the core infrastructure for free page hinting in
> >>>> virtual environments. It enables the kernel to track the free pages which
> >>>> can be reported to its hypervisor so that the hypervisor could
> >>>> free and reuse that memory as per its requirement.
> >>>>
> >>>> While the pages are getting processed in the hypervisor (e.g.,
> >>>> via MADV_FREE), the guest must not use them, otherwise, data loss
> >>>> would be possible. To avoid such a situation, these pages are
> >>>> temporarily removed from the buddy. The amount of pages removed
> >>>> temporarily from the buddy is governed by the backend(virtio-balloon
> >>>> in our case).
> >>>>
> >>>> To efficiently identify free pages that can to be hinted to the
> >>>> hypervisor, bitmaps in a coarse granularity are used. Only fairly big
> >>>> chunks are reported to the hypervisor - especially, to not break up THP
> >>>> in the hypervisor - "MAX_ORDER - 2" on x86, and to save space. The bits
> >>>> in the bitmap are an indication whether a page *might* be free, not a
> >>>> guarantee. A new hook after buddy merging sets the bits.
> >>>>
> >>>> Bitmaps are stored per zone, protected by the zone lock. A workqueue
> >>>> asynchronously processes the bitmaps, trying to isolate and report pages
> >>>> that are still free. The backend (virtio-balloon) is responsible for
> >>>> reporting these batched pages to the host synchronously. Once reporting/
> >>>> freeing is complete, isolated pages are returned back to the buddy.
> >>>>
> >>>> There are still various things to look into (e.g., memory hotplug, more
> >>>> efficient locking, possible races when disabling).
> >>>>
> >>>> Signed-off-by: Nitesh Narayan Lal <nitesh@redhat.com>
> >>> So one thing I had thought about, that I don't believe that has been
> >>> addressed in your solution, is to determine a means to guarantee
> >>> forward progress. If you have a noisy thread that is allocating and
> >>> freeing some block of memory repeatedly you will be stuck processing
> >>> that and cannot get to the other work. Specifically if you have a zone
> >>> where somebody is just cycling the number of pages needed to fill your
> >>> hinting queue how do you get around it and get to the data that is
> >>> actually code instead of getting stuck processing the noise?
> >> It should not matter. As every time the memory threshold is met, entire
> >> bitmap
> >> is scanned and not just a chunk of memory for possible isolation. This
> >> will guarantee
> >> forward progress.
> > So I think there may still be some issues. I see how you go from the
> > start to the end, but how to you loop back to the start again as pages
> > are added? The init_hinting_wq doesn't seem to have a way to get back
> > to the start again if there is still work to do after you have
> > completed your pass without queue_work_on firing off another thread.
> >
> That will be taken care as the part of a new job, which will be
> en-queued as soon
> as the free memory count for the respective zone will reach the threshold.

So does that mean that you have multiple threads all calling
queue_work_on until you get below the threshold? If so it seems like
that would get expensive since that is an atomic test and set
operation that would be hammered until you get below that threshold.
Nitesh Narayan Lal June 4, 2019, 4:42 p.m. UTC | #8
On 6/4/19 12:25 PM, Alexander Duyck wrote:
> On Tue, Jun 4, 2019 at 9:08 AM Nitesh Narayan Lal <nitesh@redhat.com> wrote:
>>
>> On 6/4/19 11:14 AM, Alexander Duyck wrote:
>>> On Tue, Jun 4, 2019 at 5:55 AM Nitesh Narayan Lal <nitesh@redhat.com> wrote:
>>>> On 6/3/19 3:04 PM, Alexander Duyck wrote:
>>>>> On Mon, Jun 3, 2019 at 10:04 AM Nitesh Narayan Lal <nitesh@redhat.com> wrote:
>>>>>> This patch introduces the core infrastructure for free page hinting in
>>>>>> virtual environments. It enables the kernel to track the free pages which
>>>>>> can be reported to its hypervisor so that the hypervisor could
>>>>>> free and reuse that memory as per its requirement.
>>>>>>
>>>>>> While the pages are getting processed in the hypervisor (e.g.,
>>>>>> via MADV_FREE), the guest must not use them, otherwise, data loss
>>>>>> would be possible. To avoid such a situation, these pages are
>>>>>> temporarily removed from the buddy. The amount of pages removed
>>>>>> temporarily from the buddy is governed by the backend(virtio-balloon
>>>>>> in our case).
>>>>>>
>>>>>> To efficiently identify free pages that can to be hinted to the
>>>>>> hypervisor, bitmaps in a coarse granularity are used. Only fairly big
>>>>>> chunks are reported to the hypervisor - especially, to not break up THP
>>>>>> in the hypervisor - "MAX_ORDER - 2" on x86, and to save space. The bits
>>>>>> in the bitmap are an indication whether a page *might* be free, not a
>>>>>> guarantee. A new hook after buddy merging sets the bits.
>>>>>>
>>>>>> Bitmaps are stored per zone, protected by the zone lock. A workqueue
>>>>>> asynchronously processes the bitmaps, trying to isolate and report pages
>>>>>> that are still free. The backend (virtio-balloon) is responsible for
>>>>>> reporting these batched pages to the host synchronously. Once reporting/
>>>>>> freeing is complete, isolated pages are returned back to the buddy.
>>>>>>
>>>>>> There are still various things to look into (e.g., memory hotplug, more
>>>>>> efficient locking, possible races when disabling).
>>>>>>
>>>>>> Signed-off-by: Nitesh Narayan Lal <nitesh@redhat.com>
>>>>> So one thing I had thought about, that I don't believe that has been
>>>>> addressed in your solution, is to determine a means to guarantee
>>>>> forward progress. If you have a noisy thread that is allocating and
>>>>> freeing some block of memory repeatedly you will be stuck processing
>>>>> that and cannot get to the other work. Specifically if you have a zone
>>>>> where somebody is just cycling the number of pages needed to fill your
>>>>> hinting queue how do you get around it and get to the data that is
>>>>> actually code instead of getting stuck processing the noise?
>>>> It should not matter. As every time the memory threshold is met, entire
>>>> bitmap
>>>> is scanned and not just a chunk of memory for possible isolation. This
>>>> will guarantee
>>>> forward progress.
>>> So I think there may still be some issues. I see how you go from the
>>> start to the end, but how to you loop back to the start again as pages
>>> are added? The init_hinting_wq doesn't seem to have a way to get back
>>> to the start again if there is still work to do after you have
>>> completed your pass without queue_work_on firing off another thread.
>>>
>> That will be taken care as the part of a new job, which will be
>> en-queued as soon
>> as the free memory count for the respective zone will reach the threshold.
> So does that mean that you have multiple threads all calling
> queue_work_on until you get below the threshold?
Every time a page of order MAX_ORDER - 2 is added to the buddy, the free
memory count is incremented if the bit is not already set, and its value
is then checked against the threshold.
>  If so it seems like
> that would get expensive since that is an atomic test and set
> operation that would be hammered until you get below that threshold.

I am not sure I understood "until you get below that threshold".
Can you please explain?
test_and_set_bit() will be called every time a page of order MAX_ORDER - 2
is added to the buddy (if it has not already been hinted).
Alexander Duyck June 4, 2019, 5:12 p.m. UTC | #9
On Tue, Jun 4, 2019 at 9:42 AM Nitesh Narayan Lal <nitesh@redhat.com> wrote:
>
>
> On 6/4/19 12:25 PM, Alexander Duyck wrote:
> > On Tue, Jun 4, 2019 at 9:08 AM Nitesh Narayan Lal <nitesh@redhat.com> wrote:
> >>
> >> On 6/4/19 11:14 AM, Alexander Duyck wrote:
> >>> On Tue, Jun 4, 2019 at 5:55 AM Nitesh Narayan Lal <nitesh@redhat.com> wrote:
> >>>> On 6/3/19 3:04 PM, Alexander Duyck wrote:
> >>>>> On Mon, Jun 3, 2019 at 10:04 AM Nitesh Narayan Lal <nitesh@redhat.com> wrote:
> >>>>>> This patch introduces the core infrastructure for free page hinting in
> >>>>>> virtual environments. It enables the kernel to track the free pages which
> >>>>>> can be reported to its hypervisor so that the hypervisor could
> >>>>>> free and reuse that memory as per its requirement.
> >>>>>>
> >>>>>> While the pages are getting processed in the hypervisor (e.g.,
> >>>>>> via MADV_FREE), the guest must not use them, otherwise, data loss
> >>>>>> would be possible. To avoid such a situation, these pages are
> >>>>>> temporarily removed from the buddy. The amount of pages removed
> >>>>>> temporarily from the buddy is governed by the backend(virtio-balloon
> >>>>>> in our case).
> >>>>>>
> >>>>>> To efficiently identify free pages that can to be hinted to the
> >>>>>> hypervisor, bitmaps in a coarse granularity are used. Only fairly big
> >>>>>> chunks are reported to the hypervisor - especially, to not break up THP
> >>>>>> in the hypervisor - "MAX_ORDER - 2" on x86, and to save space. The bits
> >>>>>> in the bitmap are an indication whether a page *might* be free, not a
> >>>>>> guarantee. A new hook after buddy merging sets the bits.
> >>>>>>
> >>>>>> Bitmaps are stored per zone, protected by the zone lock. A workqueue
> >>>>>> asynchronously processes the bitmaps, trying to isolate and report pages
> >>>>>> that are still free. The backend (virtio-balloon) is responsible for
> >>>>>> reporting these batched pages to the host synchronously. Once reporting/
> >>>>>> freeing is complete, isolated pages are returned back to the buddy.
> >>>>>>
> >>>>>> There are still various things to look into (e.g., memory hotplug, more
> >>>>>> efficient locking, possible races when disabling).
> >>>>>>
> >>>>>> Signed-off-by: Nitesh Narayan Lal <nitesh@redhat.com>
> >>>>> So one thing I had thought about, that I don't believe that has been
> >>>>> addressed in your solution, is to determine a means to guarantee
> >>>>> forward progress. If you have a noisy thread that is allocating and
> >>>>> freeing some block of memory repeatedly you will be stuck processing
> >>>>> that and cannot get to the other work. Specifically if you have a zone
> >>>>> where somebody is just cycling the number of pages needed to fill your
> >>>>> hinting queue how do you get around it and get to the data that is
> >>>>> actually code instead of getting stuck processing the noise?
> >>>> It should not matter. As every time the memory threshold is met, entire
> >>>> bitmap
> >>>> is scanned and not just a chunk of memory for possible isolation. This
> >>>> will guarantee
> >>>> forward progress.
> >>> So I think there may still be some issues. I see how you go from the
> >>> start to the end, but how to you loop back to the start again as pages
> >>> are added? The init_hinting_wq doesn't seem to have a way to get back
> >>> to the start again if there is still work to do after you have
> >>> completed your pass without queue_work_on firing off another thread.
> >>>
> >> That will be taken care as the part of a new job, which will be
> >> en-queued as soon
> >> as the free memory count for the respective zone will reach the threshold.
> > So does that mean that you have multiple threads all calling
> > queue_work_on until you get below the threshold?
> Every time a page of order MAX_ORDER - 2 is added to the buddy, free
> memory count will be incremented if the bit is not already set and its
> value will be checked against the threshold.
> >  If so it seems like
> > that would get expensive since that is an atomic test and set
> > operation that would be hammered until you get below that threshold.
>
> Not sure if I understood "until you get below that threshold".
> Can you please explain?
> test_and_set_bit() will be called every time a page with MAX_ORDER -2
> order is added to the buddy. (Not already hinted)

I had overlooked the other paths that are already making use of the
test_and_set_bit(). What I was getting at specifically is that the
WORK_PENDING bit in the work struct is going to be getting hit every
time you add a new page. So it is adding yet another atomic operation
in addition to the increment and test_and_set_bit() that you were
already doing.

Generally you may want to look at reducing how often you have to
perform these atomic operations. For example, one thing you could do
is use something like an atomic_read before your atomic_inc to
determine whether you are transitioning from below the threshold to
above it. Doing something like that could cut down the number of calls
you are making and save some significant CPU cycles.
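
One way to read that suggestion, reusing the names from the posted patch (illustrative only; atomic_inc_return() is used here in place of a separate atomic_read()/atomic_inc() pair):

	static void bm_set_pfn(struct page *page)
	{
		int zonenum = page_zonenum(page);
		unsigned long bitnr = pfn_to_bit(page, zonenum);

		if (!bm_zone[zonenum].bitmap ||
		    bitnr >= bm_zone[zonenum].nbits ||
		    test_and_set_bit(bitnr, bm_zone[zonenum].bitmap))
			return;

		/*
		 * Touch the work_struct (and its WORK_PENDING atomic) only on
		 * the transition across the threshold, not on every free.
		 */
		if (atomic_inc_return(&bm_zone[zonenum].free_pages) ==
		    hcb->max_pages)
			queue_work_on(smp_processor_id(), system_wq,
				      &hinting_work);
	}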
David Hildenbrand June 14, 2019, 7:24 a.m. UTC | #10
On 03.06.19 19:03, Nitesh Narayan Lal wrote:
> This patch introduces the core infrastructure for free page hinting in
> virtual environments. It enables the kernel to track the free pages which
> can be reported to its hypervisor so that the hypervisor could
> free and reuse that memory as per its requirement.
> 
> While the pages are getting processed in the hypervisor (e.g.,
> via MADV_FREE), the guest must not use them, otherwise, data loss
> would be possible. To avoid such a situation, these pages are
> temporarily removed from the buddy. The amount of pages removed
> temporarily from the buddy is governed by the backend(virtio-balloon
> in our case).
> 
> To efficiently identify free pages that can to be hinted to the
> hypervisor, bitmaps in a coarse granularity are used. Only fairly big
> chunks are reported to the hypervisor - especially, to not break up THP
> in the hypervisor - "MAX_ORDER - 2" on x86, and to save space. The bits
> in the bitmap are an indication whether a page *might* be free, not a
> guarantee. A new hook after buddy merging sets the bits.
> 
> Bitmaps are stored per zone, protected by the zone lock. A workqueue
> asynchronously processes the bitmaps, trying to isolate and report pages
> that are still free. The backend (virtio-balloon) is responsible for
> reporting these batched pages to the host synchronously. Once reporting/
> freeing is complete, isolated pages are returned back to the buddy.
> 
> There are still various things to look into (e.g., memory hotplug, more
> efficient locking, possible races when disabling).
> 
> Signed-off-by: Nitesh Narayan Lal <nitesh@redhat.com>
> ---
>  drivers/virtio/Kconfig       |   1 +
>  include/linux/page_hinting.h |  46 +++++++
>  mm/Kconfig                   |   6 +
>  mm/Makefile                  |   2 +
>  mm/page_alloc.c              |  17 +--
>  mm/page_hinting.c            | 236 +++++++++++++++++++++++++++++++++++
>  6 files changed, 301 insertions(+), 7 deletions(-)
>  create mode 100644 include/linux/page_hinting.h
>  create mode 100644 mm/page_hinting.c
> 
> diff --git a/drivers/virtio/Kconfig b/drivers/virtio/Kconfig
> index 35897649c24f..5a96b7a2ed1e 100644
> --- a/drivers/virtio/Kconfig
> +++ b/drivers/virtio/Kconfig
> @@ -46,6 +46,7 @@ config VIRTIO_BALLOON
>  	tristate "Virtio balloon driver"
>  	depends on VIRTIO
>  	select MEMORY_BALLOON
> +	select PAGE_HINTING
>  	---help---
>  	 This driver supports increasing and decreasing the amount
>  	 of memory within a KVM guest.

BTW, this hunk belongs to the virtio-balloon patch.

Patch

diff --git a/drivers/virtio/Kconfig b/drivers/virtio/Kconfig
index 35897649c24f..5a96b7a2ed1e 100644
--- a/drivers/virtio/Kconfig
+++ b/drivers/virtio/Kconfig
@@ -46,6 +46,7 @@  config VIRTIO_BALLOON
 	tristate "Virtio balloon driver"
 	depends on VIRTIO
 	select MEMORY_BALLOON
+	select PAGE_HINTING
 	---help---
 	 This driver supports increasing and decreasing the amount
 	 of memory within a KVM guest.
diff --git a/include/linux/page_hinting.h b/include/linux/page_hinting.h
new file mode 100644
index 000000000000..e65188fe1e6b
--- /dev/null
+++ b/include/linux/page_hinting.h
@@ -0,0 +1,46 @@ 
+/* SPDX-License-Identifier: GPL-2.0 */
+#ifndef _LINUX_PAGE_HINTING_H
+#define _LINUX_PAGE_HINTING_H
+
+/*
+ * Minimum page order required for a page to be hinted to the host.
+ */
+#define PAGE_HINTING_MIN_ORDER		(MAX_ORDER - 2)
+
+/*
+ * struct page_hinting_cb: holds the callbacks to store, report and cleanup
+ * isolated pages.
+ * @prepare:		Callback responsible for allocating an array to hold
+ *			the isolated pages.
+ * @hint_pages:		Callback which reports the isolated pages synchornously
+ *			to the host.
+ * @cleanup:		Callback to free the the array used for reporting the
+ *			isolated pages.
+ * @max_pages:		Maxmimum pages that are going to be hinted to the host
+ *			at a time of granularity >= PAGE_HINTING_MIN_ORDER.
+ */
+struct page_hinting_cb {
+	int (*prepare)(void);
+	void (*hint_pages)(struct list_head *list);
+	void (*cleanup)(void);
+	int max_pages;
+};
+
+#ifdef CONFIG_PAGE_HINTING
+void page_hinting_enqueue(struct page *page, int order);
+void page_hinting_enable(const struct page_hinting_cb *cb);
+void page_hinting_disable(void);
+#else
+static inline void page_hinting_enqueue(struct page *page, int order)
+{
+}
+
+static inline void page_hinting_enable(struct page_hinting_cb *cb)
+{
+}
+
+static inline void page_hinting_disable(void)
+{
+}
+#endif
+#endif /* _LINUX_PAGE_HINTING_H */
diff --git a/mm/Kconfig b/mm/Kconfig
index ee8d1f311858..177d858de758 100644
--- a/mm/Kconfig
+++ b/mm/Kconfig
@@ -764,4 +764,10 @@  config GUP_BENCHMARK
 config ARCH_HAS_PTE_SPECIAL
 	bool
 
+# PAGE_HINTING will allow the guest to report the free pages to the
+# host in regular interval of time.
+config PAGE_HINTING
+       bool
+       def_bool n
+       depends on X86_64
 endmenu
diff --git a/mm/Makefile b/mm/Makefile
index ac5e5ba78874..bec456dfee34 100644
--- a/mm/Makefile
+++ b/mm/Makefile
@@ -41,6 +41,7 @@  obj-y			:= filemap.o mempool.o oom_kill.o fadvise.o \
 			   interval_tree.o list_lru.o workingset.o \
 			   debug.o $(mmu-y)
 
+
 # Give 'page_alloc' its own module-parameter namespace
 page-alloc-y := page_alloc.o
 page-alloc-$(CONFIG_SHUFFLE_PAGE_ALLOCATOR) += shuffle.o
@@ -94,6 +95,7 @@  obj-$(CONFIG_Z3FOLD)	+= z3fold.o
 obj-$(CONFIG_GENERIC_EARLY_IOREMAP) += early_ioremap.o
 obj-$(CONFIG_CMA)	+= cma.o
 obj-$(CONFIG_MEMORY_BALLOON) += balloon_compaction.o
+obj-$(CONFIG_PAGE_HINTING) += page_hinting.o
 obj-$(CONFIG_PAGE_EXTENSION) += page_ext.o
 obj-$(CONFIG_CMA_DEBUGFS) += cma_debug.o
 obj-$(CONFIG_USERFAULTFD) += userfaultfd.o
diff --git a/mm/page_alloc.c b/mm/page_alloc.c
index 3b13d3914176..d12f69e0e402 100644
--- a/mm/page_alloc.c
+++ b/mm/page_alloc.c
@@ -68,6 +68,7 @@ 
 #include <linux/lockdep.h>
 #include <linux/nmi.h>
 #include <linux/psi.h>
+#include <linux/page_hinting.h>
 
 #include <asm/sections.h>
 #include <asm/tlbflush.h>
@@ -873,10 +874,10 @@  compaction_capture(struct capture_control *capc, struct page *page,
  * -- nyc
  */
 
-static inline void __free_one_page(struct page *page,
+inline void __free_one_page(struct page *page,
 		unsigned long pfn,
 		struct zone *zone, unsigned int order,
-		int migratetype)
+		int migratetype, bool hint)
 {
 	unsigned long combined_pfn;
 	unsigned long uninitialized_var(buddy_pfn);
@@ -951,6 +952,8 @@  static inline void __free_one_page(struct page *page,
 done_merging:
 	set_page_order(page, order);
 
+	if (hint)
+		page_hinting_enqueue(page, order);
 	/*
 	 * If this is not the largest possible page, check if the buddy
 	 * of the next-highest order is free. If it is, it's possible
@@ -1262,7 +1265,7 @@  static void free_pcppages_bulk(struct zone *zone, int count,
 		if (unlikely(isolated_pageblocks))
 			mt = get_pageblock_migratetype(page);
 
-		__free_one_page(page, page_to_pfn(page), zone, 0, mt);
+		__free_one_page(page, page_to_pfn(page), zone, 0, mt, true);
 		trace_mm_page_pcpu_drain(page, 0, mt);
 	}
 	spin_unlock(&zone->lock);
@@ -1271,14 +1274,14 @@  static void free_pcppages_bulk(struct zone *zone, int count,
 static void free_one_page(struct zone *zone,
 				struct page *page, unsigned long pfn,
 				unsigned int order,
-				int migratetype)
+				int migratetype, bool hint)
 {
 	spin_lock(&zone->lock);
 	if (unlikely(has_isolate_pageblock(zone) ||
 		is_migrate_isolate(migratetype))) {
 		migratetype = get_pfnblock_migratetype(page, pfn);
 	}
-	__free_one_page(page, pfn, zone, order, migratetype);
+	__free_one_page(page, pfn, zone, order, migratetype, hint);
 	spin_unlock(&zone->lock);
 }
 
@@ -1368,7 +1371,7 @@  static void __free_pages_ok(struct page *page, unsigned int order)
 	migratetype = get_pfnblock_migratetype(page, pfn);
 	local_irq_save(flags);
 	__count_vm_events(PGFREE, 1 << order);
-	free_one_page(page_zone(page), page, pfn, order, migratetype);
+	free_one_page(page_zone(page), page, pfn, order, migratetype, true);
 	local_irq_restore(flags);
 }
 
@@ -2968,7 +2971,7 @@  static void free_unref_page_commit(struct page *page, unsigned long pfn)
 	 */
 	if (migratetype >= MIGRATE_PCPTYPES) {
 		if (unlikely(is_migrate_isolate(migratetype))) {
-			free_one_page(zone, page, pfn, 0, migratetype);
+			free_one_page(zone, page, pfn, 0, migratetype, true);
 			return;
 		}
 		migratetype = MIGRATE_MOVABLE;
diff --git a/mm/page_hinting.c b/mm/page_hinting.c
new file mode 100644
index 000000000000..7341c6462de2
--- /dev/null
+++ b/mm/page_hinting.c
@@ -0,0 +1,236 @@ 
+// SPDX-License-Identifier: GPL-2.0
+/*
+ * Page hinting support to enable a VM to report the freed pages back
+ * to the host.
+ *
+ * Copyright Red Hat, Inc. 2019
+ *
+ * Author(s): Nitesh Narayan Lal <nitesh@redhat.com>
+ */
+
+#include <linux/mm.h>
+#include <linux/slab.h>
+#include <linux/page_hinting.h>
+#include <linux/kvm_host.h>
+
+/*
+ * struct hinting_bitmap: holds the bitmap pointer which tracks the freed PFNs
+ * and other required parameters which could help in retrieving the original
+ * PFN value using the bitmap.
+ * @bitmap:		Pointer to the bitmap of free PFN.
+ * @base_pfn:		Starting PFN value for the zone whose bitmap is stored.
+ * @free_pages:		Tracks the number of free pages of granularity
+ *			PAGE_HINTING_MIN_ORDER.
+ * @nbits:		Indicates the total size of the bitmap in bits allocated
+ *			at the time of initialization.
+ */
+struct hinting_bitmap {
+	unsigned long *bitmap;
+	unsigned long base_pfn;
+	atomic_t free_pages;
+	unsigned long nbits;
+} bm_zone[MAX_NR_ZONES];
+
+static void init_hinting_wq(struct work_struct *work);
+extern int __isolate_free_page(struct page *page, unsigned int order);
+extern void __free_one_page(struct page *page, unsigned long pfn,
+			    struct zone *zone, unsigned int order,
+			    int migratetype, bool hint);
+const struct page_hinting_cb *hcb;
+struct work_struct hinting_work;
+
+static unsigned long find_bitmap_size(struct zone *zone)
+{
+	unsigned long nbits = ALIGN(zone->spanned_pages,
+			    PAGE_HINTING_MIN_ORDER);
+
+	nbits = nbits >> PAGE_HINTING_MIN_ORDER;
+	return nbits;
+}
+
+void page_hinting_enable(const struct page_hinting_cb *callback)
+{
+	struct zone *zone;
+	int idx = 0;
+	unsigned long bitmap_size = 0;
+
+	for_each_populated_zone(zone) {
+		spin_lock(&zone->lock);
+		bitmap_size = find_bitmap_size(zone);
+		bm_zone[idx].bitmap = bitmap_zalloc(bitmap_size, GFP_KERNEL);
+		if (!bm_zone[idx].bitmap)
+			return;
+		bm_zone[idx].nbits = bitmap_size;
+		bm_zone[idx].base_pfn = zone->zone_start_pfn;
+		spin_unlock(&zone->lock);
+		idx++;
+	}
+	hcb = callback;
+	INIT_WORK(&hinting_work, init_hinting_wq);
+}
+EXPORT_SYMBOL_GPL(page_hinting_enable);
+
+void page_hinting_disable(void)
+{
+	struct zone *zone;
+	int idx = 0;
+
+	cancel_work_sync(&hinting_work);
+	hcb = NULL;
+	for_each_populated_zone(zone) {
+		spin_lock(&zone->lock);
+		bitmap_free(bm_zone[idx].bitmap);
+		bm_zone[idx].base_pfn = 0;
+		bm_zone[idx].nbits = 0;
+		atomic_set(&bm_zone[idx].free_pages, 0);
+		spin_unlock(&zone->lock);
+		idx++;
+	}
+}
+EXPORT_SYMBOL_GPL(page_hinting_disable);
+
+static unsigned long pfn_to_bit(struct page *page, int zonenum)
+{
+	unsigned long bitnr;
+
+	bitnr = (page_to_pfn(page) - bm_zone[zonenum].base_pfn)
+			 >> PAGE_HINTING_MIN_ORDER;
+	return bitnr;
+}
+
+static void release_buddy_pages(struct list_head *pages)
+{
+	int mt = 0, zonenum, order;
+	struct page *page, *next;
+	struct zone *zone;
+	unsigned long bitnr;
+
+	list_for_each_entry_safe(page, next, pages, lru) {
+		zonenum = page_zonenum(page);
+		zone = page_zone(page);
+		bitnr = pfn_to_bit(page, zonenum);
+		spin_lock(&zone->lock);
+		list_del(&page->lru);
+		order = page_private(page);
+		set_page_private(page, 0);
+		mt = get_pageblock_migratetype(page);
+		__free_one_page(page, page_to_pfn(page), zone,
+				order, mt, false);
+		spin_unlock(&zone->lock);
+	}
+}
+
+static void bm_set_pfn(struct page *page)
+{
+	unsigned long bitnr = 0;
+	int zonenum = page_zonenum(page);
+	struct zone *zone = page_zone(page);
+
+	lockdep_assert_held(&zone->lock);
+	bitnr = pfn_to_bit(page, zonenum);
+	if (bm_zone[zonenum].bitmap &&
+	    bitnr < bm_zone[zonenum].nbits &&
+	    !test_and_set_bit(bitnr, bm_zone[zonenum].bitmap))
+		atomic_inc(&bm_zone[zonenum].free_pages);
+}
+
+static void scan_hinting_bitmap(int zonenum, int free_pages)
+{
+	unsigned long set_bit, start = 0;
+	struct page *page;
+	struct zone *zone;
+	int scanned_pages = 0, ret = 0, order, isolated_cnt = 0;
+	LIST_HEAD(isolated_pages);
+
+	ret = hcb->prepare();
+	if (ret < 0)
+		return;
+	for (;;) {
+		ret = 0;
+		set_bit = find_next_bit(bm_zone[zonenum].bitmap,
+					bm_zone[zonenum].nbits, start);
+		if (set_bit >= bm_zone[zonenum].nbits)
+			break;
+		page = pfn_to_online_page((set_bit << PAGE_HINTING_MIN_ORDER) +
+				bm_zone[zonenum].base_pfn);
+		if (!page)
+			continue;
+		zone = page_zone(page);
+		spin_lock(&zone->lock);
+
+		if (PageBuddy(page) && page_private(page) >=
+		    PAGE_HINTING_MIN_ORDER) {
+			order = page_private(page);
+			ret = __isolate_free_page(page, order);
+		}
+		clear_bit(set_bit, bm_zone[zonenum].bitmap);
+		spin_unlock(&zone->lock);
+		if (ret) {
+			/*
+			 * restoring page order to use it while releasing
+			 * the pages back to the buddy.
+			 */
+			set_page_private(page, order);
+			list_add_tail(&page->lru, &isolated_pages);
+			isolated_cnt++;
+			if (isolated_cnt == hcb->max_pages) {
+				hcb->hint_pages(&isolated_pages);
+				release_buddy_pages(&isolated_pages);
+				isolated_cnt = 0;
+			}
+		}
+		start = set_bit + 1;
+		scanned_pages++;
+	}
+	if (isolated_cnt) {
+		hcb->hint_pages(&isolated_pages);
+		release_buddy_pages(&isolated_pages);
+	}
+	hcb->cleanup();
+	if (scanned_pages > free_pages)
+		atomic_sub((scanned_pages - free_pages),
+			   &bm_zone[zonenum].free_pages);
+}
+
+static bool check_hinting_threshold(void)
+{
+	int zonenum = 0;
+
+	for (; zonenum < MAX_NR_ZONES; zonenum++) {
+		if (atomic_read(&bm_zone[zonenum].free_pages) >=
+				hcb->max_pages)
+			return true;
+	}
+	return false;
+}
+
+static void init_hinting_wq(struct work_struct *work)
+{
+	int zonenum = 0, free_pages = 0;
+
+	for (; zonenum < MAX_NR_ZONES; zonenum++) {
+		free_pages = atomic_read(&bm_zone[zonenum].free_pages);
+		if (free_pages >= hcb->max_pages) {
+			/* Find a better way to synchronize per zone
+			 * free_pages.
+			 */
+			atomic_sub(free_pages,
+				   &bm_zone[zonenum].free_pages);
+			scan_hinting_bitmap(zonenum, free_pages);
+		}
+	}
+}
+
+void page_hinting_enqueue(struct page *page, int order)
+{
+	if (hcb && order >= PAGE_HINTING_MIN_ORDER)
+		bm_set_pfn(page);
+	else
+		return;
+
+	if (check_hinting_threshold()) {
+		int cpu = smp_processor_id();
+
+		queue_work_on(cpu, system_wq, &hinting_work);
+	}
+}
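
For completeness, this is roughly how a backend would consume the interface above, based only on the struct page_hinting_cb definition in this patch; the callback bodies and the max_pages value are placeholders, and the actual virtio-balloon wiring is in the next patch of the series:

	#include <linux/list.h>
	#include <linux/page_hinting.h>

	static int example_prepare(void)
	{
		/* Allocate storage for up to max_pages isolated pages. */
		return 0;
	}

	static void example_hint_pages(struct list_head *list)
	{
		/* Report the isolated pages to the host synchronously. */
	}

	static void example_cleanup(void)
	{
		/* Free whatever ->prepare() allocated. */
	}

	static const struct page_hinting_cb example_cb = {
		.prepare	= example_prepare,
		.hint_pages	= example_hint_pages,
		.cleanup	= example_cleanup,
		.max_pages	= 16,
	};

	/* Called from the backend's probe/remove paths. */
	static void example_enable(void)
	{
		page_hinting_enable(&example_cb);
	}

	static void example_disable(void)
	{
		page_hinting_disable();
	}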