Message ID | 20190710195158.19640-2-nitesh@redhat.com (mailing list archive)
---|---
State | New, archived
Series | mm: Support for page hinting
On 7/10/19 12:51 PM, Nitesh Narayan Lal wrote: > +struct zone_free_area { > + unsigned long *bitmap; > + unsigned long base_pfn; > + unsigned long end_pfn; > + atomic_t free_pages; > + unsigned long nbits; > +} free_area[MAX_NR_ZONES]; Why do we need an extra data structure. What's wrong with putting per-zone data in ... 'struct zone'? The cover letter claims that it doesn't touch core-mm infrastructure, but if it depends on mechanisms like this, I think that's a very bad thing. To be honest, I'm not sure this series is worth reviewing at this point. It's horribly lightly commented and full of kernel antipatterns lik void func() { if () { ... indent entire logic ... of function } } It has big "TODO"s. It's virtually comment-free. I'm shocked it's at the 11th version and still looking like this. > + > + for (zone_idx = 0; zone_idx < MAX_NR_ZONES; zone_idx++) { > + unsigned long pages = free_area[zone_idx].end_pfn - > + free_area[zone_idx].base_pfn; > + bitmap_size = (pages >> PAGE_HINTING_MIN_ORDER) + 1; > + if (!bitmap_size) > + continue; > + free_area[zone_idx].bitmap = bitmap_zalloc(bitmap_size, > + GFP_KERNEL); This doesn't support sparse zones. We can have zones with massive spanned page sizes, but very few present pages. On those zones, this will exhaust memory for no good reason. Comparing this to Alex's patch set, it's of much lower quality and at a much earlier stage of development. The two sets are not really even comparable right now. This certainly doesn't sell me on (or even really enumerate the deltas in) this approach vs. Alex's.
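[Editorial note: to make the sparse-zone concern above concrete, here is a rough, self-contained calculation. It is not from the patch; the 256 GiB span and 4 GiB present figures are invented, and it assumes 4 KiB base pages with one bit per MAX_ORDER - 2 (order-9, 2 MiB) chunk as on x86. The point is only that the allocation scales with the hole-filled span, not with the memory actually present.]

#include <stdio.h>

int main(void)
{
	/* Hypothetical sparse zone: 256 GiB spanned, only 4 GiB present. */
	unsigned long long spanned_pages = (256ULL << 30) >> 12; /* 4 KiB pages */
	unsigned long long present_pages = (4ULL << 30) >> 12;
	unsigned int min_order = 9;               /* MAX_ORDER - 2 on x86 */

	unsigned long long bits  = (spanned_pages >> min_order) + 1;
	unsigned long long bytes = (bits + 7) / 8;

	printf("bitmap: %llu bits (%llu KiB) for %llu present pages\n",
	       bits, bytes >> 10, present_pages);
	/* ~16 KiB here; modest, but it keeps growing linearly with the span
	 * even when almost none of the spanned range is present. */
	return 0;
}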
On Wed, Jul 10, 2019 at 12:52 PM Nitesh Narayan Lal <nitesh@redhat.com> wrote: > > This patch introduces the core infrastructure for free page hinting in > virtual environments. It enables the kernel to track the free pages which > can be reported to its hypervisor so that the hypervisor could > free and reuse that memory as per its requirement. > > While the pages are getting processed in the hypervisor (e.g., > via MADV_FREE), the guest must not use them, otherwise, data loss > would be possible. To avoid such a situation, these pages are > temporarily removed from the buddy. The amount of pages removed > temporarily from the buddy is governed by the backend(virtio-balloon > in our case). > > To efficiently identify free pages that can to be hinted to the > hypervisor, bitmaps in a coarse granularity are used. Only fairly big > chunks are reported to the hypervisor - especially, to not break up THP > in the hypervisor - "MAX_ORDER - 2" on x86, and to save space. The bits > in the bitmap are an indication whether a page *might* be free, not a > guarantee. A new hook after buddy merging sets the bits. > > Bitmaps are stored per zone, protected by the zone lock. A workqueue > asynchronously processes the bitmaps, trying to isolate and report pages > that are still free. The backend (virtio-balloon) is responsible for > reporting these batched pages to the host synchronously. Once reporting/ > freeing is complete, isolated pages are returned back to the buddy. > > There are still various things to look into (e.g., memory hotplug, more > efficient locking, possible races when disabling). > > Signed-off-by: Nitesh Narayan Lal <nitesh@redhat.com> > --- > include/linux/page_hinting.h | 45 +++++++ > mm/Kconfig | 6 + > mm/Makefile | 1 + > mm/page_alloc.c | 18 +-- > mm/page_hinting.c | 250 +++++++++++++++++++++++++++++++++++ > 5 files changed, 312 insertions(+), 8 deletions(-) > create mode 100644 include/linux/page_hinting.h > create mode 100644 mm/page_hinting.c > > diff --git a/include/linux/page_hinting.h b/include/linux/page_hinting.h > new file mode 100644 > index 000000000000..4900feb796f9 > --- /dev/null > +++ b/include/linux/page_hinting.h > @@ -0,0 +1,45 @@ > +/* SPDX-License-Identifier: GPL-2.0 */ > +#ifndef _LINUX_PAGE_HINTING_H > +#define _LINUX_PAGE_HINTING_H > + > +/* > + * Minimum page order required for a page to be hinted to the host. > + */ > +#define PAGE_HINTING_MIN_ORDER (MAX_ORDER - 2) > + Why use (MAX_ORDER - 2)? Is this just because of the issues I pointed out earlier for is it due to something else? I'm just wondering if this will have an impact on architectures outside of x86 as I had chose pageblock_order which happened to be MAX_ORDER - 2 on x86, but I don't know that the impact of doing that is on other architectures versus the (MAX_ORDER - 2) approach you took here. > +/* > + * struct page_hinting_config: holds the information supplied by the balloon > + * device to page hinting. > + * @hint_pages: Callback which reports the isolated pages > + * synchornously to the host. > + * @max_pages: Maxmimum pages that are going to be hinted to the host > + * at a time of granularity >= PAGE_HINTING_MIN_ORDER. 
> + */ > +struct page_hinting_config { > + void (*hint_pages)(struct list_head *list); > + int max_pages; > +}; > + > +extern int __isolate_free_page(struct page *page, unsigned int order); > +extern void __free_one_page(struct page *page, unsigned long pfn, > + struct zone *zone, unsigned int order, > + int migratetype, bool hint); > +#ifdef CONFIG_PAGE_HINTING > +void page_hinting_enqueue(struct page *page, int order); > +int page_hinting_enable(const struct page_hinting_config *conf); > +void page_hinting_disable(void); > +#else > +static inline void page_hinting_enqueue(struct page *page, int order) > +{ > +} > + > +static inline int page_hinting_enable(const struct page_hinting_config *conf) > +{ > + return -EOPNOTSUPP; > +} > + > +static inline void page_hinting_disable(void) > +{ > +} > +#endif > +#endif /* _LINUX_PAGE_HINTING_H */ > diff --git a/mm/Kconfig b/mm/Kconfig > index f0c76ba47695..e97fab429d9b 100644 > --- a/mm/Kconfig > +++ b/mm/Kconfig > @@ -765,4 +765,10 @@ config GUP_BENCHMARK > config ARCH_HAS_PTE_SPECIAL > bool > > +# PAGE_HINTING will allow the guest to report the free pages to the > +# host in fixed chunks as soon as the threshold is reached. > +config PAGE_HINTING > + bool > + def_bool n > + depends on X86_64 > endmenu If there are no issue with using the term "PAGE_HINTING" I guess I will update my patch set to use that term instead of aeration. > diff --git a/mm/Makefile b/mm/Makefile > index ac5e5ba78874..73be49177656 100644 > --- a/mm/Makefile > +++ b/mm/Makefile > @@ -94,6 +94,7 @@ obj-$(CONFIG_Z3FOLD) += z3fold.o > obj-$(CONFIG_GENERIC_EARLY_IOREMAP) += early_ioremap.o > obj-$(CONFIG_CMA) += cma.o > obj-$(CONFIG_MEMORY_BALLOON) += balloon_compaction.o > +obj-$(CONFIG_PAGE_HINTING) += page_hinting.o > obj-$(CONFIG_PAGE_EXTENSION) += page_ext.o > obj-$(CONFIG_CMA_DEBUGFS) += cma_debug.o > obj-$(CONFIG_USERFAULTFD) += userfaultfd.o > diff --git a/mm/page_alloc.c b/mm/page_alloc.c > index d66bc8abe0af..8a44338bd04e 100644 > --- a/mm/page_alloc.c > +++ b/mm/page_alloc.c > @@ -69,6 +69,7 @@ > #include <linux/lockdep.h> > #include <linux/nmi.h> > #include <linux/psi.h> > +#include <linux/page_hinting.h> > > #include <asm/sections.h> > #include <asm/tlbflush.h> > @@ -874,10 +875,10 @@ compaction_capture(struct capture_control *capc, struct page *page, > * -- nyc > */ > > -static inline void __free_one_page(struct page *page, > +inline void __free_one_page(struct page *page, > unsigned long pfn, > struct zone *zone, unsigned int order, > - int migratetype) > + int migratetype, bool hint) > { > unsigned long combined_pfn; > unsigned long uninitialized_var(buddy_pfn); > @@ -980,7 +981,8 @@ static inline void __free_one_page(struct page *page, > migratetype); > else > add_to_free_area(page, &zone->free_area[order], migratetype); > - > + if (hint) > + page_hinting_enqueue(page, order); > } I'm not sure I am a fan of the way the word "hint" is used here. At first I thought this was supposed to be !hint since I thought hint meant that it was a hinted page, not that we need to record that this page has been freed. Maybe "record" or "report" might be a better word to use here. 
> /* > @@ -1263,7 +1265,7 @@ static void free_pcppages_bulk(struct zone *zone, int count, > if (unlikely(isolated_pageblocks)) > mt = get_pageblock_migratetype(page); > > - __free_one_page(page, page_to_pfn(page), zone, 0, mt); > + __free_one_page(page, page_to_pfn(page), zone, 0, mt, true); > trace_mm_page_pcpu_drain(page, 0, mt); > } > spin_unlock(&zone->lock); > @@ -1272,14 +1274,14 @@ static void free_pcppages_bulk(struct zone *zone, int count, > static void free_one_page(struct zone *zone, > struct page *page, unsigned long pfn, > unsigned int order, > - int migratetype) > + int migratetype, bool hint) > { > spin_lock(&zone->lock); > if (unlikely(has_isolate_pageblock(zone) || > is_migrate_isolate(migratetype))) { > migratetype = get_pfnblock_migratetype(page, pfn); > } > - __free_one_page(page, pfn, zone, order, migratetype); > + __free_one_page(page, pfn, zone, order, migratetype, hint); > spin_unlock(&zone->lock); > } > > @@ -1369,7 +1371,7 @@ static void __free_pages_ok(struct page *page, unsigned int order) > migratetype = get_pfnblock_migratetype(page, pfn); > local_irq_save(flags); > __count_vm_events(PGFREE, 1 << order); > - free_one_page(page_zone(page), page, pfn, order, migratetype); > + free_one_page(page_zone(page), page, pfn, order, migratetype, true); > local_irq_restore(flags); > } > > @@ -2969,7 +2971,7 @@ static void free_unref_page_commit(struct page *page, unsigned long pfn) > */ > if (migratetype >= MIGRATE_PCPTYPES) { > if (unlikely(is_migrate_isolate(migratetype))) { > - free_one_page(zone, page, pfn, 0, migratetype); > + free_one_page(zone, page, pfn, 0, migratetype, true); > return; > } > migratetype = MIGRATE_MOVABLE; > diff --git a/mm/page_hinting.c b/mm/page_hinting.c > new file mode 100644 > index 000000000000..0bfa09f8c3ed > --- /dev/null > +++ b/mm/page_hinting.c > @@ -0,0 +1,250 @@ > +// SPDX-License-Identifier: GPL-2.0 > +/* > + * Page hinting core infrastructure to enable a VM to report free pages to its > + * hypervisor. > + * > + * Copyright Red Hat, Inc. 2019 > + * > + * Author(s): Nitesh Narayan Lal <nitesh@redhat.com> > + */ > + > +#include <linux/mm.h> > +#include <linux/slab.h> > +#include <linux/page_hinting.h> > +#include <linux/kvm_host.h> > + > +/* > + * struct zone_free_area: For a single zone across NUMA nodes, it holds the > + * bitmap pointer to track the free pages and other required parameters > + * used to recover these pages by scanning the bitmap. > + * @bitmap: Pointer to the bitmap in PAGE_HINTING_MIN_ORDER > + * granularity. > + * @base_pfn: Starting PFN value for the zone whose bitmap is stored. > + * @end_pfn: Indicates the last PFN value for the zone. > + * @free_pages: Tracks the number of free pages of granularity > + * PAGE_HINTING_MIN_ORDER. > + * @nbits: Indicates the total size of the bitmap in bits allocated > + * at the time of initialization. > + */ > +struct zone_free_area { > + unsigned long *bitmap; > + unsigned long base_pfn; > + unsigned long end_pfn; > + atomic_t free_pages; > + unsigned long nbits; > +} free_area[MAX_NR_ZONES]; > + You still haven't addressed the NUMA issue I pointed out with v10. You are only able to address the first set of zones with this setup. As such you can end up missing large sections of memory if it is split over multiple nodes. 
> +static void init_hinting_wq(struct work_struct *work); > +static DEFINE_MUTEX(page_hinting_init); > +const struct page_hinting_config *page_hitning_conf; > +struct work_struct hinting_work; > +atomic_t page_hinting_active; > + > +void free_area_cleanup(int nr_zones) > +{ I'm not sure why you are passing nr_zones as an argument here. Won't this always be MAX_NR_ZONES? > + int zone_idx; > + > + for (zone_idx = 0; zone_idx < nr_zones; zone_idx++) { > + bitmap_free(free_area[zone_idx].bitmap); > + free_area[zone_idx].base_pfn = 0; > + free_area[zone_idx].end_pfn = 0; > + free_area[zone_idx].nbits = 0; > + atomic_set(&free_area[zone_idx].free_pages, 0); > + } > +} > + > +int page_hinting_enable(const struct page_hinting_config *conf) > +{ > + unsigned long bitmap_size = 0; > + int zone_idx = 0, ret = -EBUSY; > + struct zone *zone; > + > + mutex_lock(&page_hinting_init); > + if (!page_hitning_conf) { > + for_each_populated_zone(zone) { So for_each_populated_zone will go through all of the NUMA nodes. So if I am not mistaken you will overwrite the free_area values of all the previous nodes with the last node in the system. So if we have a setup that has all the memory in the first node, and none in the second it would effectively disable free page hinting would it not? > + zone_idx = zone_idx(zone); > +#ifdef CONFIG_ZONE_DEVICE > + if (zone_idx == ZONE_DEVICE) > + continue; > +#endif > + spin_lock(&zone->lock); > + if (free_area[zone_idx].base_pfn) { > + free_area[zone_idx].base_pfn = > + min(free_area[zone_idx].base_pfn, > + zone->zone_start_pfn); > + free_area[zone_idx].end_pfn = > + max(free_area[zone_idx].end_pfn, > + zone->zone_start_pfn + > + zone->spanned_pages); > + } else { > + free_area[zone_idx].base_pfn = > + zone->zone_start_pfn; > + free_area[zone_idx].end_pfn = > + zone->zone_start_pfn + > + zone->spanned_pages; > + } > + spin_unlock(&zone->lock); > + } > + > + for (zone_idx = 0; zone_idx < MAX_NR_ZONES; zone_idx++) { > + unsigned long pages = free_area[zone_idx].end_pfn - > + free_area[zone_idx].base_pfn; > + bitmap_size = (pages >> PAGE_HINTING_MIN_ORDER) + 1; > + if (!bitmap_size) > + continue; > + free_area[zone_idx].bitmap = bitmap_zalloc(bitmap_size, > + GFP_KERNEL); > + if (!free_area[zone_idx].bitmap) { > + free_area_cleanup(zone_idx); > + mutex_unlock(&page_hinting_init); > + return -ENOMEM; > + } > + free_area[zone_idx].nbits = bitmap_size; > + } So this is the bit that still needs to address hotplug right? I would imagine you need to reallocate this if the spanned_pages range changes correct? 
> + page_hitning_conf = conf; > + INIT_WORK(&hinting_work, init_hinting_wq); > + ret = 0; > + } > + mutex_unlock(&page_hinting_init); > + return ret; > +} > +EXPORT_SYMBOL_GPL(page_hinting_enable); > + > +void page_hinting_disable(void) > +{ > + cancel_work_sync(&hinting_work); > + page_hitning_conf = NULL; > + free_area_cleanup(MAX_NR_ZONES); > +} > +EXPORT_SYMBOL_GPL(page_hinting_disable); > + > +static unsigned long pfn_to_bit(struct page *page, int zone_idx) > +{ > + unsigned long bitnr; > + > + bitnr = (page_to_pfn(page) - free_area[zone_idx].base_pfn) > + >> PAGE_HINTING_MIN_ORDER; > + return bitnr; > +} > + > +static void release_buddy_pages(struct list_head *pages) > +{ > + int mt = 0, zone_idx, order; > + struct page *page, *next; > + unsigned long bitnr; > + struct zone *zone; > + > + list_for_each_entry_safe(page, next, pages, lru) { > + zone_idx = page_zonenum(page); > + zone = page_zone(page); > + bitnr = pfn_to_bit(page, zone_idx); > + spin_lock(&zone->lock); > + list_del(&page->lru); > + order = page_private(page); > + set_page_private(page, 0); > + mt = get_pageblock_migratetype(page); > + __free_one_page(page, page_to_pfn(page), zone, > + order, mt, false); > + spin_unlock(&zone->lock); > + } > +} > + > +static void bm_set_pfn(struct page *page) > +{ > + struct zone *zone = page_zone(page); > + int zone_idx = page_zonenum(page); > + unsigned long bitnr = 0; > + > + lockdep_assert_held(&zone->lock); > + bitnr = pfn_to_bit(page, zone_idx); > + /* > + * TODO: fix possible underflows. > + */ > + if (free_area[zone_idx].bitmap && > + bitnr < free_area[zone_idx].nbits && > + !test_and_set_bit(bitnr, free_area[zone_idx].bitmap)) > + atomic_inc(&free_area[zone_idx].free_pages); > +} > + > +static void scan_zone_free_area(int zone_idx, int free_pages) > +{ > + int ret = 0, order, isolated_cnt = 0; > + unsigned long set_bit, start = 0; > + LIST_HEAD(isolated_pages); > + struct page *page; > + struct zone *zone; > + > + for (;;) { > + ret = 0; > + set_bit = find_next_bit(free_area[zone_idx].bitmap, > + free_area[zone_idx].nbits, start); > + if (set_bit >= free_area[zone_idx].nbits) > + break; > + page = pfn_to_online_page((set_bit << PAGE_HINTING_MIN_ORDER) + > + free_area[zone_idx].base_pfn); > + if (!page) > + continue; > + zone = page_zone(page); > + spin_lock(&zone->lock); > + > + if (PageBuddy(page) && page_private(page) >= > + PAGE_HINTING_MIN_ORDER) { > + order = page_private(page); > + ret = __isolate_free_page(page, order); > + } > + clear_bit(set_bit, free_area[zone_idx].bitmap); > + atomic_dec(&free_area[zone_idx].free_pages); > + spin_unlock(&zone->lock); > + if (ret) { > + /* > + * restoring page order to use it while releasing > + * the pages back to the buddy. > + */ > + set_page_private(page, order); > + list_add_tail(&page->lru, &isolated_pages); > + isolated_cnt++; > + if (isolated_cnt == page_hitning_conf->max_pages) { > + page_hitning_conf->hint_pages(&isolated_pages); > + release_buddy_pages(&isolated_pages); > + isolated_cnt = 0; > + } > + } > + start = set_bit + 1; > + } > + if (isolated_cnt) { > + page_hitning_conf->hint_pages(&isolated_pages); > + release_buddy_pages(&isolated_pages); > + } > +} > + I really worry that this loop is going to become more expensive as the size of memory increases. For example if we hint on just 16 pages we would have to walk something like 4K bits, 512 longs, if a system had 64G of memory. Have you considered testing with a larger memory footprint to see if it has an impact on performance? 
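[Editorial note: a quick sanity check of the numbers just cited, using my own arithmetic rather than anything from the patch. Assuming 4 KiB base pages and one bit per order-9 (2 MiB) chunk, a 64 GiB guest ends up with roughly 4 KiB of bitmap, i.e. 512 64-bit longs, per zone.]

#include <stdio.h>

int main(void)
{
	/* 64 GiB of guest memory, 4 KiB base pages, one bit per 2 MiB chunk. */
	unsigned long long pfns  = (64ULL << 30) >> 12;
	unsigned long long bits  = pfns >> 9;
	unsigned long long longs = (bits + 63) / 64;

	printf("%llu bits = %llu KiB of bitmap = %llu 64-bit longs\n",
	       bits, bits / 8 / 1024, longs);
	/* 32768 bits = 4 KiB = 512 longs: find_next_bit() may have to sweep
	 * the whole array even when only a handful of bits are set. */
	return 0;
}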
> +static void init_hinting_wq(struct work_struct *work) > +{ > + int zone_idx, free_pages; > + > + atomic_set(&page_hinting_active, 1); > + for (zone_idx = 0; zone_idx < MAX_NR_ZONES; zone_idx++) { > + free_pages = atomic_read(&free_area[zone_idx].free_pages); > + if (free_pages >= page_hitning_conf->max_pages) > + scan_zone_free_area(zone_idx, free_pages); > + } > + atomic_set(&page_hinting_active, 0); > +} > + > +void page_hinting_enqueue(struct page *page, int order) > +{ > + int zone_idx; > + > + if (!page_hitning_conf || order < PAGE_HINTING_MIN_ORDER) > + return; I would think it is going to be expensive to be jumping into this function for every freed page. You should probably have an inline taking care of the order check before you even get here since it would be faster that way. > + > + bm_set_pfn(page); > + if (atomic_read(&page_hinting_active)) > + return; So I would think this piece is racy. Specifically if you set a PFN that is somewhere below the PFN you are currently processing in your scan it is going to remain unset until you have another page freed after the scan is completed. I would worry you can end up with a batch free of memory resulting in a group of pages sitting at the start of your bitmap unhinted. In my patches I resolved this by looping through all of the zones, however your approach is missing the necessary pieces to make that safe as you could end up in a soft lockup with the scanning thread spinning on a noisy system. > + zone_idx = zone_idx(page_zone(page)); > + if (atomic_read(&free_area[zone_idx].free_pages) >= > + page_hitning_conf->max_pages) { > + int cpu = smp_processor_id(); > + > + queue_work_on(cpu, system_wq, &hinting_work); > + } > +}
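[Editorial note: a minimal sketch of the inline-check idea suggested above. Names follow the patch, but __page_hinting_enqueue() is a hypothetical out-of-line helper, not something in the posted series.]

/* include/linux/page_hinting.h -- sketch only */
#ifdef CONFIG_PAGE_HINTING
void __page_hinting_enqueue(struct page *page, int order);	/* hypothetical */

static inline void page_hinting_enqueue(struct page *page, int order)
{
	/* Filter the overwhelmingly common low-order frees here, so the
	 * free path pays only a compare instead of a function call. */
	if (order < PAGE_HINTING_MIN_ORDER)
		return;
	__page_hinting_enqueue(page, order);
}
#endif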
On 7/10/19 4:45 PM, Dave Hansen wrote: > On 7/10/19 12:51 PM, Nitesh Narayan Lal wrote: >> +struct zone_free_area { >> + unsigned long *bitmap; >> + unsigned long base_pfn; >> + unsigned long end_pfn; >> + atomic_t free_pages; >> + unsigned long nbits; >> +} free_area[MAX_NR_ZONES]; > Why do we need an extra data structure. What's wrong with putting > per-zone data in ... 'struct zone'? The cover letter claims that it > doesn't touch core-mm infrastructure, but if it depends on mechanisms > like this, I think that's a very bad thing. > > To be honest, I'm not sure this series is worth reviewing at this point. > It's horribly lightly commented and full of kernel antipatterns lik > > void func() > { > if () { > ... indent entire logic > ... of function > } > } > > It has big "TODO"s. It's virtually comment-free. I'm shocked it's at > the 11th version and still looking like this. One of the reasons for being on v11 was that the entire design has changed a few times. But that's no excuse, I understand what you are saying and I will work on it and improve this. > >> + >> + for (zone_idx = 0; zone_idx < MAX_NR_ZONES; zone_idx++) { >> + unsigned long pages = free_area[zone_idx].end_pfn - >> + free_area[zone_idx].base_pfn; >> + bitmap_size = (pages >> PAGE_HINTING_MIN_ORDER) + 1; >> + if (!bitmap_size) >> + continue; >> + free_area[zone_idx].bitmap = bitmap_zalloc(bitmap_size, >> + GFP_KERNEL); > This doesn't support sparse zones. We can have zones with massive > spanned page sizes, but very few present pages. On those zones, this > will exhaust memory for no good reason. Thanks, I will look into this. > > Comparing this to Alex's patch set, it's of much lower quality and at a > much earlier stage of development. The two sets are not really even > comparable right now. This certainly doesn't sell me on (or even really > enumerate the deltas in) this approach vs. Alex's. >
On 7/10/19 4:45 PM, Dave Hansen wrote: > On 7/10/19 12:51 PM, Nitesh Narayan Lal wrote: >> +struct zone_free_area { >> + unsigned long *bitmap; >> + unsigned long base_pfn; >> + unsigned long end_pfn; >> + atomic_t free_pages; >> + unsigned long nbits; >> +} free_area[MAX_NR_ZONES]; > Why do we need an extra data structure. What's wrong with putting > per-zone data in ... 'struct zone'? Will it be acceptable to add fields in struct zone, when they will only be used by page hinting? > The cover letter claims that it > doesn't touch core-mm infrastructure, but if it depends on mechanisms > like this, I think that's a very bad thing. > > To be honest, I'm not sure this series is worth reviewing at this point. > It's horribly lightly commented and full of kernel antipatterns lik > > void func() > { > if () { > ... indent entire logic > ... of function > } > } I usually run checkpatch to detect such indentation issues. For the patches, I shared it didn't show me any issues. > > It has big "TODO"s. It's virtually comment-free. I'm shocked it's at > the 11th version and still looking like this. > >> + >> + for (zone_idx = 0; zone_idx < MAX_NR_ZONES; zone_idx++) { >> + unsigned long pages = free_area[zone_idx].end_pfn - >> + free_area[zone_idx].base_pfn; >> + bitmap_size = (pages >> PAGE_HINTING_MIN_ORDER) + 1; >> + if (!bitmap_size) >> + continue; >> + free_area[zone_idx].bitmap = bitmap_zalloc(bitmap_size, >> + GFP_KERNEL); > This doesn't support sparse zones. We can have zones with massive > spanned page sizes, but very few present pages. On those zones, this > will exhaust memory for no good reason. > > Comparing this to Alex's patch set, it's of much lower quality and at a > much earlier stage of development. The two sets are not really even > comparable right now. This certainly doesn't sell me on (or even really > enumerate the deltas in) this approach vs. Alex's. >
On 7/11/19 11:25 AM, Nitesh Narayan Lal wrote: > On 7/10/19 4:45 PM, Dave Hansen wrote: >> On 7/10/19 12:51 PM, Nitesh Narayan Lal wrote: >>> +struct zone_free_area { >>> + unsigned long *bitmap; >>> + unsigned long base_pfn; >>> + unsigned long end_pfn; >>> + atomic_t free_pages; >>> + unsigned long nbits; >>> +} free_area[MAX_NR_ZONES]; >> Why do we need an extra data structure. What's wrong with putting >> per-zone data in ... 'struct zone'? > Will it be acceptable to add fields in struct zone, when they will only > be used by page hinting? >> The cover letter claims that it >> doesn't touch core-mm infrastructure, but if it depends on mechanisms >> like this, I think that's a very bad thing. >> >> To be honest, I'm not sure this series is worth reviewing at this point. >> It's horribly lightly commented and full of kernel antipatterns lik >> >> void func() >> { >> if () { >> ... indent entire logic >> ... of function >> } >> } > I usually run checkpatch to detect such indentation issues. For the > patches, I shared it didn't show me any issues. My bad I think I jumped here, I saw what you are referring to here. I will fix these kind of things. >> It has big "TODO"s. It's virtually comment-free. I'm shocked it's at >> the 11th version and still looking like this. >> >>> + >>> + for (zone_idx = 0; zone_idx < MAX_NR_ZONES; zone_idx++) { >>> + unsigned long pages = free_area[zone_idx].end_pfn - >>> + free_area[zone_idx].base_pfn; >>> + bitmap_size = (pages >> PAGE_HINTING_MIN_ORDER) + 1; >>> + if (!bitmap_size) >>> + continue; >>> + free_area[zone_idx].bitmap = bitmap_zalloc(bitmap_size, >>> + GFP_KERNEL); >> This doesn't support sparse zones. We can have zones with massive >> spanned page sizes, but very few present pages. On those zones, this >> will exhaust memory for no good reason. >> >> Comparing this to Alex's patch set, it's of much lower quality and at a >> much earlier stage of development. The two sets are not really even >> comparable right now. This certainly doesn't sell me on (or even really >> enumerate the deltas in) this approach vs. Alex's. >>
On 7/11/19 8:25 AM, Nitesh Narayan Lal wrote: > On 7/10/19 4:45 PM, Dave Hansen wrote: >> On 7/10/19 12:51 PM, Nitesh Narayan Lal wrote: >>> +struct zone_free_area { >>> + unsigned long *bitmap; >>> + unsigned long base_pfn; >>> + unsigned long end_pfn; >>> + atomic_t free_pages; >>> + unsigned long nbits; >>> +} free_area[MAX_NR_ZONES]; >> Why do we need an extra data structure. What's wrong with putting >> per-zone data in ... 'struct zone'? > Will it be acceptable to add fields in struct zone, when they will only > be used by page hinting? Wait a sec... MAX_NR_ZONES the number of zone types not the maximum number of *zones* in the system. Did you test this on a NUMA system? In any case, yes, you can put these in 'struct zone'. It will waste less space that way, on average, than what you have here (one you scale it to MAX_NR_ZONE*MAX_NUM_NODES. >> The cover letter claims that it >> doesn't touch core-mm infrastructure, but if it depends on mechanisms >> like this, I think that's a very bad thing. >> >> To be honest, I'm not sure this series is worth reviewing at this point. >> It's horribly lightly commented and full of kernel antipatterns lik >> >> void func() >> { >> if () { >> ... indent entire logic >> ... of function >> } >> } > I usually run checkpatch to detect such indentation issues. For the > patches, I shared it didn't show me any issues. Just because checkpatch doesn't complain does not mean it is good form. We write the above as: void func() { if (!something) goto out; ... logic of function here out: // cleanup }
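[Editorial note: applied to this series, that shape would look roughly like the following for page_hinting_enable(). This is a sketch of the structure only, with the zone walk and bitmap allocation elided; identifiers are spelled as in the posted patch.]

int page_hinting_enable(const struct page_hinting_config *conf)
{
	int ret = -EBUSY;

	mutex_lock(&page_hinting_init);
	if (page_hitning_conf)
		goto out;		/* already enabled: bail out early */

	/* ...zone walk and bitmap allocation, now at a single indent level... */

	page_hitning_conf = conf;
	INIT_WORK(&hinting_work, init_hinting_wq);
	ret = 0;
out:
	mutex_unlock(&page_hinting_init);
	return ret;
}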
On 7/11/19 12:22 PM, Dave Hansen wrote: > On 7/11/19 8:25 AM, Nitesh Narayan Lal wrote: >> On 7/10/19 4:45 PM, Dave Hansen wrote: >>> On 7/10/19 12:51 PM, Nitesh Narayan Lal wrote: >>>> +struct zone_free_area { >>>> + unsigned long *bitmap; >>>> + unsigned long base_pfn; >>>> + unsigned long end_pfn; >>>> + atomic_t free_pages; >>>> + unsigned long nbits; >>>> +} free_area[MAX_NR_ZONES]; >>> Why do we need an extra data structure. What's wrong with putting >>> per-zone data in ... 'struct zone'? >> Will it be acceptable to add fields in struct zone, when they will only >> be used by page hinting? > Wait a sec... MAX_NR_ZONES the number of zone types not the maximum > number of *zones* in the system. > > Did you test this on a NUMA system? Yes, I tested it with a guest having 2 and 3 NUMA nodes. > In any case, yes, you can put these in 'struct zone'. It will waste > less space that way, on average, than what you have here (one you scale > it to MAX_NR_ZONE*MAX_NUM_NODES. >>> The cover letter claims that it >>> doesn't touch core-mm infrastructure, but if it depends on mechanisms >>> like this, I think that's a very bad thing. >>> >>> To be honest, I'm not sure this series is worth reviewing at this point. >>> It's horribly lightly commented and full of kernel antipatterns lik >>> >>> void func() >>> { >>> if () { >>> ... indent entire logic >>> ... of function >>> } >>> } >> I usually run checkpatch to detect such indentation issues. For the >> patches, I shared it didn't show me any issues. > Just because checkpatch doesn't complain does not mean it is good form. > We write the above as: > > void func() > { > if (!something) > goto out; > > ... logic of function here > out: > // cleanup > } Yeap, I got it. I will correct this.
On 7/11/19 9:36 AM, Nitesh Narayan Lal wrote: >>>>> +struct zone_free_area { >>>>> + unsigned long *bitmap; >>>>> + unsigned long base_pfn; >>>>> + unsigned long end_pfn; >>>>> + atomic_t free_pages; >>>>> + unsigned long nbits; >>>>> +} free_area[MAX_NR_ZONES]; >>>> Why do we need an extra data structure. What's wrong with putting >>>> per-zone data in ... 'struct zone'? >>> Will it be acceptable to add fields in struct zone, when they will only >>> be used by page hinting? >> Wait a sec... MAX_NR_ZONES the number of zone types not the maximum >> number of *zones* in the system. >> >> Did you test this on a NUMA system? > Yes, I tested it with a guest having 2 and 3 NUMA nodes. How can this *possibly* have worked? Won't each same-typed zone just use the same free_area[] entry since zone_idx(zone1)==zone_idx(zone2) if zone1 and zone2 are (for example) both ZONE_NORMAL?
On 7/11/19 12:45 PM, Dave Hansen wrote: > On 7/11/19 9:36 AM, Nitesh Narayan Lal wrote: >>>>>> +struct zone_free_area { >>>>>> + unsigned long *bitmap; >>>>>> + unsigned long base_pfn; >>>>>> + unsigned long end_pfn; >>>>>> + atomic_t free_pages; >>>>>> + unsigned long nbits; >>>>>> +} free_area[MAX_NR_ZONES]; >>>>> Why do we need an extra data structure. What's wrong with putting >>>>> per-zone data in ... 'struct zone'? >>>> Will it be acceptable to add fields in struct zone, when they will only >>>> be used by page hinting? >>> Wait a sec... MAX_NR_ZONES the number of zone types not the maximum >>> number of *zones* in the system. >>> >>> Did you test this on a NUMA system? >> Yes, I tested it with a guest having 2 and 3 NUMA nodes. > How can this *possibly* have worked? > > Won't each same-typed zone just use the same free_area[] entry since > zone_idx(zone1)==zone_idx(zone2) if zone1 and zone2 are (for example) > both ZONE_NORMAL? Yes. However, the base_pfn and end_pfn will be updated with the zone1's base and zone2s end_pfn value from page_hinting_enable(). Isn't?
On 7/10/19 5:56 PM, Alexander Duyck wrote: > On Wed, Jul 10, 2019 at 12:52 PM Nitesh Narayan Lal <nitesh@redhat.com> wrote: >> This patch introduces the core infrastructure for free page hinting in >> virtual environments. It enables the kernel to track the free pages which >> can be reported to its hypervisor so that the hypervisor could >> free and reuse that memory as per its requirement. >> >> While the pages are getting processed in the hypervisor (e.g., >> via MADV_FREE), the guest must not use them, otherwise, data loss >> would be possible. To avoid such a situation, these pages are >> temporarily removed from the buddy. The amount of pages removed >> temporarily from the buddy is governed by the backend(virtio-balloon >> in our case). >> >> To efficiently identify free pages that can to be hinted to the >> hypervisor, bitmaps in a coarse granularity are used. Only fairly big >> chunks are reported to the hypervisor - especially, to not break up THP >> in the hypervisor - "MAX_ORDER - 2" on x86, and to save space. The bits >> in the bitmap are an indication whether a page *might* be free, not a >> guarantee. A new hook after buddy merging sets the bits. >> >> Bitmaps are stored per zone, protected by the zone lock. A workqueue >> asynchronously processes the bitmaps, trying to isolate and report pages >> that are still free. The backend (virtio-balloon) is responsible for >> reporting these batched pages to the host synchronously. Once reporting/ >> freeing is complete, isolated pages are returned back to the buddy. >> >> There are still various things to look into (e.g., memory hotplug, more >> efficient locking, possible races when disabling). >> >> Signed-off-by: Nitesh Narayan Lal <nitesh@redhat.com> >> --- >> include/linux/page_hinting.h | 45 +++++++ >> mm/Kconfig | 6 + >> mm/Makefile | 1 + >> mm/page_alloc.c | 18 +-- >> mm/page_hinting.c | 250 +++++++++++++++++++++++++++++++++++ >> 5 files changed, 312 insertions(+), 8 deletions(-) >> create mode 100644 include/linux/page_hinting.h >> create mode 100644 mm/page_hinting.c >> >> diff --git a/include/linux/page_hinting.h b/include/linux/page_hinting.h >> new file mode 100644 >> index 000000000000..4900feb796f9 >> --- /dev/null >> +++ b/include/linux/page_hinting.h >> @@ -0,0 +1,45 @@ >> +/* SPDX-License-Identifier: GPL-2.0 */ >> +#ifndef _LINUX_PAGE_HINTING_H >> +#define _LINUX_PAGE_HINTING_H >> + >> +/* >> + * Minimum page order required for a page to be hinted to the host. >> + */ >> +#define PAGE_HINTING_MIN_ORDER (MAX_ORDER - 2) >> + > Why use (MAX_ORDER - 2)? Is this just because of the issues I pointed > out earlier for is it due to something else? I'm just wondering if > this will have an impact on architectures outside of x86 as I had > chose pageblock_order which happened to be MAX_ORDER - 2 on x86, but I > don't know that the impact of doing that is on other architectures > versus the (MAX_ORDER - 2) approach you took here. If I am not wrong then any order < (MAX_ORDER - 2) will break the THP. That's one reason we decided to stick with this. > >> +/* >> + * struct page_hinting_config: holds the information supplied by the balloon >> + * device to page hinting. >> + * @hint_pages: Callback which reports the isolated pages >> + * synchornously to the host. >> + * @max_pages: Maxmimum pages that are going to be hinted to the host >> + * at a time of granularity >= PAGE_HINTING_MIN_ORDER. 
>> + */ >> +struct page_hinting_config { >> + void (*hint_pages)(struct list_head *list); >> + int max_pages; >> +}; >> + >> +extern int __isolate_free_page(struct page *page, unsigned int order); >> +extern void __free_one_page(struct page *page, unsigned long pfn, >> + struct zone *zone, unsigned int order, >> + int migratetype, bool hint); >> +#ifdef CONFIG_PAGE_HINTING >> +void page_hinting_enqueue(struct page *page, int order); >> +int page_hinting_enable(const struct page_hinting_config *conf); >> +void page_hinting_disable(void); >> +#else >> +static inline void page_hinting_enqueue(struct page *page, int order) >> +{ >> +} >> + >> +static inline int page_hinting_enable(const struct page_hinting_config *conf) >> +{ >> + return -EOPNOTSUPP; >> +} >> + >> +static inline void page_hinting_disable(void) >> +{ >> +} >> +#endif >> +#endif /* _LINUX_PAGE_HINTING_H */ >> diff --git a/mm/Kconfig b/mm/Kconfig >> index f0c76ba47695..e97fab429d9b 100644 >> --- a/mm/Kconfig >> +++ b/mm/Kconfig >> @@ -765,4 +765,10 @@ config GUP_BENCHMARK >> config ARCH_HAS_PTE_SPECIAL >> bool >> >> +# PAGE_HINTING will allow the guest to report the free pages to the >> +# host in fixed chunks as soon as the threshold is reached. >> +config PAGE_HINTING >> + bool >> + def_bool n >> + depends on X86_64 >> endmenu > If there are no issue with using the term "PAGE_HINTING" I guess I > will update my patch set to use that term instead of aeration. Not sure, at places like virtio_balloon, we may have to think of something else to avoid any confusion. > >> diff --git a/mm/Makefile b/mm/Makefile >> index ac5e5ba78874..73be49177656 100644 >> --- a/mm/Makefile >> +++ b/mm/Makefile >> @@ -94,6 +94,7 @@ obj-$(CONFIG_Z3FOLD) += z3fold.o >> obj-$(CONFIG_GENERIC_EARLY_IOREMAP) += early_ioremap.o >> obj-$(CONFIG_CMA) += cma.o >> obj-$(CONFIG_MEMORY_BALLOON) += balloon_compaction.o >> +obj-$(CONFIG_PAGE_HINTING) += page_hinting.o >> obj-$(CONFIG_PAGE_EXTENSION) += page_ext.o >> obj-$(CONFIG_CMA_DEBUGFS) += cma_debug.o >> obj-$(CONFIG_USERFAULTFD) += userfaultfd.o >> diff --git a/mm/page_alloc.c b/mm/page_alloc.c >> index d66bc8abe0af..8a44338bd04e 100644 >> --- a/mm/page_alloc.c >> +++ b/mm/page_alloc.c >> @@ -69,6 +69,7 @@ >> #include <linux/lockdep.h> >> #include <linux/nmi.h> >> #include <linux/psi.h> >> +#include <linux/page_hinting.h> >> >> #include <asm/sections.h> >> #include <asm/tlbflush.h> >> @@ -874,10 +875,10 @@ compaction_capture(struct capture_control *capc, struct page *page, >> * -- nyc >> */ >> >> -static inline void __free_one_page(struct page *page, >> +inline void __free_one_page(struct page *page, >> unsigned long pfn, >> struct zone *zone, unsigned int order, >> - int migratetype) >> + int migratetype, bool hint) >> { >> unsigned long combined_pfn; >> unsigned long uninitialized_var(buddy_pfn); >> @@ -980,7 +981,8 @@ static inline void __free_one_page(struct page *page, >> migratetype); >> else >> add_to_free_area(page, &zone->free_area[order], migratetype); >> - >> + if (hint) >> + page_hinting_enqueue(page, order); >> } > I'm not sure I am a fan of the way the word "hint" is used here. At > first I thought this was supposed to be !hint since I thought hint > meant that it was a hinted page, not that we need to record that this > page has been freed. Maybe "record" or "report" might be a better word > to use here. "hint" basically means that the page is supposed to be hinted. 
>> /* >> @@ -1263,7 +1265,7 @@ static void free_pcppages_bulk(struct zone *zone, int count, >> if (unlikely(isolated_pageblocks)) >> mt = get_pageblock_migratetype(page); >> >> - __free_one_page(page, page_to_pfn(page), zone, 0, mt); >> + __free_one_page(page, page_to_pfn(page), zone, 0, mt, true); >> trace_mm_page_pcpu_drain(page, 0, mt); >> } >> spin_unlock(&zone->lock); >> @@ -1272,14 +1274,14 @@ static void free_pcppages_bulk(struct zone *zone, int count, >> static void free_one_page(struct zone *zone, >> struct page *page, unsigned long pfn, >> unsigned int order, >> - int migratetype) >> + int migratetype, bool hint) >> { >> spin_lock(&zone->lock); >> if (unlikely(has_isolate_pageblock(zone) || >> is_migrate_isolate(migratetype))) { >> migratetype = get_pfnblock_migratetype(page, pfn); >> } >> - __free_one_page(page, pfn, zone, order, migratetype); >> + __free_one_page(page, pfn, zone, order, migratetype, hint); >> spin_unlock(&zone->lock); >> } >> >> @@ -1369,7 +1371,7 @@ static void __free_pages_ok(struct page *page, unsigned int order) >> migratetype = get_pfnblock_migratetype(page, pfn); >> local_irq_save(flags); >> __count_vm_events(PGFREE, 1 << order); >> - free_one_page(page_zone(page), page, pfn, order, migratetype); >> + free_one_page(page_zone(page), page, pfn, order, migratetype, true); >> local_irq_restore(flags); >> } >> >> @@ -2969,7 +2971,7 @@ static void free_unref_page_commit(struct page *page, unsigned long pfn) >> */ >> if (migratetype >= MIGRATE_PCPTYPES) { >> if (unlikely(is_migrate_isolate(migratetype))) { >> - free_one_page(zone, page, pfn, 0, migratetype); >> + free_one_page(zone, page, pfn, 0, migratetype, true); >> return; >> } >> migratetype = MIGRATE_MOVABLE; >> diff --git a/mm/page_hinting.c b/mm/page_hinting.c >> new file mode 100644 >> index 000000000000..0bfa09f8c3ed >> --- /dev/null >> +++ b/mm/page_hinting.c >> @@ -0,0 +1,250 @@ >> +// SPDX-License-Identifier: GPL-2.0 >> +/* >> + * Page hinting core infrastructure to enable a VM to report free pages to its >> + * hypervisor. >> + * >> + * Copyright Red Hat, Inc. 2019 >> + * >> + * Author(s): Nitesh Narayan Lal <nitesh@redhat.com> >> + */ >> + >> +#include <linux/mm.h> >> +#include <linux/slab.h> >> +#include <linux/page_hinting.h> >> +#include <linux/kvm_host.h> >> + >> +/* >> + * struct zone_free_area: For a single zone across NUMA nodes, it holds the >> + * bitmap pointer to track the free pages and other required parameters >> + * used to recover these pages by scanning the bitmap. >> + * @bitmap: Pointer to the bitmap in PAGE_HINTING_MIN_ORDER >> + * granularity. >> + * @base_pfn: Starting PFN value for the zone whose bitmap is stored. >> + * @end_pfn: Indicates the last PFN value for the zone. >> + * @free_pages: Tracks the number of free pages of granularity >> + * PAGE_HINTING_MIN_ORDER. >> + * @nbits: Indicates the total size of the bitmap in bits allocated >> + * at the time of initialization. >> + */ >> +struct zone_free_area { >> + unsigned long *bitmap; >> + unsigned long base_pfn; >> + unsigned long end_pfn; >> + atomic_t free_pages; >> + unsigned long nbits; >> +} free_area[MAX_NR_ZONES]; >> + > You still haven't addressed the NUMA issue I pointed out with v10. You > are only able to address the first set of zones with this setup. As > such you can end up missing large sections of memory if it is split > over multiple nodes. I think I did. 
> >> +static void init_hinting_wq(struct work_struct *work); >> +static DEFINE_MUTEX(page_hinting_init); >> +const struct page_hinting_config *page_hitning_conf; >> +struct work_struct hinting_work; >> +atomic_t page_hinting_active; >> + >> +void free_area_cleanup(int nr_zones) >> +{ > I'm not sure why you are passing nr_zones as an argument here. Won't > this always be MAX_NR_ZONES? free_area_cleanup() gets called from page_hinting_disable() and page_hinting_enable(). In page_hinting_enable() when the allocation fails we may not have to perform cleanup for all the zones everytime. > >> + int zone_idx; >> + >> + for (zone_idx = 0; zone_idx < nr_zones; zone_idx++) { >> + bitmap_free(free_area[zone_idx].bitmap); >> + free_area[zone_idx].base_pfn = 0; >> + free_area[zone_idx].end_pfn = 0; >> + free_area[zone_idx].nbits = 0; >> + atomic_set(&free_area[zone_idx].free_pages, 0); >> + } >> +} >> + >> +int page_hinting_enable(const struct page_hinting_config *conf) >> +{ >> + unsigned long bitmap_size = 0; >> + int zone_idx = 0, ret = -EBUSY; >> + struct zone *zone; >> + >> + mutex_lock(&page_hinting_init); >> + if (!page_hitning_conf) { >> + for_each_populated_zone(zone) { > So for_each_populated_zone will go through all of the NUMA nodes. So > if I am not mistaken you will overwrite the free_area values of all > the previous nodes with the last node in the system. Not sure if I understood. > So if we have a > setup that has all the memory in the first node, and none in the > second it would effectively disable free page hinting would it not? Why will it happen? The base_pfn will still be pointing to the base_pfn of the first node. Isn't? > >> + zone_idx = zone_idx(zone); >> +#ifdef CONFIG_ZONE_DEVICE >> + if (zone_idx == ZONE_DEVICE) >> + continue; >> +#endif >> + spin_lock(&zone->lock); >> + if (free_area[zone_idx].base_pfn) { >> + free_area[zone_idx].base_pfn = >> + min(free_area[zone_idx].base_pfn, >> + zone->zone_start_pfn); >> + free_area[zone_idx].end_pfn = >> + max(free_area[zone_idx].end_pfn, >> + zone->zone_start_pfn + >> + zone->spanned_pages); >> + } else { >> + free_area[zone_idx].base_pfn = >> + zone->zone_start_pfn; >> + free_area[zone_idx].end_pfn = >> + zone->zone_start_pfn + >> + zone->spanned_pages; >> + } >> + spin_unlock(&zone->lock); >> + } >> + >> + for (zone_idx = 0; zone_idx < MAX_NR_ZONES; zone_idx++) { >> + unsigned long pages = free_area[zone_idx].end_pfn - >> + free_area[zone_idx].base_pfn; >> + bitmap_size = (pages >> PAGE_HINTING_MIN_ORDER) + 1; >> + if (!bitmap_size) >> + continue; >> + free_area[zone_idx].bitmap = bitmap_zalloc(bitmap_size, >> + GFP_KERNEL); >> + if (!free_area[zone_idx].bitmap) { >> + free_area_cleanup(zone_idx); >> + mutex_unlock(&page_hinting_init); >> + return -ENOMEM; >> + } >> + free_area[zone_idx].nbits = bitmap_size; >> + } > So this is the bit that still needs to address hotplug right? Yes, hotplug still needs to be addressed. > I would > imagine you need to reallocate this if the spanned_pages range changes > correct? 
> >> + page_hitning_conf = conf; >> + INIT_WORK(&hinting_work, init_hinting_wq); >> + ret = 0; >> + } >> + mutex_unlock(&page_hinting_init); >> + return ret; >> +} >> +EXPORT_SYMBOL_GPL(page_hinting_enable); >> + >> +void page_hinting_disable(void) >> +{ >> + cancel_work_sync(&hinting_work); >> + page_hitning_conf = NULL; >> + free_area_cleanup(MAX_NR_ZONES); >> +} >> +EXPORT_SYMBOL_GPL(page_hinting_disable); >> + >> +static unsigned long pfn_to_bit(struct page *page, int zone_idx) >> +{ >> + unsigned long bitnr; >> + >> + bitnr = (page_to_pfn(page) - free_area[zone_idx].base_pfn) >> + >> PAGE_HINTING_MIN_ORDER; >> + return bitnr; >> +} >> + >> +static void release_buddy_pages(struct list_head *pages) >> +{ >> + int mt = 0, zone_idx, order; >> + struct page *page, *next; >> + unsigned long bitnr; >> + struct zone *zone; >> + >> + list_for_each_entry_safe(page, next, pages, lru) { >> + zone_idx = page_zonenum(page); >> + zone = page_zone(page); >> + bitnr = pfn_to_bit(page, zone_idx); >> + spin_lock(&zone->lock); >> + list_del(&page->lru); >> + order = page_private(page); >> + set_page_private(page, 0); >> + mt = get_pageblock_migratetype(page); >> + __free_one_page(page, page_to_pfn(page), zone, >> + order, mt, false); >> + spin_unlock(&zone->lock); >> + } >> +} >> + >> +static void bm_set_pfn(struct page *page) >> +{ >> + struct zone *zone = page_zone(page); >> + int zone_idx = page_zonenum(page); >> + unsigned long bitnr = 0; >> + >> + lockdep_assert_held(&zone->lock); >> + bitnr = pfn_to_bit(page, zone_idx); >> + /* >> + * TODO: fix possible underflows. >> + */ >> + if (free_area[zone_idx].bitmap && >> + bitnr < free_area[zone_idx].nbits && >> + !test_and_set_bit(bitnr, free_area[zone_idx].bitmap)) >> + atomic_inc(&free_area[zone_idx].free_pages); >> +} >> + >> +static void scan_zone_free_area(int zone_idx, int free_pages) >> +{ >> + int ret = 0, order, isolated_cnt = 0; >> + unsigned long set_bit, start = 0; >> + LIST_HEAD(isolated_pages); >> + struct page *page; >> + struct zone *zone; >> + >> + for (;;) { >> + ret = 0; >> + set_bit = find_next_bit(free_area[zone_idx].bitmap, >> + free_area[zone_idx].nbits, start); >> + if (set_bit >= free_area[zone_idx].nbits) >> + break; >> + page = pfn_to_online_page((set_bit << PAGE_HINTING_MIN_ORDER) + >> + free_area[zone_idx].base_pfn); >> + if (!page) >> + continue; >> + zone = page_zone(page); >> + spin_lock(&zone->lock); >> + >> + if (PageBuddy(page) && page_private(page) >= >> + PAGE_HINTING_MIN_ORDER) { >> + order = page_private(page); >> + ret = __isolate_free_page(page, order); >> + } >> + clear_bit(set_bit, free_area[zone_idx].bitmap); >> + atomic_dec(&free_area[zone_idx].free_pages); >> + spin_unlock(&zone->lock); >> + if (ret) { >> + /* >> + * restoring page order to use it while releasing >> + * the pages back to the buddy. >> + */ >> + set_page_private(page, order); >> + list_add_tail(&page->lru, &isolated_pages); >> + isolated_cnt++; >> + if (isolated_cnt == page_hitning_conf->max_pages) { >> + page_hitning_conf->hint_pages(&isolated_pages); >> + release_buddy_pages(&isolated_pages); >> + isolated_cnt = 0; >> + } >> + } >> + start = set_bit + 1; >> + } >> + if (isolated_cnt) { >> + page_hitning_conf->hint_pages(&isolated_pages); >> + release_buddy_pages(&isolated_pages); >> + } >> +} >> + > I really worry that this loop is going to become more expensive as the > size of memory increases. For example if we hint on just 16 pages we > would have to walk something like 4K bits, 512 longs, if a system had > 64G of memory. 
Have you considered testing with a larger memory > footprint to see if it has an impact on performance? I am hoping this will be noticeable in will-it-scale's page_fault1, if I run it on a larger system? > >> +static void init_hinting_wq(struct work_struct *work) >> +{ >> + int zone_idx, free_pages; >> + >> + atomic_set(&page_hinting_active, 1); >> + for (zone_idx = 0; zone_idx < MAX_NR_ZONES; zone_idx++) { >> + free_pages = atomic_read(&free_area[zone_idx].free_pages); >> + if (free_pages >= page_hitning_conf->max_pages) >> + scan_zone_free_area(zone_idx, free_pages); >> + } >> + atomic_set(&page_hinting_active, 0); >> +} >> + >> +void page_hinting_enqueue(struct page *page, int order) >> +{ >> + int zone_idx; >> + >> + if (!page_hitning_conf || order < PAGE_HINTING_MIN_ORDER) >> + return; > I would think it is going to be expensive to be jumping into this > function for every freed page. You should probably have an inline > taking care of the order check before you even get here since it would > be faster that way. I see, I can take a look. Thanks. > >> + >> + bm_set_pfn(page); >> + if (atomic_read(&page_hinting_active)) >> + return; > So I would think this piece is racy. Specifically if you set a PFN > that is somewhere below the PFN you are currently processing in your > scan it is going to remain unset until you have another page freed > after the scan is completed. I would worry you can end up with a batch > free of memory resulting in a group of pages sitting at the start of > your bitmap unhinted. True, but that will be hinted next time threshold is met. > > In my patches I resolved this by looping through all of the zones, > however your approach is missing the necessary pieces to make that > safe as you could end up in a soft lockup with the scanning thread > spinning on a noisy system. > >> + zone_idx = zone_idx(page_zone(page)); >> + if (atomic_read(&free_area[zone_idx].free_pages) >= >> + page_hitning_conf->max_pages) { >> + int cpu = smp_processor_id(); >> + >> + queue_work_on(cpu, system_wq, &hinting_work); >> + } >> +}
On 7/10/19 12:51 PM, Nitesh Narayan Lal wrote: > +static void bm_set_pfn(struct page *page) > +{ > + struct zone *zone = page_zone(page); > + int zone_idx = page_zonenum(page); > + unsigned long bitnr = 0; > + > + lockdep_assert_held(&zone->lock); > + bitnr = pfn_to_bit(page, zone_idx); > + /* > + * TODO: fix possible underflows. > + */ > + if (free_area[zone_idx].bitmap && > + bitnr < free_area[zone_idx].nbits && > + !test_and_set_bit(bitnr, free_area[zone_idx].bitmap)) > + atomic_inc(&free_area[zone_idx].free_pages); > +} Let's say I have two NUMA nodes, each with ZONE_NORMAL and ZONE_MOVABLE and each zone with 1GB of memory: Node: 0 1 NORMAL 0->1GB 2->3GB MOVABLE 1->2GB 3->4GB This code will allocate two bitmaps. The ZONE_NORMAL bitmap will represent data from 0->3GB and the ZONE_MOVABLE bitmap will represent data from 1->4GB. That's the result of this code: > + if (free_area[zone_idx].base_pfn) { > + free_area[zone_idx].base_pfn = > + min(free_area[zone_idx].base_pfn, > + zone->zone_start_pfn); > + free_area[zone_idx].end_pfn = > + max(free_area[zone_idx].end_pfn, > + zone->zone_start_pfn + > + zone->spanned_pages); But that means that both bitmaps will have space for PFNs in the other zone type, which is completely bogus. This is fundamental because the data structures are incorrectly built per zone *type* instead of per zone.
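[Editorial note: a rough sketch of the per-zone alternative being suggested here. The field names are illustrative only and do not come from any posted patch; the point is that state kept in each struct zone gives every node's ZONE_NORMAL and ZONE_MOVABLE its own bitmap covering only its own span.]

/* include/linux/mmzone.h -- illustrative only */
struct zone {
	/* ... existing fields ... */
#ifdef CONFIG_PAGE_HINTING
	unsigned long	*hint_bitmap;		/* one bit per MIN_ORDER chunk */
	unsigned long	hint_base_pfn;		/* zone_start_pfn when allocated */
	unsigned long	hint_nbits;
	atomic_t	hint_free_pages;
#endif
	/* ... */
};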
On Thu, Jul 11, 2019 at 10:58 AM Nitesh Narayan Lal <nitesh@redhat.com> wrote: > > > On 7/10/19 5:56 PM, Alexander Duyck wrote: > > On Wed, Jul 10, 2019 at 12:52 PM Nitesh Narayan Lal <nitesh@redhat.com> wrote: > >> This patch introduces the core infrastructure for free page hinting in > >> virtual environments. It enables the kernel to track the free pages which > >> can be reported to its hypervisor so that the hypervisor could > >> free and reuse that memory as per its requirement. > >> > >> While the pages are getting processed in the hypervisor (e.g., > >> via MADV_FREE), the guest must not use them, otherwise, data loss > >> would be possible. To avoid such a situation, these pages are > >> temporarily removed from the buddy. The amount of pages removed > >> temporarily from the buddy is governed by the backend(virtio-balloon > >> in our case). > >> > >> To efficiently identify free pages that can to be hinted to the > >> hypervisor, bitmaps in a coarse granularity are used. Only fairly big > >> chunks are reported to the hypervisor - especially, to not break up THP > >> in the hypervisor - "MAX_ORDER - 2" on x86, and to save space. The bits > >> in the bitmap are an indication whether a page *might* be free, not a > >> guarantee. A new hook after buddy merging sets the bits. > >> > >> Bitmaps are stored per zone, protected by the zone lock. A workqueue > >> asynchronously processes the bitmaps, trying to isolate and report pages > >> that are still free. The backend (virtio-balloon) is responsible for > >> reporting these batched pages to the host synchronously. Once reporting/ > >> freeing is complete, isolated pages are returned back to the buddy. > >> > >> There are still various things to look into (e.g., memory hotplug, more > >> efficient locking, possible races when disabling). > >> > >> Signed-off-by: Nitesh Narayan Lal <nitesh@redhat.com> So just FYI, I thought I would try the patches. It looks like there might be a bug somewhere that is causing it to free memory it shouldn't be. After about 10 minutes my VM crashed with a system log full of various NULL pointer dereferences. The only change I had made is to use MADV_DONTNEED instead of MADV_FREE in QEMU since my headers didn't have MADV_FREE on the host. It occurs to me one advantage of MADV_DONTNEED over MADV_FREE is that you are more likely to catch these sort of errors since it zeros the pages instead of leaving them intact. > >> --- > >> include/linux/page_hinting.h | 45 +++++++ > >> mm/Kconfig | 6 + > >> mm/Makefile | 1 + > >> mm/page_alloc.c | 18 +-- > >> mm/page_hinting.c | 250 +++++++++++++++++++++++++++++++++++ > >> 5 files changed, 312 insertions(+), 8 deletions(-) > >> create mode 100644 include/linux/page_hinting.h > >> create mode 100644 mm/page_hinting.c > >> > >> diff --git a/include/linux/page_hinting.h b/include/linux/page_hinting.h > >> new file mode 100644 > >> index 000000000000..4900feb796f9 > >> --- /dev/null > >> +++ b/include/linux/page_hinting.h > >> @@ -0,0 +1,45 @@ > >> +/* SPDX-License-Identifier: GPL-2.0 */ > >> +#ifndef _LINUX_PAGE_HINTING_H > >> +#define _LINUX_PAGE_HINTING_H > >> + > >> +/* > >> + * Minimum page order required for a page to be hinted to the host. > >> + */ > >> +#define PAGE_HINTING_MIN_ORDER (MAX_ORDER - 2) > >> + > > Why use (MAX_ORDER - 2)? Is this just because of the issues I pointed > > out earlier for is it due to something else? 
I'm just wondering if > > this will have an impact on architectures outside of x86 as I had > > chose pageblock_order which happened to be MAX_ORDER - 2 on x86, but I > > don't know that the impact of doing that is on other architectures > > versus the (MAX_ORDER - 2) approach you took here. > If I am not wrong then any order < (MAX_ORDER - 2) will break the THP. > That's one reason we decided to stick with this. That is true for x86, but I don't think that is true for other architectures. That is why I went with pageblock_order instead of just using a fixed value such as MAX_ORDER - 2. <snip> > >> diff --git a/mm/page_hinting.c b/mm/page_hinting.c > >> new file mode 100644 > >> index 000000000000..0bfa09f8c3ed > >> --- /dev/null > >> +++ b/mm/page_hinting.c > >> @@ -0,0 +1,250 @@ > >> +// SPDX-License-Identifier: GPL-2.0 > >> +/* > >> + * Page hinting core infrastructure to enable a VM to report free pages to its > >> + * hypervisor. > >> + * > >> + * Copyright Red Hat, Inc. 2019 > >> + * > >> + * Author(s): Nitesh Narayan Lal <nitesh@redhat.com> > >> + */ > >> + > >> +#include <linux/mm.h> > >> +#include <linux/slab.h> > >> +#include <linux/page_hinting.h> > >> +#include <linux/kvm_host.h> > >> + > >> +/* > >> + * struct zone_free_area: For a single zone across NUMA nodes, it holds the > >> + * bitmap pointer to track the free pages and other required parameters > >> + * used to recover these pages by scanning the bitmap. > >> + * @bitmap: Pointer to the bitmap in PAGE_HINTING_MIN_ORDER > >> + * granularity. > >> + * @base_pfn: Starting PFN value for the zone whose bitmap is stored. > >> + * @end_pfn: Indicates the last PFN value for the zone. > >> + * @free_pages: Tracks the number of free pages of granularity > >> + * PAGE_HINTING_MIN_ORDER. > >> + * @nbits: Indicates the total size of the bitmap in bits allocated > >> + * at the time of initialization. > >> + */ > >> +struct zone_free_area { > >> + unsigned long *bitmap; > >> + unsigned long base_pfn; > >> + unsigned long end_pfn; > >> + atomic_t free_pages; > >> + unsigned long nbits; > >> +} free_area[MAX_NR_ZONES]; > >> + > > You still haven't addressed the NUMA issue I pointed out with v10. You > > are only able to address the first set of zones with this setup. As > > such you can end up missing large sections of memory if it is split > > over multiple nodes. > I think I did. I just realized what you did. Actually this doesn't really improve things in my opinion. More comments below. > > > >> +static void init_hinting_wq(struct work_struct *work); > >> +static DEFINE_MUTEX(page_hinting_init); > >> +const struct page_hinting_config *page_hitning_conf; > >> +struct work_struct hinting_work; > >> +atomic_t page_hinting_active; > >> + > >> +void free_area_cleanup(int nr_zones) > >> +{ > > I'm not sure why you are passing nr_zones as an argument here. Won't > > this always be MAX_NR_ZONES? > free_area_cleanup() gets called from page_hinting_disable() and > page_hinting_enable(). In page_hinting_enable() when the allocation > fails we may not have to perform cleanup for all the zones everytime. Just adding a NULL pointer check to this loop below would still keep it pretty cheap as the cost for initializing memory to 0 isn't that high, and this is slow path anyway. Either way I guess it works. You might want to reset the bitmap pointer to NULL though after you free it to more easily catch the double free case. 
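[Editorial note: the pointer reset suggested above is a one-line addition to the cleanup loop; this sketch is based on the loop as posted (quoted in the next hunk), with only the NULL assignment added. bitmap_free() already tolerates a NULL pointer, so a second cleanup pass becomes harmless.]

void free_area_cleanup(int nr_zones)
{
	int zone_idx;

	for (zone_idx = 0; zone_idx < nr_zones; zone_idx++) {
		bitmap_free(free_area[zone_idx].bitmap);
		free_area[zone_idx].bitmap = NULL;	/* make a double cleanup harmless */
		free_area[zone_idx].base_pfn = 0;
		free_area[zone_idx].end_pfn = 0;
		free_area[zone_idx].nbits = 0;
		atomic_set(&free_area[zone_idx].free_pages, 0);
	}
}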
> >> + int zone_idx; > >> + > >> + for (zone_idx = 0; zone_idx < nr_zones; zone_idx++) { > >> + bitmap_free(free_area[zone_idx].bitmap); > >> + free_area[zone_idx].base_pfn = 0; > >> + free_area[zone_idx].end_pfn = 0; > >> + free_area[zone_idx].nbits = 0; > >> + atomic_set(&free_area[zone_idx].free_pages, 0); > >> + } > >> +} > >> + > >> +int page_hinting_enable(const struct page_hinting_config *conf) > >> +{ > >> + unsigned long bitmap_size = 0; > >> + int zone_idx = 0, ret = -EBUSY; > >> + struct zone *zone; > >> + > >> + mutex_lock(&page_hinting_init); > >> + if (!page_hitning_conf) { > >> + for_each_populated_zone(zone) { > > So for_each_populated_zone will go through all of the NUMA nodes. So > > if I am not mistaken you will overwrite the free_area values of all > > the previous nodes with the last node in the system. > Not sure if I understood. I misread the code. More comments below. > > So if we have a > > setup that has all the memory in the first node, and none in the > > second it would effectively disable free page hinting would it not? > Why will it happen? The base_pfn will still be pointing to the base_pfn > of the first node. Isn't? So this does address my concern however, it introduces a new issue. Specifically you could end up introducing a gap of unused bits if the memory from one zone is not immediately adjacent to another. This gets back to the SPARSEMEM issue that I think Dave pointed out. <snip> > >> +static void scan_zone_free_area(int zone_idx, int free_pages) > >> +{ > >> + int ret = 0, order, isolated_cnt = 0; > >> + unsigned long set_bit, start = 0; > >> + LIST_HEAD(isolated_pages); > >> + struct page *page; > >> + struct zone *zone; > >> + > >> + for (;;) { > >> + ret = 0; > >> + set_bit = find_next_bit(free_area[zone_idx].bitmap, > >> + free_area[zone_idx].nbits, start); > >> + if (set_bit >= free_area[zone_idx].nbits) > >> + break; > >> + page = pfn_to_online_page((set_bit << PAGE_HINTING_MIN_ORDER) + > >> + free_area[zone_idx].base_pfn); > >> + if (!page) > >> + continue; > >> + zone = page_zone(page); > >> + spin_lock(&zone->lock); > >> + > >> + if (PageBuddy(page) && page_private(page) >= > >> + PAGE_HINTING_MIN_ORDER) { > >> + order = page_private(page); > >> + ret = __isolate_free_page(page, order); > >> + } > >> + clear_bit(set_bit, free_area[zone_idx].bitmap); > >> + atomic_dec(&free_area[zone_idx].free_pages); > >> + spin_unlock(&zone->lock); > >> + if (ret) { > >> + /* > >> + * restoring page order to use it while releasing > >> + * the pages back to the buddy. > >> + */ > >> + set_page_private(page, order); > >> + list_add_tail(&page->lru, &isolated_pages); > >> + isolated_cnt++; > >> + if (isolated_cnt == page_hitning_conf->max_pages) { > >> + page_hitning_conf->hint_pages(&isolated_pages); > >> + release_buddy_pages(&isolated_pages); > >> + isolated_cnt = 0; > >> + } > >> + } > >> + start = set_bit + 1; > >> + } > >> + if (isolated_cnt) { > >> + page_hitning_conf->hint_pages(&isolated_pages); > >> + release_buddy_pages(&isolated_pages); > >> + } > >> +} > >> + > > I really worry that this loop is going to become more expensive as the > > size of memory increases. For example if we hint on just 16 pages we > > would have to walk something like 4K bits, 512 longs, if a system had > > 64G of memory. Have you considered testing with a larger memory > > footprint to see if it has an impact on performance? > I am hoping this will be noticeable in will-it-scale's page_fault1, if I > run it on a larger system? 
What you will probably see is that the CPU that is running the scan is going to be sitting at somewhere near 100% because I cannot see how it can hope to stay efficient if it has to check something like 512 64b longs searching for just a handful of idle pages. > > > >> +static void init_hinting_wq(struct work_struct *work) > >> +{ > >> + int zone_idx, free_pages; > >> + > >> + atomic_set(&page_hinting_active, 1); > >> + for (zone_idx = 0; zone_idx < MAX_NR_ZONES; zone_idx++) { > >> + free_pages = atomic_read(&free_area[zone_idx].free_pages); > >> + if (free_pages >= page_hitning_conf->max_pages) > >> + scan_zone_free_area(zone_idx, free_pages); > >> + } > >> + atomic_set(&page_hinting_active, 0); > >> +} > >> + > >> +void page_hinting_enqueue(struct page *page, int order) > >> +{ > >> + int zone_idx; > >> + > >> + if (!page_hitning_conf || order < PAGE_HINTING_MIN_ORDER) > >> + return; > > I would think it is going to be expensive to be jumping into this > > function for every freed page. You should probably have an inline > > taking care of the order check before you even get here since it would > > be faster that way. > I see, I can take a look. Thanks. > > > >> + > >> + bm_set_pfn(page); > >> + if (atomic_read(&page_hinting_active)) > >> + return; > > So I would think this piece is racy. Specifically if you set a PFN > > that is somewhere below the PFN you are currently processing in your > > scan it is going to remain unset until you have another page freed > > after the scan is completed. I would worry you can end up with a batch > > free of memory resulting in a group of pages sitting at the start of > > your bitmap unhinted. > True, but that will be hinted next time threshold is met. Yes, but that assumes that there is another free immediately coming. It is possible that you have a big application run and then immediately shut down and have it free all its memory at once. Worst case scenario would be that it starts by freeing from the end and works toward the start. With that you could theoretically end up with a significant chunk of memory waiting some time for another big free to come along.
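As a rough sketch of the inline-check idea above - untested, and __page_hinting_enqueue() is a made-up name for the out-of-line part that would keep the current body (including the page_hitning_conf check):

static inline void page_hinting_enqueue(struct page *page, int order)
{
        /* Filter the common low-order frees without leaving the caller. */
        if (order < PAGE_HINTING_MIN_ORDER)
                return;
        __page_hinting_enqueue(page, order);
}

That keeps the cost on the common free path down to a single comparison against a compile-time constant.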
On 7/11/19 7:20 PM, Alexander Duyck wrote: > On Thu, Jul 11, 2019 at 10:58 AM Nitesh Narayan Lal <nitesh@redhat.com> wrote: >> >> On 7/10/19 5:56 PM, Alexander Duyck wrote: >>> On Wed, Jul 10, 2019 at 12:52 PM Nitesh Narayan Lal <nitesh@redhat.com> wrote: >>>> This patch introduces the core infrastructure for free page hinting in >>>> virtual environments. It enables the kernel to track the free pages which >>>> can be reported to its hypervisor so that the hypervisor could >>>> free and reuse that memory as per its requirement. >>>> >>>> While the pages are getting processed in the hypervisor (e.g., >>>> via MADV_FREE), the guest must not use them, otherwise, data loss >>>> would be possible. To avoid such a situation, these pages are >>>> temporarily removed from the buddy. The amount of pages removed >>>> temporarily from the buddy is governed by the backend(virtio-balloon >>>> in our case). >>>> >>>> To efficiently identify free pages that can to be hinted to the >>>> hypervisor, bitmaps in a coarse granularity are used. Only fairly big >>>> chunks are reported to the hypervisor - especially, to not break up THP >>>> in the hypervisor - "MAX_ORDER - 2" on x86, and to save space. The bits >>>> in the bitmap are an indication whether a page *might* be free, not a >>>> guarantee. A new hook after buddy merging sets the bits. >>>> >>>> Bitmaps are stored per zone, protected by the zone lock. A workqueue >>>> asynchronously processes the bitmaps, trying to isolate and report pages >>>> that are still free. The backend (virtio-balloon) is responsible for >>>> reporting these batched pages to the host synchronously. Once reporting/ >>>> freeing is complete, isolated pages are returned back to the buddy. >>>> >>>> There are still various things to look into (e.g., memory hotplug, more >>>> efficient locking, possible races when disabling). >>>> >>>> Signed-off-by: Nitesh Narayan Lal <nitesh@redhat.com> > So just FYI, I thought I would try the patches. It looks like there > might be a bug somewhere that is causing it to free memory it > shouldn't be. After about 10 minutes my VM crashed with a system log > full of various NULL pointer dereferences. That's interesting, I have tried the patches with MADV_DONTNEED as well. I just retried it but didn't see any crash. May I know what kind of workload you are running? > The only change I had made > is to use MADV_DONTNEED instead of MADV_FREE in QEMU since my headers > didn't have MADV_FREE on the host. It occurs to me one advantage of > MADV_DONTNEED over MADV_FREE is that you are more likely to catch > these sort of errors since it zeros the pages instead of leaving them > intact. For development purpose maybe. For the final patch-set I think we discussed earlier why we should keep MADV_FREE. 
> >>>> --- >>>> include/linux/page_hinting.h | 45 +++++++ >>>> mm/Kconfig | 6 + >>>> mm/Makefile | 1 + >>>> mm/page_alloc.c | 18 +-- >>>> mm/page_hinting.c | 250 +++++++++++++++++++++++++++++++++++ >>>> 5 files changed, 312 insertions(+), 8 deletions(-) >>>> create mode 100644 include/linux/page_hinting.h >>>> create mode 100644 mm/page_hinting.c >>>> >>>> diff --git a/include/linux/page_hinting.h b/include/linux/page_hinting.h >>>> new file mode 100644 >>>> index 000000000000..4900feb796f9 >>>> --- /dev/null >>>> +++ b/include/linux/page_hinting.h >>>> @@ -0,0 +1,45 @@ >>>> +/* SPDX-License-Identifier: GPL-2.0 */ >>>> +#ifndef _LINUX_PAGE_HINTING_H >>>> +#define _LINUX_PAGE_HINTING_H >>>> + >>>> +/* >>>> + * Minimum page order required for a page to be hinted to the host. >>>> + */ >>>> +#define PAGE_HINTING_MIN_ORDER (MAX_ORDER - 2) >>>> + >>> Why use (MAX_ORDER - 2)? Is this just because of the issues I pointed >>> out earlier for is it due to something else? I'm just wondering if >>> this will have an impact on architectures outside of x86 as I had >>> chose pageblock_order which happened to be MAX_ORDER - 2 on x86, but I >>> don't know that the impact of doing that is on other architectures >>> versus the (MAX_ORDER - 2) approach you took here. >> If I am not wrong then any order < (MAX_ORDER - 2) will break the THP. >> That's one reason we decided to stick with this. > That is true for x86, but I don't think that is true for other > architectures. That is why I went with pageblock_order instead of just > using a fixed value such as MAX_ORDER - 2. I see, I will have to check this. > > <snip> > >>>> diff --git a/mm/page_hinting.c b/mm/page_hinting.c >>>> new file mode 100644 >>>> index 000000000000..0bfa09f8c3ed >>>> --- /dev/null >>>> +++ b/mm/page_hinting.c >>>> @@ -0,0 +1,250 @@ >>>> +// SPDX-License-Identifier: GPL-2.0 >>>> +/* >>>> + * Page hinting core infrastructure to enable a VM to report free pages to its >>>> + * hypervisor. >>>> + * >>>> + * Copyright Red Hat, Inc. 2019 >>>> + * >>>> + * Author(s): Nitesh Narayan Lal <nitesh@redhat.com> >>>> + */ >>>> + >>>> +#include <linux/mm.h> >>>> +#include <linux/slab.h> >>>> +#include <linux/page_hinting.h> >>>> +#include <linux/kvm_host.h> >>>> + >>>> +/* >>>> + * struct zone_free_area: For a single zone across NUMA nodes, it holds the >>>> + * bitmap pointer to track the free pages and other required parameters >>>> + * used to recover these pages by scanning the bitmap. >>>> + * @bitmap: Pointer to the bitmap in PAGE_HINTING_MIN_ORDER >>>> + * granularity. >>>> + * @base_pfn: Starting PFN value for the zone whose bitmap is stored. >>>> + * @end_pfn: Indicates the last PFN value for the zone. >>>> + * @free_pages: Tracks the number of free pages of granularity >>>> + * PAGE_HINTING_MIN_ORDER. >>>> + * @nbits: Indicates the total size of the bitmap in bits allocated >>>> + * at the time of initialization. >>>> + */ >>>> +struct zone_free_area { >>>> + unsigned long *bitmap; >>>> + unsigned long base_pfn; >>>> + unsigned long end_pfn; >>>> + atomic_t free_pages; >>>> + unsigned long nbits; >>>> +} free_area[MAX_NR_ZONES]; >>>> + >>> You still haven't addressed the NUMA issue I pointed out with v10. You >>> are only able to address the first set of zones with this setup. As >>> such you can end up missing large sections of memory if it is split >>> over multiple nodes. >> I think I did. > I just realized what you did. Actually this doesn't really improve > things in my opinion. More comments below. 
> >>>> +static void init_hinting_wq(struct work_struct *work); >>>> +static DEFINE_MUTEX(page_hinting_init); >>>> +const struct page_hinting_config *page_hitning_conf; >>>> +struct work_struct hinting_work; >>>> +atomic_t page_hinting_active; >>>> + >>>> +void free_area_cleanup(int nr_zones) >>>> +{ >>> I'm not sure why you are passing nr_zones as an argument here. Won't >>> this always be MAX_NR_ZONES? >> free_area_cleanup() gets called from page_hinting_disable() and >> page_hinting_enable(). In page_hinting_enable() when the allocation >> fails we may not have to perform cleanup for all the zones everytime. > Just adding a NULL pointer check to this loop below would still keep > it pretty cheap as the cost for initializing memory to 0 isn't that > high, and this is slow path anyway. Either way I guess it works. Yeah. > You > might want to reset the bitmap pointer to NULL though after you free > it to more easily catch the double free case. I think resetting the bitmap pointer to NULL is a good idea. Thanks. > >>>> + int zone_idx; >>>> + >>>> + for (zone_idx = 0; zone_idx < nr_zones; zone_idx++) { >>>> + bitmap_free(free_area[zone_idx].bitmap); >>>> + free_area[zone_idx].base_pfn = 0; >>>> + free_area[zone_idx].end_pfn = 0; >>>> + free_area[zone_idx].nbits = 0; >>>> + atomic_set(&free_area[zone_idx].free_pages, 0); >>>> + } >>>> +} >>>> + >>>> +int page_hinting_enable(const struct page_hinting_config *conf) >>>> +{ >>>> + unsigned long bitmap_size = 0; >>>> + int zone_idx = 0, ret = -EBUSY; >>>> + struct zone *zone; >>>> + >>>> + mutex_lock(&page_hinting_init); >>>> + if (!page_hitning_conf) { >>>> + for_each_populated_zone(zone) { >>> So for_each_populated_zone will go through all of the NUMA nodes. So >>> if I am not mistaken you will overwrite the free_area values of all >>> the previous nodes with the last node in the system. >> Not sure if I understood. > I misread the code. More comments below. > >>> So if we have a >>> setup that has all the memory in the first node, and none in the >>> second it would effectively disable free page hinting would it not? >> Why will it happen? The base_pfn will still be pointing to the base_pfn >> of the first node. Isn't? > So this does address my concern however, it introduces a new issue. > Specifically you could end up introducing a gap of unused bits if the > memory from one zone is not immediately adjacent to another. This gets > back to the SPARSEMEM issue that I think Dave pointed out. Yeah, he did point it out. It looks a valid issue, I will look into it. 
> > > <snip> > >>>> +static void scan_zone_free_area(int zone_idx, int free_pages) >>>> +{ >>>> + int ret = 0, order, isolated_cnt = 0; >>>> + unsigned long set_bit, start = 0; >>>> + LIST_HEAD(isolated_pages); >>>> + struct page *page; >>>> + struct zone *zone; >>>> + >>>> + for (;;) { >>>> + ret = 0; >>>> + set_bit = find_next_bit(free_area[zone_idx].bitmap, >>>> + free_area[zone_idx].nbits, start); >>>> + if (set_bit >= free_area[zone_idx].nbits) >>>> + break; >>>> + page = pfn_to_online_page((set_bit << PAGE_HINTING_MIN_ORDER) + >>>> + free_area[zone_idx].base_pfn); >>>> + if (!page) >>>> + continue; >>>> + zone = page_zone(page); >>>> + spin_lock(&zone->lock); >>>> + >>>> + if (PageBuddy(page) && page_private(page) >= >>>> + PAGE_HINTING_MIN_ORDER) { >>>> + order = page_private(page); >>>> + ret = __isolate_free_page(page, order); >>>> + } >>>> + clear_bit(set_bit, free_area[zone_idx].bitmap); >>>> + atomic_dec(&free_area[zone_idx].free_pages); >>>> + spin_unlock(&zone->lock); >>>> + if (ret) { >>>> + /* >>>> + * restoring page order to use it while releasing >>>> + * the pages back to the buddy. >>>> + */ >>>> + set_page_private(page, order); >>>> + list_add_tail(&page->lru, &isolated_pages); >>>> + isolated_cnt++; >>>> + if (isolated_cnt == page_hitning_conf->max_pages) { >>>> + page_hitning_conf->hint_pages(&isolated_pages); >>>> + release_buddy_pages(&isolated_pages); >>>> + isolated_cnt = 0; >>>> + } >>>> + } >>>> + start = set_bit + 1; >>>> + } >>>> + if (isolated_cnt) { >>>> + page_hitning_conf->hint_pages(&isolated_pages); >>>> + release_buddy_pages(&isolated_pages); >>>> + } >>>> +} >>>> + >>> I really worry that this loop is going to become more expensive as the >>> size of memory increases. For example if we hint on just 16 pages we >>> would have to walk something like 4K bits, 512 longs, if a system had >>> 64G of memory. Have you considered testing with a larger memory >>> footprint to see if it has an impact on performance? >> I am hoping this will be noticeable in will-it-scale's page_fault1, if I >> run it on a larger system? > What you will probably see is that the CPU that is running the scan is > going to be sitting at somewhere near 100% because I cannot see how it > can hope to stay efficient if it has to check something like 512 64b > longs searching for just a handful of idle pages. > >>>> +static void init_hinting_wq(struct work_struct *work) >>>> +{ >>>> + int zone_idx, free_pages; >>>> + >>>> + atomic_set(&page_hinting_active, 1); >>>> + for (zone_idx = 0; zone_idx < MAX_NR_ZONES; zone_idx++) { >>>> + free_pages = atomic_read(&free_area[zone_idx].free_pages); >>>> + if (free_pages >= page_hitning_conf->max_pages) >>>> + scan_zone_free_area(zone_idx, free_pages); >>>> + } >>>> + atomic_set(&page_hinting_active, 0); >>>> +} >>>> + >>>> +void page_hinting_enqueue(struct page *page, int order) >>>> +{ >>>> + int zone_idx; >>>> + >>>> + if (!page_hitning_conf || order < PAGE_HINTING_MIN_ORDER) >>>> + return; >>> I would think it is going to be expensive to be jumping into this >>> function for every freed page. You should probably have an inline >>> taking care of the order check before you even get here since it would >>> be faster that way. >> I see, I can take a look. Thanks. >>>> + >>>> + bm_set_pfn(page); >>>> + if (atomic_read(&page_hinting_active)) >>>> + return; >>> So I would think this piece is racy. 
Specifically if you set a PFN >>> that is somewhere below the PFN you are currently processing in your >>> scan it is going to remain unset until you have another page freed >>> after the scan is completed. I would worry you can end up with a batch >>> free of memory resulting in a group of pages sitting at the start of >>> your bitmap unhinted. >> True, but that will be hinted next time threshold is met. > Yes, but that assumes that there is another free immediately coming. > It is possible that you have a big application run and then > immediately shut down and have it free all its memory at once. Worst > case scenario would be that it starts by freeing from the end and > works toward the start. With that you could theoretically end up with > a significant chunk of memory waiting some time for another big free > to come along. Any suggestion on some benchmark/test application which I could run to see this kind of behavior?
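For illustration, the cleanup changes acknowledged above (skip zones whose bitmap was never allocated and reset the pointer after freeing it) could look roughly like this - an untested sketch based on the free_area_cleanup() in this patch:

void free_area_cleanup(int nr_zones)
{
        int zone_idx;

        for (zone_idx = 0; zone_idx < nr_zones; zone_idx++) {
                /* Nothing allocated for this zone, or it was already freed. */
                if (!free_area[zone_idx].bitmap)
                        continue;
                bitmap_free(free_area[zone_idx].bitmap);
                /* Reset the pointer so a double free is caught by the check above. */
                free_area[zone_idx].bitmap = NULL;
                free_area[zone_idx].base_pfn = 0;
                free_area[zone_idx].end_pfn = 0;
                free_area[zone_idx].nbits = 0;
                atomic_set(&free_area[zone_idx].free_pages, 0);
        }
}

With the NULL check in place, the nr_zones argument could also simply become MAX_NR_ZONES, as suggested earlier in the thread.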
On Thu, Jul 11, 2019 at 6:13 PM Nitesh Narayan Lal <nitesh@redhat.com> wrote: > > > On 7/11/19 7:20 PM, Alexander Duyck wrote: > > On Thu, Jul 11, 2019 at 10:58 AM Nitesh Narayan Lal <nitesh@redhat.com> wrote: > >> > >> On 7/10/19 5:56 PM, Alexander Duyck wrote: > >>> On Wed, Jul 10, 2019 at 12:52 PM Nitesh Narayan Lal <nitesh@redhat.com> wrote: > >>>> This patch introduces the core infrastructure for free page hinting in > >>>> virtual environments. It enables the kernel to track the free pages which > >>>> can be reported to its hypervisor so that the hypervisor could > >>>> free and reuse that memory as per its requirement. > >>>> > >>>> While the pages are getting processed in the hypervisor (e.g., > >>>> via MADV_FREE), the guest must not use them, otherwise, data loss > >>>> would be possible. To avoid such a situation, these pages are > >>>> temporarily removed from the buddy. The amount of pages removed > >>>> temporarily from the buddy is governed by the backend(virtio-balloon > >>>> in our case). > >>>> > >>>> To efficiently identify free pages that can to be hinted to the > >>>> hypervisor, bitmaps in a coarse granularity are used. Only fairly big > >>>> chunks are reported to the hypervisor - especially, to not break up THP > >>>> in the hypervisor - "MAX_ORDER - 2" on x86, and to save space. The bits > >>>> in the bitmap are an indication whether a page *might* be free, not a > >>>> guarantee. A new hook after buddy merging sets the bits. > >>>> > >>>> Bitmaps are stored per zone, protected by the zone lock. A workqueue > >>>> asynchronously processes the bitmaps, trying to isolate and report pages > >>>> that are still free. The backend (virtio-balloon) is responsible for > >>>> reporting these batched pages to the host synchronously. Once reporting/ > >>>> freeing is complete, isolated pages are returned back to the buddy. > >>>> > >>>> There are still various things to look into (e.g., memory hotplug, more > >>>> efficient locking, possible races when disabling). > >>>> > >>>> Signed-off-by: Nitesh Narayan Lal <nitesh@redhat.com> > > So just FYI, I thought I would try the patches. It looks like there > > might be a bug somewhere that is causing it to free memory it > > shouldn't be. After about 10 minutes my VM crashed with a system log > > full of various NULL pointer dereferences. > > That's interesting, I have tried the patches with MADV_DONTNEED as well. > I just retried it but didn't see any crash. May I know what kind of > workload you are running? I was running the page_fault1 test on a VM with 80G of memory. > > The only change I had made > > is to use MADV_DONTNEED instead of MADV_FREE in QEMU since my headers > > didn't have MADV_FREE on the host. It occurs to me one advantage of > > MADV_DONTNEED over MADV_FREE is that you are more likely to catch > > these sort of errors since it zeros the pages instead of leaving them > > intact. > For development purpose maybe. For the final patch-set I think we > discussed earlier why we should keep MADV_FREE. I'm still not convinced MADV_FREE is a net win, at least for performance. You are still paying the cost for the VMEXIT in order to regain ownership of the page. In the case that you are under memory pressure it is essentially equivalent to MADV_DONTNEED. Also it doesn't really do much to help with the memory footprint of the VM itself. 
With the MADV_DONTNEED the pages are freed back and you have a greater likelihood of reducing the overall memory footprint of the entire system since you would be more likely to be assigned pages that were recently used rather than having to access a cold page. <snip> > >>>> +void page_hinting_enqueue(struct page *page, int order) > >>>> +{ > >>>> + int zone_idx; > >>>> + > >>>> + if (!page_hitning_conf || order < PAGE_HINTING_MIN_ORDER) > >>>> + return; > >>> I would think it is going to be expensive to be jumping into this > >>> function for every freed page. You should probably have an inline > >>> taking care of the order check before you even get here since it would > >>> be faster that way. > >> I see, I can take a look. Thanks. > >>>> + > >>>> + bm_set_pfn(page); > >>>> + if (atomic_read(&page_hinting_active)) > >>>> + return; > >>> So I would think this piece is racy. Specifically if you set a PFN > >>> that is somewhere below the PFN you are currently processing in your > >>> scan it is going to remain unset until you have another page freed > >>> after the scan is completed. I would worry you can end up with a batch > >>> free of memory resulting in a group of pages sitting at the start of > >>> your bitmap unhinted. > >> True, but that will be hinted next time threshold is met. > > Yes, but that assumes that there is another free immediately coming. > > It is possible that you have a big application run and then > > immediately shut down and have it free all its memory at once. Worst > > case scenario would be that it starts by freeing from the end and > > works toward the start. With that you could theoretically end up with > > a significant chunk of memory waiting some time for another big free > > to come along. > > Any suggestion on some benchmark/test application which I could run to > see this kind of behavior? Like I mentioned before, try doing a VM with a bigger memory footprint. You could probably just do a stack of VMs like what we were doing with the memhog test. Basically the longer it takes to process all the pages the greater the likelihood that there are still pages left when they are freed.
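For reference, the host-side difference the thread keeps coming back to boils down to which madvise() flavour the backend issues on the hinted range. A minimal userspace sketch (illustrative only, not QEMU's actual code; hva/len stand for the host mapping of the hinted guest range):

#include <stddef.h>
#include <sys/mman.h>

static int hint_range(void *hva, size_t len, int lazy)
{
        /*
         * MADV_DONTNEED: the backing pages are dropped right away and the
         * guest reads zero-filled pages afterwards, so use-after-hint bugs
         * surface quickly.
         * MADV_FREE: the pages are only reclaimed under memory pressure,
         * their contents may survive until then, and a later write cancels
         * the lazy free.
         */
        return madvise(hva, len, lazy ? MADV_FREE : MADV_DONTNEED);
}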
On 7/12/19 12:22 PM, Alexander Duyck wrote: > On Thu, Jul 11, 2019 at 6:13 PM Nitesh Narayan Lal <nitesh@redhat.com> wrote: >> >> On 7/11/19 7:20 PM, Alexander Duyck wrote: >>> On Thu, Jul 11, 2019 at 10:58 AM Nitesh Narayan Lal <nitesh@redhat.com> wrote: >>>> On 7/10/19 5:56 PM, Alexander Duyck wrote: >>>>> On Wed, Jul 10, 2019 at 12:52 PM Nitesh Narayan Lal <nitesh@redhat.com> wrote: >>>>>> This patch introduces the core infrastructure for free page hinting in >>>>>> virtual environments. It enables the kernel to track the free pages which >>>>>> can be reported to its hypervisor so that the hypervisor could >>>>>> free and reuse that memory as per its requirement. >>>>>> >>>>>> While the pages are getting processed in the hypervisor (e.g., >>>>>> via MADV_FREE), the guest must not use them, otherwise, data loss >>>>>> would be possible. To avoid such a situation, these pages are >>>>>> temporarily removed from the buddy. The amount of pages removed >>>>>> temporarily from the buddy is governed by the backend(virtio-balloon >>>>>> in our case). >>>>>> >>>>>> To efficiently identify free pages that can to be hinted to the >>>>>> hypervisor, bitmaps in a coarse granularity are used. Only fairly big >>>>>> chunks are reported to the hypervisor - especially, to not break up THP >>>>>> in the hypervisor - "MAX_ORDER - 2" on x86, and to save space. The bits >>>>>> in the bitmap are an indication whether a page *might* be free, not a >>>>>> guarantee. A new hook after buddy merging sets the bits. >>>>>> >>>>>> Bitmaps are stored per zone, protected by the zone lock. A workqueue >>>>>> asynchronously processes the bitmaps, trying to isolate and report pages >>>>>> that are still free. The backend (virtio-balloon) is responsible for >>>>>> reporting these batched pages to the host synchronously. Once reporting/ >>>>>> freeing is complete, isolated pages are returned back to the buddy. >>>>>> >>>>>> There are still various things to look into (e.g., memory hotplug, more >>>>>> efficient locking, possible races when disabling). >>>>>> >>>>>> Signed-off-by: Nitesh Narayan Lal <nitesh@redhat.com> >>> So just FYI, I thought I would try the patches. It looks like there >>> might be a bug somewhere that is causing it to free memory it >>> shouldn't be. After about 10 minutes my VM crashed with a system log >>> full of various NULL pointer dereferences. >> That's interesting, I have tried the patches with MADV_DONTNEED as well. >> I just retried it but didn't see any crash. May I know what kind of >> workload you are running? > I was running the page_fault1 test on a VM with 80G of memory. > >>> The only change I had made >>> is to use MADV_DONTNEED instead of MADV_FREE in QEMU since my headers >>> didn't have MADV_FREE on the host. It occurs to me one advantage of >>> MADV_DONTNEED over MADV_FREE is that you are more likely to catch >>> these sort of errors since it zeros the pages instead of leaving them >>> intact. >> For development purpose maybe. For the final patch-set I think we >> discussed earlier why we should keep MADV_FREE. > I'm still not convinced MADV_FREE is a net win, at least for > performance. You are still paying the cost for the VMEXIT in order to > regain ownership of the page. In the case that you are under memory > pressure it is essentially equivalent to MADV_DONTNEED. Also it > doesn't really do much to help with the memory footprint of the VM > itself. 
With the MADV_DONTNEED the pages are freed back and you have a > greater liklihood of reducing the overall memory footprint of the > entire system since you would be more likely to be assigned pages that > were recently used rather than having to access a cold page. > > <snip> > >>>>>> +void page_hinting_enqueue(struct page *page, int order) >>>>>> +{ >>>>>> + int zone_idx; >>>>>> + >>>>>> + if (!page_hitning_conf || order < PAGE_HINTING_MIN_ORDER) >>>>>> + return; >>>>> I would think it is going to be expensive to be jumping into this >>>>> function for every freed page. You should probably have an inline >>>>> taking care of the order check before you even get here since it would >>>>> be faster that way. >>>> I see, I can take a look. Thanks. >>>>>> + >>>>>> + bm_set_pfn(page); >>>>>> + if (atomic_read(&page_hinting_active)) >>>>>> + return; >>>>> So I would think this piece is racy. Specifically if you set a PFN >>>>> that is somewhere below the PFN you are currently processing in your >>>>> scan it is going to remain unset until you have another page freed >>>>> after the scan is completed. I would worry you can end up with a batch >>>>> free of memory resulting in a group of pages sitting at the start of >>>>> your bitmap unhinted. >>>> True, but that will be hinted next time threshold is met. >>> Yes, but that assumes that there is another free immediately coming. >>> It is possible that you have a big application run and then >>> immediately shut down and have it free all its memory at once. Worst >>> case scenario would be that it starts by freeing from the end and >>> works toward the start. With that you could theoretically end up with >>> a significant chunk of memory waiting some time for another big free >>> to come along. >> Any suggestion on some benchmark/test application which I could run to >> see this kind of behavior? > Like I mentioned before, try doing a VM with a bigger memory > footprint. You could probably just do a stack of VMs like what we were > doing with the memhog test. Basically the longer it takes to process > all the pages the greater the liklihood that there are still pages > left when they are freed. Thanks. Before next posting I will make sure to test with a larger VM (>64GB).
On 10.07.19 22:45, Dave Hansen wrote: > On 7/10/19 12:51 PM, Nitesh Narayan Lal wrote: >> +struct zone_free_area { >> + unsigned long *bitmap; >> + unsigned long base_pfn; >> + unsigned long end_pfn; >> + atomic_t free_pages; >> + unsigned long nbits; >> +} free_area[MAX_NR_ZONES]; > > Why do we need an extra data structure. What's wrong with putting > per-zone data in ... 'struct zone'? The cover letter claims that it > doesn't touch core-mm infrastructure, but if it depends on mechanisms > like this, I think that's a very bad thing. > > To be honest, I'm not sure this series is worth reviewing at this point. > It's horribly lightly commented and full of kernel antipatterns lik > > void func() > { > if () { > ... indent entire logic > ... of function > } > } "full of". Hmm. > > It has big "TODO"s. It's virtually comment-free. I'm shocked it's at > the 11th version and still looking like this. > >> + >> + for (zone_idx = 0; zone_idx < MAX_NR_ZONES; zone_idx++) { >> + unsigned long pages = free_area[zone_idx].end_pfn - >> + free_area[zone_idx].base_pfn; >> + bitmap_size = (pages >> PAGE_HINTING_MIN_ORDER) + 1; >> + if (!bitmap_size) >> + continue; >> + free_area[zone_idx].bitmap = bitmap_zalloc(bitmap_size, >> + GFP_KERNEL); > > This doesn't support sparse zones. We can have zones with massive > spanned page sizes, but very few present pages. On those zones, this > will exhaust memory for no good reason. Yes, AFAIKS, sparse zones are problematic when we have NORMAL/MOVABLE mixed. 1 bit for 2MB, 1 byte for 16MB, 64 bytes for 1GB IOW, this isn't optimal but only really problematic for big systems / very huge sparse zones. > > Comparing this to Alex's patch set, it's of much lower quality and at a > much earlier stage of development. The two sets are not really even > comparable right now. This certainly doesn't sell me on (or even really To be honest, I find this statement quite harsh. Nitesh's hard work in the previous RFC's and many discussions with Alex essentially resulted in the two approaches we have right now. Alex's approach would not look the way it looks today without Nitesh's RFCs. So much to that. > enumerate the deltas in) this approach vs. Alex's. I am aware that memory hotplug is not properly supported yet (future work). Sparse zones work but eventually waste a handful of pages (!) - future work. Anything else you are aware of that is missing? My opinion: 1. Alex' solution is clearly beneficial, as we don't need to manage/scan a bitmap. *however* we were concerned right from the beginning if core-buddy modifications will be accepted upstream for a purely virtualization-specific (as of now!) feature. If we can get it upstream, perfect. Back when we discussed the idea with Alex I was skeptical - I was expecting way more core modifications. 2. We were looking for an alternative solution that doesn't require to modify the buddy. We have that now - yes, some things have to be worked out and cleaned up, not arguing against that. A cleaned-up version of this RFC with some fixes and enhancements should be ready to be used in *many* (not all) setups. Which is perfectly fine. So in summary, I think we should try our best to get Alex's series into shape and accepted upstream. However, if we get upstream resistance or it will take ages to get it in, I think we can start with this series here (which requires no major buddy modifications as of now) and the slowly see if we can convert it into Alex approach. 
The important part for me is that the core<->driver interface and the virtio interface are in a clean shape, so we can essentially swap out the implementation-specific parts in the core. Cheers.
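To make Dave's per-zone point concrete, the tracking state could in principle hang off each struct zone rather than live in the shared free_area[MAX_NR_ZONES] array. A sketch only, not part of this series; base_pfn/end_pfn go away because zone_start_pfn and spanned_pages already describe the span per zone:

/* Illustrative only: one instance per struct zone, not per zone type. */
struct zone_hint_state {
        unsigned long *bitmap;   /* 1 bit per PAGE_HINTING_MIN_ORDER chunk */
        unsigned long nbits;
        atomic_t free_pages;
};

That would also sidestep the min()/max() merging of spans from different nodes, at the cost of touching struct zone itself.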
On 11.07.19 20:21, Dave Hansen wrote: > On 7/10/19 12:51 PM, Nitesh Narayan Lal wrote: >> +static void bm_set_pfn(struct page *page) >> +{ >> + struct zone *zone = page_zone(page); >> + int zone_idx = page_zonenum(page); >> + unsigned long bitnr = 0; >> + >> + lockdep_assert_held(&zone->lock); >> + bitnr = pfn_to_bit(page, zone_idx); >> + /* >> + * TODO: fix possible underflows. >> + */ >> + if (free_area[zone_idx].bitmap && >> + bitnr < free_area[zone_idx].nbits && >> + !test_and_set_bit(bitnr, free_area[zone_idx].bitmap)) >> + atomic_inc(&free_area[zone_idx].free_pages); >> +} > > Let's say I have two NUMA nodes, each with ZONE_NORMAL and ZONE_MOVABLE > and each zone with 1GB of memory: > > Node: 0 1 > NORMAL 0->1GB 2->3GB > MOVABLE 1->2GB 3->4GB > > This code will allocate two bitmaps. The ZONE_NORMAL bitmap will > represent data from 0->3GB and the ZONE_MOVABLE bitmap will represent > data from 1->4GB. That's the result of this code: > >> + if (free_area[zone_idx].base_pfn) { >> + free_area[zone_idx].base_pfn = >> + min(free_area[zone_idx].base_pfn, >> + zone->zone_start_pfn); >> + free_area[zone_idx].end_pfn = >> + max(free_area[zone_idx].end_pfn, >> + zone->zone_start_pfn + >> + zone->spanned_pages); > > But that means that both bitmaps will have space for PFNs in the other > zone type, which is completely bogus. This is fundamental because the > data structures are incorrectly built per zone *type* instead of per zone. > I don't think it's incorrect, it's just not optimal in all scenarios. E.g., in your example, this approach would "waste" 2 * 1GB of tracking data for the holes (2 * 64 bytes when using 1 bit for 2MB). FWIW, this is not a numa-specific thingy. We can have sparse zones easily on single-numa systems. Node: 0 NORMAL 0->1GB, 2->3GB MOVABLE 1->2GB, 3->4GB So tracking it per zone instead of per zone type is only one part of the story.
On 15.07.19 11:33, David Hildenbrand wrote: > On 11.07.19 20:21, Dave Hansen wrote: >> On 7/10/19 12:51 PM, Nitesh Narayan Lal wrote: >>> +static void bm_set_pfn(struct page *page) >>> +{ >>> + struct zone *zone = page_zone(page); >>> + int zone_idx = page_zonenum(page); >>> + unsigned long bitnr = 0; >>> + >>> + lockdep_assert_held(&zone->lock); >>> + bitnr = pfn_to_bit(page, zone_idx); >>> + /* >>> + * TODO: fix possible underflows. >>> + */ >>> + if (free_area[zone_idx].bitmap && >>> + bitnr < free_area[zone_idx].nbits && >>> + !test_and_set_bit(bitnr, free_area[zone_idx].bitmap)) >>> + atomic_inc(&free_area[zone_idx].free_pages); >>> +} >> >> Let's say I have two NUMA nodes, each with ZONE_NORMAL and ZONE_MOVABLE >> and each zone with 1GB of memory: >> >> Node: 0 1 >> NORMAL 0->1GB 2->3GB >> MOVABLE 1->2GB 3->4GB >> >> This code will allocate two bitmaps. The ZONE_NORMAL bitmap will >> represent data from 0->3GB and the ZONE_MOVABLE bitmap will represent >> data from 1->4GB. That's the result of this code: >> >>> + if (free_area[zone_idx].base_pfn) { >>> + free_area[zone_idx].base_pfn = >>> + min(free_area[zone_idx].base_pfn, >>> + zone->zone_start_pfn); >>> + free_area[zone_idx].end_pfn = >>> + max(free_area[zone_idx].end_pfn, >>> + zone->zone_start_pfn + >>> + zone->spanned_pages); >> >> But that means that both bitmaps will have space for PFNs in the other >> zone type, which is completely bogus. This is fundamental because the >> data structures are incorrectly built per zone *type* instead of per zone. >> > > I don't think it's incorrect, it's just not optimal in all scenarios. > E.g., in you example, this approach would "waste" 2 * 1GB of tracking > data for the wholes (2* 64bytes when using 1 bit for 2MB). > > FWIW, this is not a numa-specific thingy. We can have sparse zones > easily on single-numa systems. > > Node: 0 > NORMAL 0->1GB, 2->3GB > MOVABLE 1->2GB, 3->4GB > > So tracking it per zones instead instead of zone type is only one part > of the story. > Oh, and FWIW, in setups like Node: 0 1 NORMAL 4->5GB, 6->7GB 5->6GB, 8->9GB What Nitesh proposes is actually better. So it really depends on the use case - but in general sparsity is the issue.
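For reference, the waste being discussed works out as follows at the x86 granularity used here (1 bit per 2 MB chunk): a 1 GB hole in the spanned range costs 1024 MB / 2 MB = 512 bits = 64 bytes of bitmap, i.e. the "2 * 64 bytes" above for the two 1 GB holes, and roughly 64 KB of bitmap per 1 TB of spanned-but-absent memory in the worst case.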
On 7/12/19 12:22 PM, Alexander Duyck wrote: > On Thu, Jul 11, 2019 at 6:13 PM Nitesh Narayan Lal <nitesh@redhat.com> wrote: >> >> On 7/11/19 7:20 PM, Alexander Duyck wrote: >>> On Thu, Jul 11, 2019 at 10:58 AM Nitesh Narayan Lal <nitesh@redhat.com> wrote: >>>> On 7/10/19 5:56 PM, Alexander Duyck wrote: >>>>> On Wed, Jul 10, 2019 at 12:52 PM Nitesh Narayan Lal <nitesh@redhat.com> wrote: >>>>>> This patch introduces the core infrastructure for free page hinting in >>>>>> virtual environments. It enables the kernel to track the free pages which >>>>>> can be reported to its hypervisor so that the hypervisor could >>>>>> free and reuse that memory as per its requirement. >>>>>> >>>>>> While the pages are getting processed in the hypervisor (e.g., >>>>>> via MADV_FREE), the guest must not use them, otherwise, data loss >>>>>> would be possible. To avoid such a situation, these pages are >>>>>> temporarily removed from the buddy. The amount of pages removed >>>>>> temporarily from the buddy is governed by the backend(virtio-balloon >>>>>> in our case). >>>>>> >>>>>> To efficiently identify free pages that can to be hinted to the >>>>>> hypervisor, bitmaps in a coarse granularity are used. Only fairly big >>>>>> chunks are reported to the hypervisor - especially, to not break up THP >>>>>> in the hypervisor - "MAX_ORDER - 2" on x86, and to save space. The bits >>>>>> in the bitmap are an indication whether a page *might* be free, not a >>>>>> guarantee. A new hook after buddy merging sets the bits. >>>>>> >>>>>> Bitmaps are stored per zone, protected by the zone lock. A workqueue >>>>>> asynchronously processes the bitmaps, trying to isolate and report pages >>>>>> that are still free. The backend (virtio-balloon) is responsible for >>>>>> reporting these batched pages to the host synchronously. Once reporting/ >>>>>> freeing is complete, isolated pages are returned back to the buddy. >>>>>> >>>>>> There are still various things to look into (e.g., memory hotplug, more >>>>>> efficient locking, possible races when disabling). >>>>>> >>>>>> Signed-off-by: Nitesh Narayan Lal <nitesh@redhat.com> >>> So just FYI, I thought I would try the patches. It looks like there >>> might be a bug somewhere that is causing it to free memory it >>> shouldn't be. After about 10 minutes my VM crashed with a system log >>> full of various NULL pointer dereferences. >> That's interesting, I have tried the patches with MADV_DONTNEED as well. >> I just retried it but didn't see any crash. May I know what kind of >> workload you are running? > I was running the page_fault1 test on a VM with 80G of memory. > >>> The only change I had made >>> is to use MADV_DONTNEED instead of MADV_FREE in QEMU since my headers >>> didn't have MADV_FREE on the host. It occurs to me one advantage of >>> MADV_DONTNEED over MADV_FREE is that you are more likely to catch >>> these sort of errors since it zeros the pages instead of leaving them >>> intact. >> For development purpose maybe. For the final patch-set I think we >> discussed earlier why we should keep MADV_FREE. > I'm still not convinced MADV_FREE is a net win, at least for > performance. You are still paying the cost for the VMEXIT in order to > regain ownership of the page. In the case that you are under memory > pressure it is essentially equivalent to MADV_DONTNEED. Also it > doesn't really do much to help with the memory footprint of the VM > itself. 
With the MADV_DONTNEED the pages are freed back and you have a > greater liklihood of reducing the overall memory footprint of the > entire system since you would be more likely to be assigned pages that > were recently used rather than having to access a cold page. I was able to reproduce this bug and have fixed it. I tried testing the fix by running will-it-scale/page_fault1 for around 12 hours. For now, I have also moved to MADV_DONTNEED. > <snip> > >>>>>> +void page_hinting_enqueue(struct page *page, int order) >>>>>> +{ >>>>>> + int zone_idx; >>>>>> + >>>>>> + if (!page_hitning_conf || order < PAGE_HINTING_MIN_ORDER) >>>>>> + return; >>>>> I would think it is going to be expensive to be jumping into this >>>>> function for every freed page. You should probably have an inline >>>>> taking care of the order check before you even get here since it would >>>>> be faster that way. >>>> I see, I can take a look. Thanks. >>>>>> + >>>>>> + bm_set_pfn(page); >>>>>> + if (atomic_read(&page_hinting_active)) >>>>>> + return; >>>>> So I would think this piece is racy. Specifically if you set a PFN >>>>> that is somewhere below the PFN you are currently processing in your >>>>> scan it is going to remain unset until you have another page freed >>>>> after the scan is completed. I would worry you can end up with a batch >>>>> free of memory resulting in a group of pages sitting at the start of >>>>> your bitmap unhinted. >>>> True, but that will be hinted next time threshold is met. >>> Yes, but that assumes that there is another free immediately coming. >>> It is possible that you have a big application run and then >>> immediately shut down and have it free all its memory at once. Worst >>> case scenario would be that it starts by freeing from the end and >>> works toward the start. With that you could theoretically end up with >>> a significant chunk of memory waiting some time for another big free >>> to come along. >> Any suggestion on some benchmark/test application which I could run to >> see this kind of behavior? > Like I mentioned before, try doing a VM with a bigger memory > footprint. You could probably just do a stack of VMs like what we were > doing with the memhog test. Basically the longer it takes to process > all the pages the greater the liklihood that there are still pages > left when they are freed.
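A toy reproducer for the "one big free at the end" pattern being discussed - an illustration only, not the actual memhog tool: fault in a large anonymous buffer inside the guest, then release it in one go, so a large burst of pages returns to the buddy (and becomes hint candidates) at once.

#include <stdlib.h>
#include <string.h>

int main(void)
{
        size_t sz = 4UL << 30;          /* fault in 4 GB of anonymous memory */
        char *buf = malloc(sz);

        if (!buf)
                return 1;
        memset(buf, 1, sz);             /* touch every page */
        free(buf);                      /* one big free back to the kernel */
        return 0;
}

Scaling the buffer up, or stacking several such guests, stretches out how long the scan takes relative to the burst of frees, which is the window the race above is about.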
diff --git a/include/linux/page_hinting.h b/include/linux/page_hinting.h new file mode 100644 index 000000000000..4900feb796f9 --- /dev/null +++ b/include/linux/page_hinting.h @@ -0,0 +1,45 @@ +/* SPDX-License-Identifier: GPL-2.0 */ +#ifndef _LINUX_PAGE_HINTING_H +#define _LINUX_PAGE_HINTING_H + +/* + * Minimum page order required for a page to be hinted to the host. + */ +#define PAGE_HINTING_MIN_ORDER (MAX_ORDER - 2) + +/* + * struct page_hinting_config: holds the information supplied by the balloon + * device to page hinting. + * @hint_pages: Callback which reports the isolated pages + * synchornously to the host. + * @max_pages: Maxmimum pages that are going to be hinted to the host + * at a time of granularity >= PAGE_HINTING_MIN_ORDER. + */ +struct page_hinting_config { + void (*hint_pages)(struct list_head *list); + int max_pages; +}; + +extern int __isolate_free_page(struct page *page, unsigned int order); +extern void __free_one_page(struct page *page, unsigned long pfn, + struct zone *zone, unsigned int order, + int migratetype, bool hint); +#ifdef CONFIG_PAGE_HINTING +void page_hinting_enqueue(struct page *page, int order); +int page_hinting_enable(const struct page_hinting_config *conf); +void page_hinting_disable(void); +#else +static inline void page_hinting_enqueue(struct page *page, int order) +{ +} + +static inline int page_hinting_enable(const struct page_hinting_config *conf) +{ + return -EOPNOTSUPP; +} + +static inline void page_hinting_disable(void) +{ +} +#endif +#endif /* _LINUX_PAGE_HINTING_H */ diff --git a/mm/Kconfig b/mm/Kconfig index f0c76ba47695..e97fab429d9b 100644 --- a/mm/Kconfig +++ b/mm/Kconfig @@ -765,4 +765,10 @@ config GUP_BENCHMARK config ARCH_HAS_PTE_SPECIAL bool +# PAGE_HINTING will allow the guest to report the free pages to the +# host in fixed chunks as soon as the threshold is reached. 
+config PAGE_HINTING + bool + def_bool n + depends on X86_64 endmenu diff --git a/mm/Makefile b/mm/Makefile index ac5e5ba78874..73be49177656 100644 --- a/mm/Makefile +++ b/mm/Makefile @@ -94,6 +94,7 @@ obj-$(CONFIG_Z3FOLD) += z3fold.o obj-$(CONFIG_GENERIC_EARLY_IOREMAP) += early_ioremap.o obj-$(CONFIG_CMA) += cma.o obj-$(CONFIG_MEMORY_BALLOON) += balloon_compaction.o +obj-$(CONFIG_PAGE_HINTING) += page_hinting.o obj-$(CONFIG_PAGE_EXTENSION) += page_ext.o obj-$(CONFIG_CMA_DEBUGFS) += cma_debug.o obj-$(CONFIG_USERFAULTFD) += userfaultfd.o diff --git a/mm/page_alloc.c b/mm/page_alloc.c index d66bc8abe0af..8a44338bd04e 100644 --- a/mm/page_alloc.c +++ b/mm/page_alloc.c @@ -69,6 +69,7 @@ #include <linux/lockdep.h> #include <linux/nmi.h> #include <linux/psi.h> +#include <linux/page_hinting.h> #include <asm/sections.h> #include <asm/tlbflush.h> @@ -874,10 +875,10 @@ compaction_capture(struct capture_control *capc, struct page *page, * -- nyc */ -static inline void __free_one_page(struct page *page, +inline void __free_one_page(struct page *page, unsigned long pfn, struct zone *zone, unsigned int order, - int migratetype) + int migratetype, bool hint) { unsigned long combined_pfn; unsigned long uninitialized_var(buddy_pfn); @@ -980,7 +981,8 @@ static inline void __free_one_page(struct page *page, migratetype); else add_to_free_area(page, &zone->free_area[order], migratetype); - + if (hint) + page_hinting_enqueue(page, order); } /* @@ -1263,7 +1265,7 @@ static void free_pcppages_bulk(struct zone *zone, int count, if (unlikely(isolated_pageblocks)) mt = get_pageblock_migratetype(page); - __free_one_page(page, page_to_pfn(page), zone, 0, mt); + __free_one_page(page, page_to_pfn(page), zone, 0, mt, true); trace_mm_page_pcpu_drain(page, 0, mt); } spin_unlock(&zone->lock); @@ -1272,14 +1274,14 @@ static void free_pcppages_bulk(struct zone *zone, int count, static void free_one_page(struct zone *zone, struct page *page, unsigned long pfn, unsigned int order, - int migratetype) + int migratetype, bool hint) { spin_lock(&zone->lock); if (unlikely(has_isolate_pageblock(zone) || is_migrate_isolate(migratetype))) { migratetype = get_pfnblock_migratetype(page, pfn); } - __free_one_page(page, pfn, zone, order, migratetype); + __free_one_page(page, pfn, zone, order, migratetype, hint); spin_unlock(&zone->lock); } @@ -1369,7 +1371,7 @@ static void __free_pages_ok(struct page *page, unsigned int order) migratetype = get_pfnblock_migratetype(page, pfn); local_irq_save(flags); __count_vm_events(PGFREE, 1 << order); - free_one_page(page_zone(page), page, pfn, order, migratetype); + free_one_page(page_zone(page), page, pfn, order, migratetype, true); local_irq_restore(flags); } @@ -2969,7 +2971,7 @@ static void free_unref_page_commit(struct page *page, unsigned long pfn) */ if (migratetype >= MIGRATE_PCPTYPES) { if (unlikely(is_migrate_isolate(migratetype))) { - free_one_page(zone, page, pfn, 0, migratetype); + free_one_page(zone, page, pfn, 0, migratetype, true); return; } migratetype = MIGRATE_MOVABLE; diff --git a/mm/page_hinting.c b/mm/page_hinting.c new file mode 100644 index 000000000000..0bfa09f8c3ed --- /dev/null +++ b/mm/page_hinting.c @@ -0,0 +1,250 @@ +// SPDX-License-Identifier: GPL-2.0 +/* + * Page hinting core infrastructure to enable a VM to report free pages to its + * hypervisor. + * + * Copyright Red Hat, Inc. 
2019 + * + * Author(s): Nitesh Narayan Lal <nitesh@redhat.com> + */ + +#include <linux/mm.h> +#include <linux/slab.h> +#include <linux/page_hinting.h> +#include <linux/kvm_host.h> + +/* + * struct zone_free_area: For a single zone across NUMA nodes, it holds the + * bitmap pointer to track the free pages and other required parameters + * used to recover these pages by scanning the bitmap. + * @bitmap: Pointer to the bitmap in PAGE_HINTING_MIN_ORDER + * granularity. + * @base_pfn: Starting PFN value for the zone whose bitmap is stored. + * @end_pfn: Indicates the last PFN value for the zone. + * @free_pages: Tracks the number of free pages of granularity + * PAGE_HINTING_MIN_ORDER. + * @nbits: Indicates the total size of the bitmap in bits allocated + * at the time of initialization. + */ +struct zone_free_area { + unsigned long *bitmap; + unsigned long base_pfn; + unsigned long end_pfn; + atomic_t free_pages; + unsigned long nbits; +} free_area[MAX_NR_ZONES]; + +static void init_hinting_wq(struct work_struct *work); +static DEFINE_MUTEX(page_hinting_init); +const struct page_hinting_config *page_hitning_conf; +struct work_struct hinting_work; +atomic_t page_hinting_active; + +void free_area_cleanup(int nr_zones) +{ + int zone_idx; + + for (zone_idx = 0; zone_idx < nr_zones; zone_idx++) { + bitmap_free(free_area[zone_idx].bitmap); + free_area[zone_idx].base_pfn = 0; + free_area[zone_idx].end_pfn = 0; + free_area[zone_idx].nbits = 0; + atomic_set(&free_area[zone_idx].free_pages, 0); + } +} + +int page_hinting_enable(const struct page_hinting_config *conf) +{ + unsigned long bitmap_size = 0; + int zone_idx = 0, ret = -EBUSY; + struct zone *zone; + + mutex_lock(&page_hinting_init); + if (!page_hitning_conf) { + for_each_populated_zone(zone) { + zone_idx = zone_idx(zone); +#ifdef CONFIG_ZONE_DEVICE + if (zone_idx == ZONE_DEVICE) + continue; +#endif + spin_lock(&zone->lock); + if (free_area[zone_idx].base_pfn) { + free_area[zone_idx].base_pfn = + min(free_area[zone_idx].base_pfn, + zone->zone_start_pfn); + free_area[zone_idx].end_pfn = + max(free_area[zone_idx].end_pfn, + zone->zone_start_pfn + + zone->spanned_pages); + } else { + free_area[zone_idx].base_pfn = + zone->zone_start_pfn; + free_area[zone_idx].end_pfn = + zone->zone_start_pfn + + zone->spanned_pages; + } + spin_unlock(&zone->lock); + } + + for (zone_idx = 0; zone_idx < MAX_NR_ZONES; zone_idx++) { + unsigned long pages = free_area[zone_idx].end_pfn - + free_area[zone_idx].base_pfn; + bitmap_size = (pages >> PAGE_HINTING_MIN_ORDER) + 1; + if (!bitmap_size) + continue; + free_area[zone_idx].bitmap = bitmap_zalloc(bitmap_size, + GFP_KERNEL); + if (!free_area[zone_idx].bitmap) { + free_area_cleanup(zone_idx); + mutex_unlock(&page_hinting_init); + return -ENOMEM; + } + free_area[zone_idx].nbits = bitmap_size; + } + page_hitning_conf = conf; + INIT_WORK(&hinting_work, init_hinting_wq); + ret = 0; + } + mutex_unlock(&page_hinting_init); + return ret; +} +EXPORT_SYMBOL_GPL(page_hinting_enable); + +void page_hinting_disable(void) +{ + cancel_work_sync(&hinting_work); + page_hitning_conf = NULL; + free_area_cleanup(MAX_NR_ZONES); +} +EXPORT_SYMBOL_GPL(page_hinting_disable); + +static unsigned long pfn_to_bit(struct page *page, int zone_idx) +{ + unsigned long bitnr; + + bitnr = (page_to_pfn(page) - free_area[zone_idx].base_pfn) + >> PAGE_HINTING_MIN_ORDER; + return bitnr; +} + +static void release_buddy_pages(struct list_head *pages) +{ + int mt = 0, zone_idx, order; + struct page *page, *next; + unsigned long bitnr; + struct zone *zone; + + 
list_for_each_entry_safe(page, next, pages, lru) { + zone_idx = page_zonenum(page); + zone = page_zone(page); + bitnr = pfn_to_bit(page, zone_idx); + spin_lock(&zone->lock); + list_del(&page->lru); + order = page_private(page); + set_page_private(page, 0); + mt = get_pageblock_migratetype(page); + __free_one_page(page, page_to_pfn(page), zone, + order, mt, false); + spin_unlock(&zone->lock); + } +} + +static void bm_set_pfn(struct page *page) +{ + struct zone *zone = page_zone(page); + int zone_idx = page_zonenum(page); + unsigned long bitnr = 0; + + lockdep_assert_held(&zone->lock); + bitnr = pfn_to_bit(page, zone_idx); + /* + * TODO: fix possible underflows. + */ + if (free_area[zone_idx].bitmap && + bitnr < free_area[zone_idx].nbits && + !test_and_set_bit(bitnr, free_area[zone_idx].bitmap)) + atomic_inc(&free_area[zone_idx].free_pages); +} + +static void scan_zone_free_area(int zone_idx, int free_pages) +{ + int ret = 0, order, isolated_cnt = 0; + unsigned long set_bit, start = 0; + LIST_HEAD(isolated_pages); + struct page *page; + struct zone *zone; + + for (;;) { + ret = 0; + set_bit = find_next_bit(free_area[zone_idx].bitmap, + free_area[zone_idx].nbits, start); + if (set_bit >= free_area[zone_idx].nbits) + break; + page = pfn_to_online_page((set_bit << PAGE_HINTING_MIN_ORDER) + + free_area[zone_idx].base_pfn); + if (!page) + continue; + zone = page_zone(page); + spin_lock(&zone->lock); + + if (PageBuddy(page) && page_private(page) >= + PAGE_HINTING_MIN_ORDER) { + order = page_private(page); + ret = __isolate_free_page(page, order); + } + clear_bit(set_bit, free_area[zone_idx].bitmap); + atomic_dec(&free_area[zone_idx].free_pages); + spin_unlock(&zone->lock); + if (ret) { + /* + * restoring page order to use it while releasing + * the pages back to the buddy. + */ + set_page_private(page, order); + list_add_tail(&page->lru, &isolated_pages); + isolated_cnt++; + if (isolated_cnt == page_hitning_conf->max_pages) { + page_hitning_conf->hint_pages(&isolated_pages); + release_buddy_pages(&isolated_pages); + isolated_cnt = 0; + } + } + start = set_bit + 1; + } + if (isolated_cnt) { + page_hitning_conf->hint_pages(&isolated_pages); + release_buddy_pages(&isolated_pages); + } +} + +static void init_hinting_wq(struct work_struct *work) +{ + int zone_idx, free_pages; + + atomic_set(&page_hinting_active, 1); + for (zone_idx = 0; zone_idx < MAX_NR_ZONES; zone_idx++) { + free_pages = atomic_read(&free_area[zone_idx].free_pages); + if (free_pages >= page_hitning_conf->max_pages) + scan_zone_free_area(zone_idx, free_pages); + } + atomic_set(&page_hinting_active, 0); +} + +void page_hinting_enqueue(struct page *page, int order) +{ + int zone_idx; + + if (!page_hitning_conf || order < PAGE_HINTING_MIN_ORDER) + return; + + bm_set_pfn(page); + if (atomic_read(&page_hinting_active)) + return; + zone_idx = zone_idx(page_zone(page)); + if (atomic_read(&free_area[zone_idx].free_pages) >= + page_hitning_conf->max_pages) { + int cpu = smp_processor_id(); + + queue_work_on(cpu, system_wq, &hinting_work); + } +}
This patch introduces the core infrastructure for free page hinting in virtual environments. It enables the kernel to track the free pages which can be reported to its hypervisor so that the hypervisor could free and reuse that memory as per its requirement. While the pages are getting processed in the hypervisor (e.g., via MADV_FREE), the guest must not use them, otherwise, data loss would be possible. To avoid such a situation, these pages are temporarily removed from the buddy. The amount of pages removed temporarily from the buddy is governed by the backend(virtio-balloon in our case). To efficiently identify free pages that can to be hinted to the hypervisor, bitmaps in a coarse granularity are used. Only fairly big chunks are reported to the hypervisor - especially, to not break up THP in the hypervisor - "MAX_ORDER - 2" on x86, and to save space. The bits in the bitmap are an indication whether a page *might* be free, not a guarantee. A new hook after buddy merging sets the bits. Bitmaps are stored per zone, protected by the zone lock. A workqueue asynchronously processes the bitmaps, trying to isolate and report pages that are still free. The backend (virtio-balloon) is responsible for reporting these batched pages to the host synchronously. Once reporting/ freeing is complete, isolated pages are returned back to the buddy. There are still various things to look into (e.g., memory hotplug, more efficient locking, possible races when disabling). Signed-off-by: Nitesh Narayan Lal <nitesh@redhat.com> --- include/linux/page_hinting.h | 45 +++++++ mm/Kconfig | 6 + mm/Makefile | 1 + mm/page_alloc.c | 18 +-- mm/page_hinting.c | 250 +++++++++++++++++++++++++++++++++++ 5 files changed, 312 insertions(+), 8 deletions(-) create mode 100644 include/linux/page_hinting.h create mode 100644 mm/page_hinting.c
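For completeness, a minimal sketch of how a backend would plug into the interface added by this patch. report_chunk_to_host() and the batch size of 16 are made up for illustration; the callback signature and page_hinting_enable()/page_hinting_disable() are the ones from page_hinting.h above:

static void my_hint_pages(struct list_head *pages)
{
        struct page *page;

        /* Each entry is an isolated chunk; page_private() still holds its order. */
        list_for_each_entry(page, pages, lru)
                report_chunk_to_host(page_to_pfn(page),
                                     1UL << page_private(page));
}

static const struct page_hinting_config my_conf = {
        .hint_pages = my_hint_pages,
        .max_pages = 16,
};

static int my_backend_init(void)
{
        /* Fails with -EBUSY if another backend is already registered. */
        return page_hinting_enable(&my_conf);
}

When the backend goes away it would call page_hinting_disable(), which cancels the outstanding work and frees the bitmaps.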