Message ID | 20200211224635.29318.19750.stgit@localhost.localdomain (mailing list archive) |
---|---|
State | New, archived |
Series | mm / virtio: Provide support for free page reporting |
On Tue, Feb 11, 2020 at 02:46:35PM -0800, Alexander Duyck wrote:
> diff --git a/mm/page_reporting.c b/mm/page_reporting.c
> new file mode 100644
> index 000000000000..1047c6872d4f
> --- /dev/null
> +++ b/mm/page_reporting.c
> @@ -0,0 +1,319 @@
> +// SPDX-License-Identifier: GPL-2.0
> +#include <linux/mm.h>
> +#include <linux/mmzone.h>
> +#include <linux/page_reporting.h>
> +#include <linux/gfp.h>
> +#include <linux/export.h>
> +#include <linux/delay.h>
> +#include <linux/scatterlist.h>
> +
> +#include "page_reporting.h"
> +#include "internal.h"
> +
> +#define PAGE_REPORTING_DELAY	(2 * HZ)

I assume there is nothing special about 2 seconds other than "do some
progress every so often".

> +static struct page_reporting_dev_info __rcu *pr_dev_info __read_mostly;
> +
> +enum {
> +	PAGE_REPORTING_IDLE = 0,
> +	PAGE_REPORTING_REQUESTED,
> +	PAGE_REPORTING_ACTIVE
> +};
> +
> +/* request page reporting */
> +static void
> +__page_reporting_request(struct page_reporting_dev_info *prdev)
> +{
> +	unsigned int state;
> +
> +	/* Check to see if we are in desired state */
> +	state = atomic_read(&prdev->state);
> +	if (state == PAGE_REPORTING_REQUESTED)
> +		return;
> +
> +	/*
> +	 * If reporting is already active there is nothing we need to do.
> +	 * Test against 0 as that represents PAGE_REPORTING_IDLE.
> +	 */
> +	state = atomic_xchg(&prdev->state, PAGE_REPORTING_REQUESTED);
> +	if (state != PAGE_REPORTING_IDLE)
> +		return;
> +
> +	/*
> +	 * Delay the start of work to allow a sizable queue to build. For
> +	 * now we are limiting this to running no more than once every
> +	 * couple of seconds.
> +	 */
> +	schedule_delayed_work(&prdev->work, PAGE_REPORTING_DELAY);
> +}

Seems a fair use of atomics.

> +static int
> +page_reporting_cycle(struct page_reporting_dev_info *prdev, struct zone *zone,
> +		     unsigned int order, unsigned int mt,
> +		     struct scatterlist *sgl, unsigned int *offset)
> +{
> +	struct free_area *area = &zone->free_area[order];
> +	struct list_head *list = &area->free_list[mt];
> +	unsigned int page_len = PAGE_SIZE << order;
> +	struct page *page, *next;
> +	int err = 0;
> +
> +	/*
> +	 * Perform early check, if free area is empty there is
> +	 * nothing to process so we can skip this free_list.
> +	 */
> +	if (list_empty(list))
> +		return err;
> +
> +	spin_lock_irq(&zone->lock);
> +
> +	/* loop through free list adding unreported pages to sg list */
> +	list_for_each_entry_safe(page, next, list, lru) {
> +		/* We are going to skip over the reported pages. */
> +		if (PageReported(page))
> +			continue;
> +
> +		/* Attempt to pull page from list */
> +		if (!__isolate_free_page(page, order))
> +			break;
> +

Might want to note that you are breaking because the only reason to fail
the isolation is that watermarks are not met and we are likely under
memory pressure. It's not a big issue.

However, while I think this is correct, it's hard to follow. This loop can
be broken out of with pages still on the scatter gather list. The current
flow guarantees that err will not be set at this point so the caller
cleans it up so we always drain the list either here or in the caller.

While I think it works, it's a bit fragile. I recommend putting a comment
above this noting why it's safe and put a VM_WARN_ON_ONCE(err) before the
break in case someone tries to change this in a years time and does not
spot that the flow to reach page_reporting_drain *somewhere* is critical.

> +		/* Add page to scatter list */
> +		--(*offset);
> +		sg_set_page(&sgl[*offset], page, page_len, 0);
> +
> +		/* If scatterlist isn't full grab more pages */
> +		if (*offset)
> +			continue;
> +
> +		/* release lock before waiting on report processing */
> +		spin_unlock_irq(&zone->lock);
> +
> +		/* begin processing pages in local list */
> +		err = prdev->report(prdev, sgl, PAGE_REPORTING_CAPACITY);
> +
> +		/* reset offset since the full list was reported */
> +		*offset = PAGE_REPORTING_CAPACITY;
> +
> +		/* reacquire zone lock and resume processing */
> +		spin_lock_irq(&zone->lock);
> +
> +		/* flush reported pages from the sg list */
> +		page_reporting_drain(prdev, sgl, PAGE_REPORTING_CAPACITY, !err);
> +
> +		/*
> +		 * Reset next to first entry, the old next isn't valid
> +		 * since we dropped the lock to report the pages
> +		 */
> +		next = list_first_entry(list, struct page, lru);
> +
> +		/* exit on error */
> +		if (err)
> +			break;
> +	}
> +
> +	spin_unlock_irq(&zone->lock);
> +
> +	return err;
> +}

I complained about the use of zone lock before but in this version, I
think I'm ok with it. The lock is held for the free list manipulations
which is what it's for. The state management with atomics seems
reasonable.

Otherwise I think this is ok and I think the implementation right. Of
great importance to me was the allocator fast paths but they seem to be
adequately protected by a static branch so

Acked-by: Mel Gorman <mgorman@techsingularity.net>

The ack applies regardless of whether you decide to document and
defensively protect page_reporting_cycle against losing pages on the
scatter/gather list but I do recommend it.
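For readers following the atomics discussion, the idle/requested/active handshake can be modelled in userspace. This is only a sketch under stated assumptions: it uses C11 `<stdatomic.h>` in place of the kernel's `atomic_t` helpers, and `struct reporting_model`, `model_request()`, `model_work_begin()` and `model_work_end()` are hypothetical names that do not appear in the patch. A counter stands in for `schedule_delayed_work()`.

```c
#include <stdatomic.h>

enum { REPORTING_IDLE = 0, REPORTING_REQUESTED, REPORTING_ACTIVE };

/* Hypothetical stand-in for struct page_reporting_dev_info. */
struct reporting_model {
	atomic_int state;
	int scheduled;	/* times the delayed work would have been queued */
};

/* Models __page_reporting_request(): only IDLE -> REQUESTED queues work. */
static void model_request(struct reporting_model *m)
{
	/* Cheap read first so repeated requests avoid the exchange. */
	if (atomic_load(&m->state) == REPORTING_REQUESTED)
		return;

	/* If the previous state was not IDLE, work is queued or running. */
	if (atomic_exchange(&m->state, REPORTING_REQUESTED) != REPORTING_IDLE)
		return;

	m->scheduled++;
}

/* Models the start of page_reporting_process(). */
static void model_work_begin(struct reporting_model *m)
{
	atomic_store(&m->state, REPORTING_ACTIVE);
}

/* Models the end of a pass: go idle unless a request arrived meanwhile. */
static void model_work_end(struct reporting_model *m)
{
	int expected = REPORTING_ACTIVE;

	if (!atomic_compare_exchange_strong(&m->state, &expected,
					    REPORTING_IDLE) &&
	    expected == REPORTING_REQUESTED)
		m->scheduled++;
}
```

In this model a request that lands while a pass is active does not queue work itself; the worker notices the changed state when it finishes and requeues, which appears to be the behaviour the patch is after.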
On Wed, 2020-02-19 at 14:55 +0000, Mel Gorman wrote:
> On Tue, Feb 11, 2020 at 02:46:35PM -0800, Alexander Duyck wrote:
> > diff --git a/mm/page_reporting.c b/mm/page_reporting.c
> > new file mode 100644
> > index 000000000000..1047c6872d4f
> > --- /dev/null
> > +++ b/mm/page_reporting.c
> > @@ -0,0 +1,319 @@
> > +// SPDX-License-Identifier: GPL-2.0
> > +#include <linux/mm.h>
> > +#include <linux/mmzone.h>
> > +#include <linux/page_reporting.h>
> > +#include <linux/gfp.h>
> > +#include <linux/export.h>
> > +#include <linux/delay.h>
> > +#include <linux/scatterlist.h>
> > +
> > +#include "page_reporting.h"
> > +#include "internal.h"
> > +
> > +#define PAGE_REPORTING_DELAY	(2 * HZ)
>
> I assume there is nothing special about 2 seconds other than "do some
> progress every so often".

Yes, nothing special. I played around with a few different values. I just
settled on 2 seconds as I figured with that and 1/16 of the list per pass
it came out to about 30 seconds which I felt is about the right time for a
fully utilized system to settle back to the inactive state.

> >
> > +static int
> > +page_reporting_cycle(struct page_reporting_dev_info *prdev, struct zone *zone,
> > +		     unsigned int order, unsigned int mt,
> > +		     struct scatterlist *sgl, unsigned int *offset)
> > +{
> > +	struct free_area *area = &zone->free_area[order];
> > +	struct list_head *list = &area->free_list[mt];
> > +	unsigned int page_len = PAGE_SIZE << order;
> > +	struct page *page, *next;
> > +	int err = 0;
> > +
> > +	/*
> > +	 * Perform early check, if free area is empty there is
> > +	 * nothing to process so we can skip this free_list.
> > +	 */
> > +	if (list_empty(list))
> > +		return err;
> > +
> > +	spin_lock_irq(&zone->lock);
> > +
> > +	/* loop through free list adding unreported pages to sg list */
> > +	list_for_each_entry_safe(page, next, list, lru) {
> > +		/* We are going to skip over the reported pages. */
> > +		if (PageReported(page))
> > +			continue;
> > +
> > +		/* Attempt to pull page from list */
> > +		if (!__isolate_free_page(page, order))
> > +			break;
> > +
>
> Might want to note that you are breaking because the only reason to fail
> the isolation is that watermarks are not met and we are likely under
> memory pressure. It's not a big issue.
>
> However, while I think this is correct, it's hard to follow. This loop can
> be broken out of with pages still on the scatter gather list. The current
> flow guarantees that err will not be set at this point so the caller
> cleans it up so we always drain the list either here or in the caller.

I can probably submit a follow-up patch to update the comments. The reason
for not returning an error is because I didn't consider it an error that
we encountered the watermark and were not able to pull any more pages.
Instead I considered that the "stop" point for this pass and have it just
exit out of the loop and flush the data. At the start of the next pass we
will check against the low watermark instead of the minimum watermark and
if that check fails we will simply stop reporting pages for the zone until
additional pages are freed.

I can probably also update the description for page_reporting_cycle since
it may not be clear that the output for this is a partially filled
in-progress scatterlist so we always have to report any remaining entries
at the end of processing a given zone. It might make more sense if I move
the bits related to "leftover" in page_reporting_process_zone into their
own function.

> While I think it works, it's a bit fragile. I recommend putting a comment
> above this noting why it's safe and put a VM_WARN_ON_ONCE(err) before the
> break in case someone tries to change this in a years time and does not
> spot that the flow to reach page_reporting_drain *somewhere* is critical.

I assume this isn't about this section, but the section below?

> > +		/* Add page to scatter list */
> > +		--(*offset);
> > +		sg_set_page(&sgl[*offset], page, page_len, 0);
> > +
> > +		/* If scatterlist isn't full grab more pages */
> > +		if (*offset)
> > +			continue;
> > +
> > +		/* release lock before waiting on report processing */
> > +		spin_unlock_irq(&zone->lock);
> > +
> > +		/* begin processing pages in local list */
> > +		err = prdev->report(prdev, sgl, PAGE_REPORTING_CAPACITY);
> > +

So one thing I can do is probably add a comment here as well to more
thoroughly explain the reason why we wait to call the break until we are
in the block below.

> > +		/* reset offset since the full list was reported */
> > +		*offset = PAGE_REPORTING_CAPACITY;
> > +
> > +		/* reacquire zone lock and resume processing */
> > +		spin_lock_irq(&zone->lock);
> > +
> > +		/* flush reported pages from the sg list */
> > +		page_reporting_drain(prdev, sgl, PAGE_REPORTING_CAPACITY, !err);
> > +
> > +		/*
> > +		 * Reset next to first entry, the old next isn't valid
> > +		 * since we dropped the lock to report the pages
> > +		 */
> > +		next = list_first_entry(list, struct page, lru);
> > +
> > +		/* exit on error */
> > +		if (err)
> > +			break;

And I assume you meant to add the VM_WARN_ON_ONCE here? The statement
above wouldn't make much sense since err would always be 0.

> > +	}
> > +
> > +	spin_unlock_irq(&zone->lock);
> > +
> > +	return err;
> > +}
>
> I complained about the use of zone lock before but in this version, I
> think I'm ok with it. The lock is held for the free list manipulations
> which is what it's for. The state management with atomics seems
> reasonable.
>
> Otherwise I think this is ok and I think the implementation right. Of
> great importance to me was the allocator fast paths but they seem to be
> adequately protected by a static branch so
>
> Acked-by: Mel Gorman <mgorman@techsingularity.net>
>
> The ack applies regardless of whether you decide to document and
> defensively protect page_reporting_cycle against losing pages on the
> scatter/gather list but I do recommend it.

Thanks for reviewing this. I appreciate the feedback.

- Alex
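The "about 30 seconds" figure in the reply above can be reproduced with a one-line calculation. This sketch assumes, as a simplification not spelled out in the thread, that exactly one pass runs per delay interval and that each pass handles 1/16 of the backlog; `settle_time_seconds()` is a hypothetical helper, not a kernel function.

```c
/*
 * Seconds until a backlog is fully reported, assuming one pass per
 * delay interval and 1/fraction of the backlog handled per pass.
 * Both assumptions are simplifications for illustration only.
 */
static unsigned int settle_time_seconds(unsigned int delay_s,
					unsigned int fraction)
{
	return delay_s * fraction;
}
```

With a 2 second delay and a fraction of 16 this gives 32 seconds, which is in line with the rough 30 second estimate.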
On Thu, Feb 20, 2020 at 10:44:21AM -0800, Alexander Duyck wrote:
> > > +static int
> > > +page_reporting_cycle(struct page_reporting_dev_info *prdev, struct zone *zone,
> > > +		     unsigned int order, unsigned int mt,
> > > +		     struct scatterlist *sgl, unsigned int *offset)
> > > +{
> > > +	struct free_area *area = &zone->free_area[order];
> > > +	struct list_head *list = &area->free_list[mt];
> > > +	unsigned int page_len = PAGE_SIZE << order;
> > > +	struct page *page, *next;
> > > +	int err = 0;
> > > +
> > > +	/*
> > > +	 * Perform early check, if free area is empty there is
> > > +	 * nothing to process so we can skip this free_list.
> > > +	 */
> > > +	if (list_empty(list))
> > > +		return err;
> > > +
> > > +	spin_lock_irq(&zone->lock);
> > > +
> > > +	/* loop through free list adding unreported pages to sg list */
> > > +	list_for_each_entry_safe(page, next, list, lru) {
> > > +		/* We are going to skip over the reported pages. */
> > > +		if (PageReported(page))
> > > +			continue;
> > > +
> > > +		/* Attempt to pull page from list */
> > > +		if (!__isolate_free_page(page, order))
> > > +			break;
> > > +
> >
> > Might want to note that you are breaking because the only reason to fail
> > the isolation is that watermarks are not met and we are likely under
> > memory pressure. It's not a big issue.
> >
> > However, while I think this is correct, it's hard to follow. This loop can
> > be broken out of with pages still on the scatter gather list. The current
> > flow guarantees that err will not be set at this point so the caller
> > cleans it up so we always drain the list either here or in the caller.
>
> I can probably submit a follow-up patch to update the comments. The reason
> for not returning an error is because I didn't consider it an error that
> we encountered the watermark and were not able to pull any more pages.
> Instead I considered that the "stop" point for this pass and have it just
> exit out of the loop and flush the data.
>

I don't consider it an error and I don't think you should return an
error. The comment just needs to explain that the draining happens in
the caller in this case. That should be enough of a warning to a future
developer to double check the flow after any changes to make sure the
drain is reached.

> > While I think it works, it's a bit fragile. I recommend putting a comment
> > above this noting why it's safe and put a VM_WARN_ON_ONCE(err) before the
> > break in case someone tries to change this in a years time and does not
> > spot that the flow to reach page_reporting_drain *somewhere* is critical.
>
> I assume this isn't about this section, but the section below?
>

I meant something like

	if (!__isolate_free_page(page, order)) {
		VM_WARN_ON_ONCE(err);
		break;
	}

Because at this point it's possible there are entries that should go
through page_reporting_drain() but the caller will not call
page_reporting_drain() in the event of an error.
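The invariant under discussion, that every isolated page reaches a drain either inside the cycle or in the caller, can be checked with a small userspace model. This is a sketch, not the kernel code: integers stand in for pages, `CAPACITY` stands in for PAGE_REPORTING_CAPACITY, and `struct cycle_model`, `model_cycle()` and `model_process()` are hypothetical names. The `isolate_budget` field simulates the watermark check failing after a number of isolations.

```c
#define CAPACITY 4	/* stands in for PAGE_REPORTING_CAPACITY */

/* Hypothetical bookkeeping for the model; not part of the patch. */
struct cycle_model {
	int isolate_budget;	/* isolations allowed before the "watermark" trips */
	int report_err;		/* value the report callback returns */
	int isolated;		/* pages pulled off the free list */
	int drained;		/* pages handed back through a drain */
};

/* Fill the list from the end; report and drain whenever it becomes full. */
static int model_cycle(struct cycle_model *m, int npages, int *sgl,
		       unsigned int *offset)
{
	int err = 0;

	for (int page = 1; page <= npages; page++) {
		/* Isolation failure: err is still 0, so the caller drains. */
		if (m->isolate_budget-- <= 0)
			break;

		m->isolated++;
		sgl[--(*offset)] = page;
		if (*offset)
			continue;

		/* Full list: report, then drain here regardless of err. */
		err = m->report_err;
		*offset = CAPACITY;
		m->drained += CAPACITY;
		if (err)
			break;
	}
	return err;
}

/* Caller: drains the partially filled tail only when no error came back. */
static int model_process(struct cycle_model *m, int npages)
{
	int sgl[CAPACITY] = { 0 };
	unsigned int offset = CAPACITY;
	int err = model_cycle(m, npages, sgl, &offset);

	if (err)
		return err;

	m->drained += CAPACITY - offset;	/* leftover entries */
	return 0;
}
```

In both exit paths of the model the isolated and drained counts match. The fragility being pointed out is that a future change setting `err` before the early `break` would skip the caller's drain and leave entries behind.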
On Thu, 2020-02-20 at 22:35 +0000, Mel Gorman wrote:
> On Thu, Feb 20, 2020 at 10:44:21AM -0800, Alexander Duyck wrote:
> > > > +static int
> > > > +page_reporting_cycle(struct page_reporting_dev_info *prdev, struct zone *zone,
> > > > +		     unsigned int order, unsigned int mt,
> > > > +		     struct scatterlist *sgl, unsigned int *offset)
> > > > +{
> > > > +	struct free_area *area = &zone->free_area[order];
> > > > +	struct list_head *list = &area->free_list[mt];
> > > > +	unsigned int page_len = PAGE_SIZE << order;
> > > > +	struct page *page, *next;
> > > > +	int err = 0;
> > > > +
> > > > +	/*
> > > > +	 * Perform early check, if free area is empty there is
> > > > +	 * nothing to process so we can skip this free_list.
> > > > +	 */
> > > > +	if (list_empty(list))
> > > > +		return err;
> > > > +
> > > > +	spin_lock_irq(&zone->lock);
> > > > +
> > > > +	/* loop through free list adding unreported pages to sg list */
> > > > +	list_for_each_entry_safe(page, next, list, lru) {
> > > > +		/* We are going to skip over the reported pages. */
> > > > +		if (PageReported(page))
> > > > +			continue;
> > > > +
> > > > +		/* Attempt to pull page from list */
> > > > +		if (!__isolate_free_page(page, order))
> > > > +			break;
> > > > +
> > >
> > > Might want to note that you are breaking because the only reason to fail
> > > the isolation is that watermarks are not met and we are likely under
> > > memory pressure. It's not a big issue.
> > >
> > > However, while I think this is correct, it's hard to follow. This loop can
> > > be broken out of with pages still on the scatter gather list. The current
> > > flow guarantees that err will not be set at this point so the caller
> > > cleans it up so we always drain the list either here or in the caller.
> >
> > I can probably submit a follow-up patch to update the comments. The reason
> > for not returning an error is because I didn't consider it an error that
> > we encountered the watermark and were not able to pull any more pages.
> > Instead I considered that the "stop" point for this pass and have it just
> > exit out of the loop and flush the data.
> >
>
> I don't consider it an error and I don't think you should return an
> error. The comment just needs to explain that the draining happens in
> the caller in this case. That should be enough of a warning to a future
> developer to double check the flow after any changes to make sure the
> drain is reached.

The comment I can do, that shouldn't be an issue. The point I was getting
at is that a separate drain call is expected for this any time the
function is not returning an error, and the only way it can return an
error is if there was a reporting issue.

> > > While I think it works, it's a bit fragile. I recommend putting a comment
> > > above this noting why it's safe and put a VM_WARN_ON_ONCE(err) before the
> > > break in case someone tries to change this in a years time and does not
> > > spot that the flow to reach page_reporting_drain *somewhere* is critical.
> >
> > I assume this isn't about this section, but the section below?
> >
>
> I meant something like
>
> 	if (!__isolate_free_page(page, order)) {
> 		VM_WARN_ON_ONCE(err);
> 		break;
> 	}
>
> Because at this point it's possible there are entries that should go
> through page_reporting_drain() but the caller will not call
> page_reporting_drain() in the event of an error.

I would think adding that would confuse things even more. There is a break
statement at the end of the loop that will break out if err is set. So we
should never hit the VM_WARN_ON_ONCE because err should always be 0 before
we even attempt to isolate the page.

I think something like the following would probably make more sense:

		err = page_reporting_cycle(prdev, zone, order, mt,
					   sgl, &offset);
		if (err) {
			/*
			 * We should have drained the scatterlist
			 * prior to exiting page_reporting_cycle if
			 * we encountered an error. If we did not
			 * then this could result in a memory leak.
			 * Verify that the end of the scatterlist
			 * was cleared prior to us getting here.
			 */
			sgl = &sgl[PAGE_REPORTING_CAPACITY - 1];
			VM_WARN_ON_ONCE(sg_page(sgl));
			return err;
		}

With that we are more-or-less making certain that they called
page_reporting_drain which will zero the scatterlist.
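Why checking only the last scatterlist entry is enough can be seen in a userspace model. Because the list is filled from the end (`--(*offset)` starting at the capacity), the last slot is the first one populated, so it is non-empty whenever anything is still on the list. The sketch below assumes a simplified entry type; `struct model_entry`, `model_add()`, `model_drain()` and `model_has_leftovers()` are hypothetical names, and `memset()` stands in for the reinitialisation that `sg_init_table()` performs in the real drain.

```c
#include <string.h>

#define CAPACITY 4	/* stands in for PAGE_REPORTING_CAPACITY */

/* Simplified stand-in for struct scatterlist. */
struct model_entry {
	const void *page;
};

/* Add a page the way the cycle does: fill from the end of the array. */
static void model_add(struct model_entry *sgl, unsigned int *offset,
		      const void *page)
{
	sgl[--(*offset)].page = page;
}

/* Model of the drain's final step: the table is reinitialised. */
static void model_drain(struct model_entry *sgl, unsigned int *offset)
{
	memset(sgl, 0, CAPACITY * sizeof(*sgl));
	*offset = CAPACITY;
}

/* The condition the proposed VM_WARN_ON_ONCE(sg_page(sgl)) would flag. */
static int model_has_leftovers(const struct model_entry *sgl)
{
	return sgl[CAPACITY - 1].page != NULL;
}
```

A single added page already makes `model_has_leftovers()` report non-zero, and it returns to zero only after a drain, which is the property the proposed check relies on.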
On Fri, Feb 21, 2020 at 11:25:49AM -0800, Alexander Duyck wrote:
> On Thu, 2020-02-20 at 22:35 +0000, Mel Gorman wrote:
> > On Thu, Feb 20, 2020 at 10:44:21AM -0800, Alexander Duyck wrote:
> > > > > +static int
> > > > > +page_reporting_cycle(struct page_reporting_dev_info *prdev, struct zone *zone,
> > > > > +		     unsigned int order, unsigned int mt,
> > > > > +		     struct scatterlist *sgl, unsigned int *offset)
> > > > > +{
> > > > > +	struct free_area *area = &zone->free_area[order];
> > > > > +	struct list_head *list = &area->free_list[mt];
> > > > > +	unsigned int page_len = PAGE_SIZE << order;
> > > > > +	struct page *page, *next;
> > > > > +	int err = 0;
> > > > > +
> > > > > +	/*
> > > > > +	 * Perform early check, if free area is empty there is
> > > > > +	 * nothing to process so we can skip this free_list.
> > > > > +	 */
> > > > > +	if (list_empty(list))
> > > > > +		return err;
> > > > > +
> > > > > +	spin_lock_irq(&zone->lock);
> > > > > +
> > > > > +	/* loop through free list adding unreported pages to sg list */
> > > > > +	list_for_each_entry_safe(page, next, list, lru) {
> > > > > +		/* We are going to skip over the reported pages. */
> > > > > +		if (PageReported(page))
> > > > > +			continue;
> > > > > +
> > > > > +		/* Attempt to pull page from list */
> > > > > +		if (!__isolate_free_page(page, order))
> > > > > +			break;
> > > > > +
> > > >
> > > > Might want to note that you are breaking because the only reason to fail
> > > > the isolation is that watermarks are not met and we are likely under
> > > > memory pressure. It's not a big issue.
> > > >
> > > > However, while I think this is correct, it's hard to follow. This loop can
> > > > be broken out of with pages still on the scatter gather list. The current
> > > > flow guarantees that err will not be set at this point so the caller
> > > > cleans it up so we always drain the list either here or in the caller.
> > >
> > > I can probably submit a follow-up patch to update the comments. The reason
> > > for not returning an error is because I didn't consider it an error that
> > > we encountered the watermark and were not able to pull any more pages.
> > > Instead I considered that the "stop" point for this pass and have it just
> > > exit out of the loop and flush the data.
> > >
> >
> > I don't consider it an error and I don't think you should return an
> > error. The comment just needs to explain that the draining happens in
> > the caller in this case. That should be enough of a warning to a future
> > developer to double check the flow after any changes to make sure the
> > drain is reached.
>
> The comment I can do, that shouldn't be an issue. The point I was getting
> at is that a separate drain call is expected for this any time the
> function is not returning an error, and the only way it can return an
> error is if there was a reporting issue.
>

I'm not suggesting you return an error. I'm suggesting you put a warn in
before you break due to watermarks *if* there is an error. It should
*never* trigger unless someone modifies the flow and breaks it in which
case the warning will not kill the system but give a strong hint to the
developer that they need to think a bit more.

It's ok to leave it out because at this point, it's a distraction and I
do not see a problem with the current code.
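The "warn but do not kill the system" behaviour described above can be sketched in userspace. The macro below is a hypothetical model, not the kernel's implementation: it keeps one static flag per call site so the message is printed once however often the condition recurs, and execution simply continues. In the kernel, VM_WARN_ON_ONCE() only emits a warning when CONFIG_DEBUG_VM is enabled; that configuration dependency is not modelled here.

```c
#include <stdbool.h>
#include <stdio.h>

static unsigned int model_warnings;	/* total warnings emitted */

/* Hypothetical userspace model of a warn-once check. */
#define MODEL_WARN_ON_ONCE(cond)					\
	do {								\
		static bool warned;	/* one flag per call site */	\
		if ((cond) && !warned) {				\
			warned = true;					\
			model_warnings++;				\
			fprintf(stderr, "warning: %s\n", #cond);	\
		}							\
	} while (0)

/* Hypothetical caller that reaches the same call site repeatedly. */
static int model_break_path(int err, int attempts)
{
	int passes = 0;

	for (int i = 0; i < attempts; i++) {
		MODEL_WARN_ON_ONCE(err);	/* warns at most once */
		passes++;			/* execution continues */
	}
	return passes;
}
```

Calling `model_break_path()` with a non-zero `err` still completes every pass and raises the warning count by one only, which matches the intent of a hint to the developer rather than a hard failure.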
diff --git a/include/linux/page-flags.h b/include/linux/page-flags.h
index 1bf83c8fcaa7..49c2697046b9 100644
--- a/include/linux/page-flags.h
+++ b/include/linux/page-flags.h
@@ -163,6 +163,9 @@ enum pageflags {
 
 	/* non-lru isolated movable page */
 	PG_isolated = PG_reclaim,
+
+	/* Only valid for buddy pages. Used to track pages that are reported */
+	PG_reported = PG_uptodate,
 };
 
 #ifndef __GENERATING_BOUNDS_H
@@ -432,6 +435,14 @@ static inline bool set_hwpoison_free_buddy_page(struct page *page)
 #endif
 
 /*
+ * PageReported() is used to track reported free pages within the Buddy
+ * allocator. We can use the non-atomic version of the test and set
+ * operations as both should be shielded with the zone lock to prevent
+ * any possible races on the setting or clearing of the bit.
+ */
+__PAGEFLAG(Reported, reported, PF_NO_COMPOUND)
+
+/*
  * On an anonymous page mapped into a user virtual memory area,
  * page->mapping points to its anon_vma, not to a struct address_space;
  * with the PAGE_MAPPING_ANON bit set to distinguish it. See rmap.h.
diff --git a/include/linux/page_reporting.h b/include/linux/page_reporting.h
new file mode 100644
index 000000000000..32355486f572
--- /dev/null
+++ b/include/linux/page_reporting.h
@@ -0,0 +1,25 @@
+/* SPDX-License-Identifier: GPL-2.0 */
+#ifndef _LINUX_PAGE_REPORTING_H
+#define _LINUX_PAGE_REPORTING_H
+
+#include <linux/mmzone.h>
+#include <linux/scatterlist.h>
+
+#define PAGE_REPORTING_CAPACITY		32
+
+struct page_reporting_dev_info {
+	/* function that alters pages to make them "reported" */
+	int (*report)(struct page_reporting_dev_info *prdev,
+		      struct scatterlist *sg, unsigned int nents);
+
+	/* work struct for processing reports */
+	struct delayed_work work;
+
+	/* Current state of page reporting */
+	atomic_t state;
+};
+
+/* Tear-down and bring-up for page reporting devices */
+void page_reporting_unregister(struct page_reporting_dev_info *prdev);
+int page_reporting_register(struct page_reporting_dev_info *prdev);
+#endif /*_LINUX_PAGE_REPORTING_H */
diff --git a/mm/Kconfig b/mm/Kconfig
index ab80933be65f..d40a873402ff 100644
--- a/mm/Kconfig
+++ b/mm/Kconfig
@@ -237,6 +237,17 @@ config COMPACTION
 	  linux-mm@kvack.org.
 
 #
+# support for free page reporting
+config PAGE_REPORTING
+	bool "Free page reporting"
+	def_bool n
+	help
+	  Free page reporting allows for the incremental acquisition of
+	  free pages from the buddy allocator for the purpose of reporting
+	  those pages to another entity, such as a hypervisor, so that the
+	  memory can be freed within the host for other uses.
+
+#
 # support for page migration
 #
 config MIGRATION
diff --git a/mm/Makefile b/mm/Makefile
index c9696f3ec840..7b5eec34d0e9 100644
--- a/mm/Makefile
+++ b/mm/Makefile
@@ -118,3 +118,4 @@ obj-$(CONFIG_HMM_MIRROR) += hmm.o
 obj-$(CONFIG_MEMFD_CREATE) += memfd.o
 obj-$(CONFIG_MAPPING_DIRTY_HELPERS) += mapping_dirty_helpers.o
 obj-$(CONFIG_PTDUMP_CORE) += ptdump.o
+obj-$(CONFIG_PAGE_REPORTING) += page_reporting.o
diff --git a/mm/page_alloc.c b/mm/page_alloc.c
index b711fc0159d9..1142b2f91377 100644
--- a/mm/page_alloc.c
+++ b/mm/page_alloc.c
@@ -74,6 +74,7 @@
 #include <asm/div64.h>
 #include "internal.h"
 #include "shuffle.h"
+#include "page_reporting.h"
 
 /* prevent >1 _updater_ of zone percpu pageset ->high and ->batch fields */
 static DEFINE_MUTEX(pcp_batch_high_lock);
@@ -902,6 +903,10 @@ static inline void move_to_free_list(struct page *page, struct zone *zone,
 static inline void del_page_from_free_list(struct page *page, struct zone *zone,
 					   unsigned int order)
 {
+	/* clear reported state and update reported page count */
+	if (page_reported(page))
+		__ClearPageReported(page);
+
 	list_del(&page->lru);
 	__ClearPageBuddy(page);
 	set_page_private(page, 0);
@@ -965,7 +970,7 @@ static inline void del_page_from_free_list(struct page *page, struct zone *zone,
 static inline void __free_one_page(struct page *page,
 		unsigned long pfn,
 		struct zone *zone, unsigned int order,
-		int migratetype)
+		int migratetype, bool report)
 {
 	struct capture_control *capc = task_capc(zone);
 	unsigned long uninitialized_var(buddy_pfn);
@@ -1050,6 +1055,10 @@ static inline void __free_one_page(struct page *page,
 		add_to_free_list_tail(page, zone, order, migratetype);
 	else
 		add_to_free_list(page, zone, order, migratetype);
+
+	/* Notify page reporting subsystem of freed page */
+	if (report)
+		page_reporting_notify_free(order);
 }
 
 /*
@@ -1366,7 +1375,7 @@ static void free_pcppages_bulk(struct zone *zone, int count,
 		if (unlikely(isolated_pageblocks))
 			mt = get_pageblock_migratetype(page);
 
-		__free_one_page(page, page_to_pfn(page), zone, 0, mt);
+		__free_one_page(page, page_to_pfn(page), zone, 0, mt, true);
 		trace_mm_page_pcpu_drain(page, 0, mt);
 	}
 	spin_unlock(&zone->lock);
@@ -1382,7 +1391,7 @@ static void free_one_page(struct zone *zone,
 		is_migrate_isolate(migratetype))) {
 		migratetype = get_pfnblock_migratetype(page, pfn);
 	}
-	__free_one_page(page, pfn, zone, order, migratetype);
+	__free_one_page(page, pfn, zone, order, migratetype, true);
 	spin_unlock(&zone->lock);
 }
 
@@ -3233,7 +3242,7 @@ void __putback_isolated_page(struct page *page, unsigned int order, int mt)
 	lockdep_assert_held(&zone->lock);
 
 	/* Return isolated page to tail of freelist. */
-	__free_one_page(page, page_to_pfn(page), zone, order, mt);
+	__free_one_page(page, page_to_pfn(page), zone, order, mt, false);
 }
 
 /*
diff --git a/mm/page_reporting.c b/mm/page_reporting.c
new file mode 100644
index 000000000000..1047c6872d4f
--- /dev/null
+++ b/mm/page_reporting.c
@@ -0,0 +1,319 @@
+// SPDX-License-Identifier: GPL-2.0
+#include <linux/mm.h>
+#include <linux/mmzone.h>
+#include <linux/page_reporting.h>
+#include <linux/gfp.h>
+#include <linux/export.h>
+#include <linux/delay.h>
+#include <linux/scatterlist.h>
+
+#include "page_reporting.h"
+#include "internal.h"
+
+#define PAGE_REPORTING_DELAY	(2 * HZ)
+static struct page_reporting_dev_info __rcu *pr_dev_info __read_mostly;
+
+enum {
+	PAGE_REPORTING_IDLE = 0,
+	PAGE_REPORTING_REQUESTED,
+	PAGE_REPORTING_ACTIVE
+};
+
+/* request page reporting */
+static void
+__page_reporting_request(struct page_reporting_dev_info *prdev)
+{
+	unsigned int state;
+
+	/* Check to see if we are in desired state */
+	state = atomic_read(&prdev->state);
+	if (state == PAGE_REPORTING_REQUESTED)
+		return;
+
+	/*
+	 * If reporting is already active there is nothing we need to do.
+	 * Test against 0 as that represents PAGE_REPORTING_IDLE.
+	 */
+	state = atomic_xchg(&prdev->state, PAGE_REPORTING_REQUESTED);
+	if (state != PAGE_REPORTING_IDLE)
+		return;
+
+	/*
+	 * Delay the start of work to allow a sizable queue to build. For
+	 * now we are limiting this to running no more than once every
+	 * couple of seconds.
+	 */
+	schedule_delayed_work(&prdev->work, PAGE_REPORTING_DELAY);
+}
+
+/* notify prdev of free page reporting request */
+void __page_reporting_notify(void)
+{
+	struct page_reporting_dev_info *prdev;
+
+	/*
+	 * We use RCU to protect the pr_dev_info pointer. In almost all
+	 * cases this should be present, however in the unlikely case of
+	 * a shutdown this will be NULL and we should exit.
+	 */
+	rcu_read_lock();
+	prdev = rcu_dereference(pr_dev_info);
+	if (likely(prdev))
+		__page_reporting_request(prdev);
+
+	rcu_read_unlock();
+}
+
+static void
+page_reporting_drain(struct page_reporting_dev_info *prdev,
+		     struct scatterlist *sgl, unsigned int nents, bool reported)
+{
+	struct scatterlist *sg = sgl;
+
+	/*
+	 * Drain the now reported pages back into their respective
+	 * free lists/areas. We assume at least one page is populated.
+	 */
+	do {
+		struct page *page = sg_page(sg);
+		int mt = get_pageblock_migratetype(page);
+		unsigned int order = get_order(sg->length);
+
+		__putback_isolated_page(page, order, mt);
+
+		/* If the pages were not reported due to error skip flagging */
+		if (!reported)
+			continue;
+
+		/*
+		 * If page was not comingled with another page we can
+		 * consider the result to be "reported" since the page
+		 * hasn't been modified, otherwise we will need to
+		 * report on the new larger page when we make our way
+		 * up to that higher order.
+		 */
+		if (PageBuddy(page) && page_order(page) == order)
+			__SetPageReported(page);
+	} while ((sg = sg_next(sg)));
+
+	/* reinitialize scatterlist now that it is empty */
+	sg_init_table(sgl, nents);
+}
+
+/*
+ * The page reporting cycle consists of 4 stages, fill, report, drain, and
+ * idle. We will cycle through the first 3 stages until we cannot obtain a
+ * full scatterlist of pages, in that case we will switch to idle.
+ */
+static int
+page_reporting_cycle(struct page_reporting_dev_info *prdev, struct zone *zone,
+		     unsigned int order, unsigned int mt,
+		     struct scatterlist *sgl, unsigned int *offset)
+{
+	struct free_area *area = &zone->free_area[order];
+	struct list_head *list = &area->free_list[mt];
+	unsigned int page_len = PAGE_SIZE << order;
+	struct page *page, *next;
+	int err = 0;
+
+	/*
+	 * Perform early check, if free area is empty there is
+	 * nothing to process so we can skip this free_list.
+	 */
+	if (list_empty(list))
+		return err;
+
+	spin_lock_irq(&zone->lock);
+
+	/* loop through free list adding unreported pages to sg list */
+	list_for_each_entry_safe(page, next, list, lru) {
+		/* We are going to skip over the reported pages. */
+		if (PageReported(page))
+			continue;
+
+		/* Attempt to pull page from list */
+		if (!__isolate_free_page(page, order))
+			break;
+
+		/* Add page to scatter list */
+		--(*offset);
+		sg_set_page(&sgl[*offset], page, page_len, 0);
+
+		/* If scatterlist isn't full grab more pages */
+		if (*offset)
+			continue;
+
+		/* release lock before waiting on report processing */
+		spin_unlock_irq(&zone->lock);
+
+		/* begin processing pages in local list */
+		err = prdev->report(prdev, sgl, PAGE_REPORTING_CAPACITY);
+
+		/* reset offset since the full list was reported */
+		*offset = PAGE_REPORTING_CAPACITY;
+
+		/* reacquire zone lock and resume processing */
+		spin_lock_irq(&zone->lock);
+
+		/* flush reported pages from the sg list */
+		page_reporting_drain(prdev, sgl, PAGE_REPORTING_CAPACITY, !err);
+
+		/*
+		 * Reset next to first entry, the old next isn't valid
+		 * since we dropped the lock to report the pages
+		 */
+		next = list_first_entry(list, struct page, lru);
+
+		/* exit on error */
+		if (err)
+			break;
+	}
+
+	spin_unlock_irq(&zone->lock);
+
+	return err;
+}
+
+static int
+page_reporting_process_zone(struct page_reporting_dev_info *prdev,
+			    struct scatterlist *sgl, struct zone *zone)
+{
+	unsigned int order, mt, leftover, offset = PAGE_REPORTING_CAPACITY;
+	unsigned long watermark;
+	int err = 0;
+
+	/* Generate minimum watermark to be able to guarantee progress */
+	watermark = low_wmark_pages(zone) +
+		    (PAGE_REPORTING_CAPACITY << PAGE_REPORTING_MIN_ORDER);
+
+	/*
+	 * Cancel request if insufficient free memory or if we failed
+	 * to allocate page reporting statistics for the zone.
+	 */
+	if (!zone_watermark_ok(zone, 0, watermark, 0, ALLOC_CMA))
+		return err;
+
+	/* Process each free list starting from lowest order/mt */
+	for (order = PAGE_REPORTING_MIN_ORDER; order < MAX_ORDER; order++) {
+		for (mt = 0; mt < MIGRATE_TYPES; mt++) {
+			/* We do not pull pages from the isolate free list */
+			if (is_migrate_isolate(mt))
+				continue;
+
+			err = page_reporting_cycle(prdev, zone, order, mt,
+						   sgl, &offset);
+			if (err)
+				return err;
+		}
+	}
+
+	/* report the leftover pages before going idle */
+	leftover = PAGE_REPORTING_CAPACITY - offset;
+	if (leftover) {
+		sgl = &sgl[offset];
+		err = prdev->report(prdev, sgl, leftover);
+
+		/* flush any remaining pages out from the last report */
+		spin_lock_irq(&zone->lock);
+		page_reporting_drain(prdev, sgl, leftover, !err);
+		spin_unlock_irq(&zone->lock);
+	}
+
+	return err;
+}
+
+static void page_reporting_process(struct work_struct *work)
+{
+	struct delayed_work *d_work = to_delayed_work(work);
+	struct page_reporting_dev_info *prdev =
+		container_of(d_work, struct page_reporting_dev_info, work);
+	int err = 0, state = PAGE_REPORTING_ACTIVE;
+	struct scatterlist *sgl;
+	struct zone *zone;
+
+	/*
+	 * Change the state to "Active" so that we can track if there is
+	 * anyone requests page reporting after we complete our pass. If
+	 * the state is not altered by the end of the pass we will switch
+	 * to idle and quit scheduling reporting runs.
+ */ + atomic_set(&prdev->state, state); + + /* allocate scatterlist to store pages being reported on */ + sgl = kmalloc_array(PAGE_REPORTING_CAPACITY, sizeof(*sgl), GFP_KERNEL); + if (!sgl) + goto err_out; + + sg_init_table(sgl, PAGE_REPORTING_CAPACITY); + + for_each_zone(zone) { + err = page_reporting_process_zone(prdev, sgl, zone); + if (err) + break; + } + + kfree(sgl); +err_out: + /* + * If the state has reverted back to requested then there may be + * additional pages to be processed. We will defer for 2s to allow + * more pages to accumulate. + */ + state = atomic_cmpxchg(&prdev->state, state, PAGE_REPORTING_IDLE); + if (state == PAGE_REPORTING_REQUESTED) + schedule_delayed_work(&prdev->work, PAGE_REPORTING_DELAY); +} + +static DEFINE_MUTEX(page_reporting_mutex); +DEFINE_STATIC_KEY_FALSE(page_reporting_enabled); + +int page_reporting_register(struct page_reporting_dev_info *prdev) +{ + int err = 0; + + mutex_lock(&page_reporting_mutex); + + /* nothing to do if already in use */ + if (rcu_access_pointer(pr_dev_info)) { + err = -EBUSY; + goto err_out; + } + + /* initialize state and work structures */ + atomic_set(&prdev->state, PAGE_REPORTING_IDLE); + INIT_DELAYED_WORK(&prdev->work, &page_reporting_process); + + /* Begin initial flush of zones */ + __page_reporting_request(prdev); + + /* Assign device to allow notifications */ + rcu_assign_pointer(pr_dev_info, prdev); + + /* enable page reporting notification */ + if (!static_key_enabled(&page_reporting_enabled)) { + static_branch_enable(&page_reporting_enabled); + pr_info("Free page reporting enabled\n"); + } +err_out: + mutex_unlock(&page_reporting_mutex); + + return err; +} +EXPORT_SYMBOL_GPL(page_reporting_register); + +void page_reporting_unregister(struct page_reporting_dev_info *prdev) +{ + mutex_lock(&page_reporting_mutex); + + if (rcu_access_pointer(pr_dev_info) == prdev) { + /* Disable page reporting notification */ + RCU_INIT_POINTER(pr_dev_info, NULL); + synchronize_rcu(); + + /* Flush any 
existing work, and lock it out */ + cancel_delayed_work_sync(&prdev->work); + } + + mutex_unlock(&page_reporting_mutex); +} +EXPORT_SYMBOL_GPL(page_reporting_unregister); diff --git a/mm/page_reporting.h b/mm/page_reporting.h new file mode 100644 index 000000000000..aa6d37f4dc22 --- /dev/null +++ b/mm/page_reporting.h @@ -0,0 +1,54 @@ +/* SPDX-License-Identifier: GPL-2.0 */ +#ifndef _MM_PAGE_REPORTING_H +#define _MM_PAGE_REPORTING_H + +#include <linux/mmzone.h> +#include <linux/pageblock-flags.h> +#include <linux/page-isolation.h> +#include <linux/jump_label.h> +#include <linux/slab.h> +#include <asm/pgtable.h> +#include <linux/scatterlist.h> + +#define PAGE_REPORTING_MIN_ORDER pageblock_order + +#ifdef CONFIG_PAGE_REPORTING +DECLARE_STATIC_KEY_FALSE(page_reporting_enabled); +void __page_reporting_notify(void); + +static inline bool page_reported(struct page *page) +{ + return static_branch_unlikely(&page_reporting_enabled) && + PageReported(page); +} + +/** + * page_reporting_notify_free - Free page notification to start page processing + * + * This function is meant to act as a screener for __page_reporting_notify + * which will determine if a give zone has crossed over the high-water mark + * that will justify us beginning page treatment. If we have crossed that + * threshold then it will start the process of pulling some pages and + * placing them in the batch list for treatment. 
+ */ +static inline void page_reporting_notify_free(unsigned int order) +{ + /* Called from hot path in __free_one_page() */ + if (!static_branch_unlikely(&page_reporting_enabled)) + return; + + /* Determine if we have crossed reporting threshold */ + if (order < PAGE_REPORTING_MIN_ORDER) + return; + + /* This will add a few cycles, but should be called infrequently */ + __page_reporting_notify(); +} +#else /* CONFIG_PAGE_REPORTING */ +#define page_reported(_page) false + +static inline void page_reporting_notify_free(unsigned int order) +{ +} +#endif /* CONFIG_PAGE_REPORTING */ +#endif /*_MM_PAGE_REPORTING_H */
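
For anyone trying to follow the IDLE/REQUESTED/ACTIVE handling in __page_reporting_request() and page_reporting_process(), here is a rough user-space model of the transitions. It is only a sketch, not kernel code: it assumes C11 <stdatomic.h> in place of the kernel atomic_t helpers, and struct model_dev, the model_* functions and the "scheduled" counter are hypothetical names that stand in for page_reporting_dev_info and schedule_delayed_work().

```c
#include <assert.h>
#include <stdatomic.h>
#include <stdbool.h>

/* Same three states as the patch; everything else here is a model. */
enum {
	PAGE_REPORTING_IDLE = 0,
	PAGE_REPORTING_REQUESTED,
	PAGE_REPORTING_ACTIVE
};

/* Hypothetical stand-in for struct page_reporting_dev_info. */
struct model_dev {
	atomic_int state;
	int scheduled;	/* times schedule_delayed_work() would have run */
};

/* Models __page_reporting_request(): only IDLE -> REQUESTED schedules. */
static void model_request(struct model_dev *dev)
{
	int state = atomic_load(&dev->state);

	/* Already requested, nothing to do. */
	if (state == PAGE_REPORTING_REQUESTED)
		return;

	/* If a pass is ACTIVE we only flag the request; the worker sees it. */
	state = atomic_exchange(&dev->state, PAGE_REPORTING_REQUESTED);
	if (state != PAGE_REPORTING_IDLE)
		return;

	dev->scheduled++;
}

/* Models the top of page_reporting_process(). */
static void model_begin_pass(struct model_dev *dev)
{
	atomic_store(&dev->state, PAGE_REPORTING_ACTIVE);
}

/*
 * Models the bottom of page_reporting_process(): go IDLE only if the
 * state is still ACTIVE. Returns true if another pass was scheduled.
 */
static bool model_end_pass(struct model_dev *dev)
{
	int expected = PAGE_REPORTING_ACTIVE;

	if (atomic_compare_exchange_strong(&dev->state, &expected,
					   PAGE_REPORTING_IDLE))
		return false;

	/* On failure "expected" holds the state that was observed. */
	if (expected == PAGE_REPORTING_REQUESTED) {
		dev->scheduled++;
		return true;
	}

	return false;
}

/* Walks one scenario and returns how many times work was scheduled. */
int model_demo(void)
{
	struct model_dev dev = { .state = PAGE_REPORTING_IDLE, .scheduled = 0 };

	model_request(&dev);		/* IDLE -> REQUESTED, schedules */
	model_request(&dev);		/* duplicate request, no-op */
	model_begin_pass(&dev);		/* worker starts: ACTIVE */
	model_request(&dev);		/* ACTIVE -> REQUESTED, no schedule */
	assert(model_end_pass(&dev));	/* worker notices and reschedules */
	model_begin_pass(&dev);
	assert(!model_end_pass(&dev));	/* quiet pass, back to IDLE */

	return dev.scheduled;
}
```

Calling model_demo() from a main() walks through a request that arrives mid-pass. The point it illustrates is that a free which lands while the worker is ACTIVE never schedules work itself; the worker picks it up through the failed compare-and-exchange at the end of its pass.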