[v2,4/5] mm: Refactor release_pages()

Message ID 20230830095011.1228673-5-ryan.roberts@arm.com (mailing list archive)
State New
Series Optimize mmap_exit for large folios

Commit Message

Ryan Roberts Aug. 30, 2023, 9:50 a.m. UTC
In preparation for implementing folios_put_refs() in the next patch,
refactor release_pages() into a set of helper functions, which can be
reused. The primary difference between release_pages() and
folios_put_refs() is how they iterate over the set of folios. The
per-folio actions are identical.

No functional change intended.

Signed-off-by: Ryan Roberts <ryan.roberts@arm.com>
---
 mm/swap.c | 167 +++++++++++++++++++++++++++++++-----------------------
 1 file changed, 97 insertions(+), 70 deletions(-)
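
To make the intent concrete, here is a rough sketch of how the next patch could
build folios_put_refs() on top of the helpers introduced below. The name and
signature are illustrative only and are not taken from patch 5/5:

	/*
	 * Sketch only: iterate an explicit array, dropping refs[i] references
	 * per folio (or one reference each if refs is NULL).
	 */
	void folios_put_refs(struct folio **folios, int *refs, int nr)
	{
		struct folios_put_refs_ctx ctx;
		int i;

		__folios_put_refs_init(&ctx);
		for (i = 0; i < nr; i++)
			__folios_put_refs_do_one(&ctx, folios[i],
						 refs ? refs[i] : 1);
		__folios_put_refs_complete(&ctx);
	}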

Comments

Matthew Wilcox (Oracle) Aug. 30, 2023, 7:11 p.m. UTC | #1
On Wed, Aug 30, 2023 at 10:50:10AM +0100, Ryan Roberts wrote:
> In preparation for implementing folios_put_refs() in the next patch,
> refactor release_pages() into a set of helper functions, which can be
> reused. The primary difference between release_pages() and
> folios_put_refs() is how they iterate over the set of folios. The
> per-folio actions are identical.

As you noted, we have colliding patchsets.  I'm not hugely happy with
how patch 4 turned out, so I thought I'd send some addendum patches to
my RFC series that implement pfn_range_put() (maybe should have been
pfn_ranges_put()?) on top of my patch series.  I think it's a bit nicer,
but not quite as nice as it could be.

I'm thinking about doing ...

void release_unref_folios(struct folio_batch *folios)
{
	struct lruvec *lruvec = NULL;
	unsigned long flags = 0;
	int i;

	for (i = 0; i < folios->nr; i++) {
		struct folio *folio = folios->folios[i];
		free_swap_cache(folio);
		__page_cache_release(folio, &lruvec, &flags);
	}
	mem_cgroup_uncharge_folios(folios);
	free_unref_folios(folios);
}

then this becomes:

void folios_put(struct folio_batch *folios)
{
        int i, j;

        for (i = 0, j = 0; i < folios->nr; i++) {
                struct folio *folio = folios->folios[i];

                if (is_huge_zero_page(&folio->page))
                        continue;
                if (folio_is_zone_device(folio)) {
                       if (put_devmap_managed_page(&folio->page))
                                continue;
                        if (folio_put_testzero(folio))
                                free_zone_device_page(&folio->page);
                        continue;
                }

                if (!folio_put_testzero(folio))
                        continue;
                if (folio_test_hugetlb(folio)) {
                       free_huge_folio(folio);
                        continue;
                }

                if (j != i)
                        folios->folios[j] = folio;
                j++;
        }

        folios->nr = j;
        if (!j)
                return;
	release_unref_folios(folios);
}

and pfn_range_put() also becomes shorter and loses all the lruvec work.

Thoughts?
Matthew Wilcox (Oracle) Aug. 30, 2023, 9:17 p.m. UTC | #2
On Wed, Aug 30, 2023 at 08:11:07PM +0100, Matthew Wilcox wrote:
> I'm thinking about doing ...
> 
> void release_unref_folios(struct folio_batch *folios)
> {
> 	struct lruvec *lruvec = NULL;
> 	unsigned long flags = 0;
> 	int i;
> 
> 	for (i = 0; i < folios->nr; i++) {
> 		struct folio *folio = folios->folios[i];
> 		free_swap_cache(folio);

No, can't do that here.  Swap cache has refs on the folio, so it'll
never trigger.

> 		__page_cache_release(folio, &lruvec, &flags);
> 	}
> 	mem_cgroup_uncharge_folios(folios);
> 	free_unref_folios(folios);
> }
> 
> then this becomes:
> 
> void folios_put(struct folio_batch *folios)
> {
>         int i, j;
> 
>         for (i = 0, j = 0; i < folios->nr; i++) {
>                 struct folio *folio = folios->folios[i];
> 
>                 if (is_huge_zero_page(&folio->page))
>                         continue;
>                 if (folio_is_zone_device(folio)) {
>                        if (put_devmap_managed_page(&folio->page))
>                                 continue;
>                         if (folio_put_testzero(folio))
>                                 free_zone_device_page(&folio->page);
>                         continue;
>                 }

Must go at least here, maybe earlier.

>                 if (!folio_put_testzero(folio))
>                         continue;
>                 if (folio_test_hugetlb(folio)) {
>                        free_huge_folio(folio);
>                         continue;
>                 }
> 
>                 if (j != i)
>                         folios->folios[j] = folio;
>                 j++;
>         }
> 
>         folios->nr = j;
>         if (!j)
>                 return;
> 	release_unref_folios(folios);
> }
> 
> and pfn_range_put() also becomes shorter and loses all the lruvec work.
> 
> Thoughts?
>
Ryan Roberts Aug. 31, 2023, 7:57 p.m. UTC | #3
On 30/08/2023 20:11, Matthew Wilcox wrote:
> On Wed, Aug 30, 2023 at 10:50:10AM +0100, Ryan Roberts wrote:
>> In preparation for implementing folios_put_refs() in the next patch,
>> refactor release_pages() into a set of helper functions, which can be
>> reused. The primary difference between release_pages() and
>> folios_put_refs() is how they iterate over the set of folios. The
>> per-folio actions are identical.
> 
> As you noted, we have colliding patchsets.  I'm not hugely happy with
> how patch 4 turned out, 

Could you describe the issues as you see them? I'm keen not to repeat the same
bad patterns in future.

> so I thought I'd send some addendum patches to
> my RFC series that implement pfn_range_put() (maybe should have been
> pfn_ranges_put()?) on top of my patch series.  I think it's a bit nicer,
> but not quite as nice as it could be.
> 
> I'm thinking about doing ...
> 
> void release_unref_folios(struct folio_batch *folios)
> {
> 	struct lruvec *lruvec = NULL;
> 	unsigned long flags = 0;
> 	int i;
> 
> 	for (i = 0; i < folios->nr; i++) {
> 		struct folio *folio = folios->folios[i];
> 		free_swap_cache(folio);

Agree this can't go here - would put it in pfn_range_put(). But would not want
it in folios_put() as you suggested in the other email - that would surely change
the behaviour of folios_put()?

> 		__page_cache_release(folio, &lruvec, &flags);
> 	}

I think you would need to add:

	if (lruvec)
		unlock_page_lruvec_irqrestore(lruvec, flags);

> 	mem_cgroup_uncharge_folios(folios);
> 	free_unref_folios(folios);
> }
> 
> then this becomes:
> 
> void folios_put(struct folio_batch *folios)
> {
>         int i, j;
> 
>         for (i = 0, j = 0; i < folios->nr; i++) {
>                 struct folio *folio = folios->folios[i];
> 
>                 if (is_huge_zero_page(&folio->page))
>                         continue;
>                 if (folio_is_zone_device(folio)) {
>                        if (put_devmap_managed_page(&folio->page))
>                                 continue;
>                         if (folio_put_testzero(folio))
>                                 free_zone_device_page(&folio->page);
>                         continue;
>                 }
> 
>                 if (!folio_put_testzero(folio))
>                         continue;
>                 if (folio_test_hugetlb(folio)) {
>                        free_huge_folio(folio);
>                         continue;
>                 }
> 
>                 if (j != i)
>                         folios->folios[j] = folio;
>                 j++;
>         }
> 
>         folios->nr = j;
>         if (!j)
>                 return;
> 	release_unref_folios(folios);
> }
> 
> and pfn_range_put() also becomes shorter and loses all the lruvec work.

something like this?

void pfn_range_put(struct pfn_range *range, unsigned int nr)
{
	struct folio_batch folios;
	unsigned int i;

	folio_batch_init(&folios);
	for (i = 0; i < nr; i++) {
		struct folio *folio = pfn_folio(range[i].start);
		unsigned int refs = range[i].end - range[i].start;

		free_swap_cache(folio); // <<<<< HERE? To be equivalent to
					//       free_pages_and_swap_cache()

		if (is_huge_zero_page(&folio->page))
			continue;

		if (folio_is_zone_device(folio)) {
			if (put_devmap_managed_page_refs(&folio->page, refs))
				continue;
			if (folio_ref_sub_and_test(folio, refs))
				free_zone_device_page(&folio->page);
			continue;
		}

		if (!folio_ref_sub_and_test(folio, refs))
			continue;

		/* hugetlb has its own memcg */
		if (folio_test_hugetlb(folio)) {
			free_huge_folio(folio);
			continue;
		}

		if (folio_batch_add(&folios, folio) == 0)
			release_unref_folios(&folios);
	}

	if (folios.nr)
		release_unref_folios(&folios);
}

> 
> Thoughts?

Looks like it's getting there to me. The bulk of the logic inside the loop is
still common between folios_put() and pfn_range_put(), though; perhaps we can
have a common helper for that, which would just need to take (folio, refs)?
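
Something like this, maybe? Just a sketch - the name is made up, and it simply
pulls the per-folio checks we both already have into one place:

	/*
	 * Hypothetical helper: returns true if the folio should be batched
	 * up for release_unref_folios(), false if it was handled here (or
	 * still has references remaining).
	 */
	static bool folio_put_refs_testzero(struct folio *folio, unsigned int refs)
	{
		if (is_huge_zero_page(&folio->page))
			return false;

		if (folio_is_zone_device(folio)) {
			if (put_devmap_managed_page_refs(&folio->page, refs))
				return false;
			if (folio_ref_sub_and_test(folio, refs))
				free_zone_device_page(&folio->page);
			return false;
		}

		if (!folio_ref_sub_and_test(folio, refs))
			return false;

		/* hugetlb has its own memcg */
		if (folio_test_hugetlb(folio)) {
			free_huge_folio(folio);
			return false;
		}

		return true;
	}

folios_put() would call it with refs == 1 and pfn_range_put() with the
per-range refcount, and both loops reduce to batching the folios for which it
returns true.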

So by adding free_swap_cache() above, we can call pfn_range_put() directly and
remove free_pages_and_swap_cache() entirely.

What's the best way to move forward here? Two options as I see it:

 - I drop patch 4 and create a version of pfn_range_put() that has the same
semantics as above but essentially just copies the old release_pages() (similar
to my v1). That keeps my series independent. Then you can replace with the new
pfn_range_put() as part of your series.

 - We can just hook my patches up to the end of your series and do it all in one go.

Opinion?

Thanks,
Ryan
Matthew Wilcox (Oracle) Sept. 1, 2023, 4:36 a.m. UTC | #4
On Thu, Aug 31, 2023 at 08:57:51PM +0100, Ryan Roberts wrote:
> On 30/08/2023 20:11, Matthew Wilcox wrote:
> > On Wed, Aug 30, 2023 at 10:50:10AM +0100, Ryan Roberts wrote:
> >> In preparation for implementing folios_put_refs() in the next patch,
> >> refactor release_pages() into a set of helper functions, which can be
> >> reused. The primary difference between release_pages() and
> >> folios_put_refs() is how they iterate over the set of folios. The
> >> per-folio actions are identical.
> > 
> > As you noted, we have colliding patchsets.  I'm not hugely happy with
> > how patch 4 turned out, 
> 
> Could you describe the issues as you see them? I'm keen not to repeat the same
> bad patterns in future.

I think you absolutely had the right idea.  I'd've probably done the same
as you in the same circumstances.  It's just that I'd done this
patch series getting rid of / simplifying a bunch of the code you were
modifying, and I thought it'd make your 4/5 irrelevant and 5/5 simpler.

> > void release_unref_folios(struct folio_batch *folios)
> > {
> > 	struct lruvec *lruvec = NULL;
> > 	unsigned long flags = 0;
> > 	int i;
> > 
> > 	for (i = 0; i < folios->nr; i++) {
> > 		struct folio *folio = folios->folios[i];
> > 		free_swap_cache(folio);
> 
> Agree this can't go here - would put it in pfn_range_put(). But would not want
> it in folios_put() as you suggested in the other email - that would surely change
> the behaviour of folios_put()?

yes; perhaps there are times when we want to put a batch of folios and
_don't_ want to rip them out of the swapcache.  I haven't thought about
this in much detail, and we're definitely venturing into areas of the
MM where I get myself in trouble (ie anonymous memory).

> > 		__page_cache_release(folio, &lruvec, &flags);
> > 	}
> 
> I think you would need to add:
> 
> 	if (lruvec)
> 		unlock_page_lruvec_irqrestore(lruvec, flags);

Indeed.
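
For completeness, with that unlock folded in and the free_swap_cache() call
dropped, the helper would look something like this (a sketch on top of my RFC
series, using the __page_cache_release() / free_unref_folios() forms from it):

	void release_unref_folios(struct folio_batch *folios)
	{
		struct lruvec *lruvec = NULL;
		unsigned long flags = 0;
		int i;

		for (i = 0; i < folios->nr; i++)
			__page_cache_release(folios->folios[i], &lruvec, &flags);

		/* Drop any lruvec lock still held from the loop above. */
		if (lruvec)
			unlock_page_lruvec_irqrestore(lruvec, flags);

		mem_cgroup_uncharge_folios(folios);
		free_unref_folios(folios);
	}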

> > 	mem_cgroup_uncharge_folios(folios);
> > 	free_unref_folios(folios);
> > }
> > 
> > then this becomes:
> > 
> > void folios_put(struct folio_batch *folios)
> > {
> >         int i, j;
> > 
> >         for (i = 0, j = 0; i < folios->nr; i++) {
> >                 struct folio *folio = folios->folios[i];
> > 
> >                 if (is_huge_zero_page(&folio->page))
> >                         continue;
> >                 if (folio_is_zone_device(folio)) {
> >                        if (put_devmap_managed_page(&folio->page))
> >                                 continue;
> >                         if (folio_put_testzero(folio))
> >                                 free_zone_device_page(&folio->page);
> >                         continue;
> >                 }
> > 
> >                 if (!folio_put_testzero(folio))
> >                         continue;
> >                 if (folio_test_hugetlb(folio)) {
> >                        free_huge_folio(folio);
> >                         continue;
> >                 }
> > 
> >                 if (j != i)
> >                         folios->folios[j] = folio;
> >                 j++;
> >         }
> > 
> >         folios->nr = j;
> >         if (!j)
> >                 return;
> > 	release_unref_folios(folios);
> > }
> > 
> > and pfn_range_put() also becomes shorter and loses all the lruvec work.
> 
> something like this?
> 
> void pfn_range_put(struct pfn_range *range, unsigned int nr)
> {
> 	struct folio_batch folios;
> 	unsigned int i;
> 
> 	folio_batch_init(&folios);
> 	for (i = 0; i < nr; i++) {
> 		struct folio *folio = pfn_folio(range[i].start);
> 		unsigned int refs = range[i].end - range[i].start;
> 
> 		free_swap_cache(folio); // <<<<< HERE? To be equivalent to
> 					//       free_pages_and_swap_cache()
> 
> 		if (is_huge_zero_page(&folio->page))
> 			continue;
> 
> 		if (folio_is_zone_device(folio)) {
> 			if (put_devmap_managed_page_refs(&folio->page, refs))
> 				continue;
> 			if (folio_ref_sub_and_test(folio, refs))
> 				free_zone_device_page(&folio->page);
> 			continue;
> 		}
> 
> 		if (!folio_ref_sub_and_test(folio, refs))
> 			continue;
> 
> 		/* hugetlb has its own memcg */
> 		if (folio_test_hugetlb(folio)) {
> 			free_huge_folio(folio);
> 			continue;
> 		}
> 
> 		if (folio_batch_add(&folios, folio) == 0)
> 			release_unref_folios(&folios);
> 	}
> 
> 	if (folios.nr)
> 		release_unref_folios(&folios);
> }
> 
> > 
> > Thoughts?
> 
> Looks like it's getting there to me. The bulk of the logic inside the loop is
> still common between folios_put() and pfn_range_put(), though; perhaps we can
> have a common helper for that, which would just need to take (folio, refs)?
> 
> So by adding free_swap_cache() above, we can call pfn_range_put() directly and
> remove free_pages_and_swap_cache() entirely.

Yes.  Maybe this works?  I'd like it to!

> What's the best way to move forward here? Two options as I see it:
> 
>  - I drop patch 4 and create a version of pfn_range_put() that has the same
> semantics as above but essentially just copies the old release_pages() (similar
> to my v1). That keeps my series independent. Then you can replace with the new
> pfn_range_put() as part of your series.
> 
>  - We can just hook my patches up to the end of your series and do it all in one go.
> 
> Opinion?

I'm reluctant to tell you "hey, delay until after this series of mine".
We don't have good data yet on whether my patch series helps or hurts.
Plus I'm about to take a few days off (I'll be back for next Wednesday's
meeting).  I think your first option is better (and do feel free to use
a different name from pfn_range_put()) just to keep us from colliding
in ways that ultimately don't matter.

Patch

diff --git a/mm/swap.c b/mm/swap.c
index b05cce475202..5d3e35668929 100644
--- a/mm/swap.c
+++ b/mm/swap.c
@@ -945,6 +945,98 @@  void lru_cache_disable(void)
 #endif
 }
 
+struct folios_put_refs_ctx {
+	struct list_head pages_to_free;
+	struct lruvec *lruvec;
+	unsigned long flags;
+	unsigned int lock_batch;
+};
+
+static void __folios_put_refs_init(struct folios_put_refs_ctx *ctx)
+{
+	*ctx = (struct folios_put_refs_ctx) {
+		.pages_to_free = LIST_HEAD_INIT(ctx->pages_to_free),
+		.lruvec = NULL,
+		.flags = 0,
+	};
+}
+
+static void __folios_put_refs_complete(struct folios_put_refs_ctx *ctx)
+{
+	if (ctx->lruvec)
+		unlock_page_lruvec_irqrestore(ctx->lruvec, ctx->flags);
+
+	mem_cgroup_uncharge_list(&ctx->pages_to_free);
+	free_unref_page_list(&ctx->pages_to_free);
+}
+
+static void __folios_put_refs_do_one(struct folios_put_refs_ctx *ctx,
+					struct folio *folio, int refs)
+{
+	/*
+	 * Make sure the IRQ-safe lock-holding time does not get
+	 * excessive with a continuous string of pages from the
+	 * same lruvec. The lock is held only if lruvec != NULL.
+	 */
+	if (ctx->lruvec && ++ctx->lock_batch == SWAP_CLUSTER_MAX) {
+		unlock_page_lruvec_irqrestore(ctx->lruvec, ctx->flags);
+		ctx->lruvec = NULL;
+	}
+
+	if (is_huge_zero_page(&folio->page))
+		return;
+
+	if (folio_is_zone_device(folio)) {
+		if (ctx->lruvec) {
+			unlock_page_lruvec_irqrestore(ctx->lruvec, ctx->flags);
+			ctx->lruvec = NULL;
+		}
+		if (put_devmap_managed_page_refs(&folio->page, refs))
+			return;
+		if (folio_ref_sub_and_test(folio, refs))
+			free_zone_device_page(&folio->page);
+		return;
+	}
+
+	if (!folio_ref_sub_and_test(folio, refs))
+		return;
+
+	if (folio_test_large(folio)) {
+		if (ctx->lruvec) {
+			unlock_page_lruvec_irqrestore(ctx->lruvec, ctx->flags);
+			ctx->lruvec = NULL;
+		}
+		__folio_put_large(folio);
+		return;
+	}
+
+	if (folio_test_lru(folio)) {
+		struct lruvec *prev_lruvec = ctx->lruvec;
+
+		ctx->lruvec = folio_lruvec_relock_irqsave(folio, ctx->lruvec,
+								&ctx->flags);
+		if (prev_lruvec != ctx->lruvec)
+			ctx->lock_batch = 0;
+
+		lruvec_del_folio(ctx->lruvec, folio);
+		__folio_clear_lru_flags(folio);
+	}
+
+	/*
+	 * In rare cases, when truncation or holepunching raced with
+	 * munlock after VM_LOCKED was cleared, Mlocked may still be
+	 * found set here.  This does not indicate a problem, unless
+	 * "unevictable_pgs_cleared" appears worryingly large.
+	 */
+	if (unlikely(folio_test_mlocked(folio))) {
+		__folio_clear_mlocked(folio);
+		zone_stat_sub_folio(folio, NR_MLOCK);
+		count_vm_event(UNEVICTABLE_PGCLEARED);
+	}
+
+	list_add(&folio->lru, &ctx->pages_to_free);
+}
+
 /**
  * release_pages - batched put_page()
  * @arg: array of pages to release
@@ -959,10 +1051,9 @@  void release_pages(release_pages_arg arg, int nr)
 {
 	int i;
 	struct page **pages = arg.pages;
-	LIST_HEAD(pages_to_free);
-	struct lruvec *lruvec = NULL;
-	unsigned long flags = 0;
-	unsigned int lock_batch;
+	struct folios_put_refs_ctx ctx;
+
+	__folios_put_refs_init(&ctx);
 
 	for (i = 0; i < nr; i++) {
 		struct folio *folio;
@@ -970,74 +1061,10 @@  void release_pages(release_pages_arg arg, int nr)
 		/* Turn any of the argument types into a folio */
 		folio = page_folio(pages[i]);
 
-		/*
-		 * Make sure the IRQ-safe lock-holding time does not get
-		 * excessive with a continuous string of pages from the
-		 * same lruvec. The lock is held only if lruvec != NULL.
-		 */
-		if (lruvec && ++lock_batch == SWAP_CLUSTER_MAX) {
-			unlock_page_lruvec_irqrestore(lruvec, flags);
-			lruvec = NULL;
-		}
-
-		if (is_huge_zero_page(&folio->page))
-			continue;
-
-		if (folio_is_zone_device(folio)) {
-			if (lruvec) {
-				unlock_page_lruvec_irqrestore(lruvec, flags);
-				lruvec = NULL;
-			}
-			if (put_devmap_managed_page(&folio->page))
-				continue;
-			if (folio_put_testzero(folio))
-				free_zone_device_page(&folio->page);
-			continue;
-		}
-
-		if (!folio_put_testzero(folio))
-			continue;
-
-		if (folio_test_large(folio)) {
-			if (lruvec) {
-				unlock_page_lruvec_irqrestore(lruvec, flags);
-				lruvec = NULL;
-			}
-			__folio_put_large(folio);
-			continue;
-		}
-
-		if (folio_test_lru(folio)) {
-			struct lruvec *prev_lruvec = lruvec;
-
-			lruvec = folio_lruvec_relock_irqsave(folio, lruvec,
-									&flags);
-			if (prev_lruvec != lruvec)
-				lock_batch = 0;
-
-			lruvec_del_folio(lruvec, folio);
-			__folio_clear_lru_flags(folio);
-		}
-
-		/*
-		 * In rare cases, when truncation or holepunching raced with
-		 * munlock after VM_LOCKED was cleared, Mlocked may still be
-		 * found set here.  This does not indicate a problem, unless
-		 * "unevictable_pgs_cleared" appears worryingly large.
-		 */
-		if (unlikely(folio_test_mlocked(folio))) {
-			__folio_clear_mlocked(folio);
-			zone_stat_sub_folio(folio, NR_MLOCK);
-			count_vm_event(UNEVICTABLE_PGCLEARED);
-		}
-
-		list_add(&folio->lru, &pages_to_free);
+		__folios_put_refs_do_one(&ctx, folio, 1);
 	}
-	if (lruvec)
-		unlock_page_lruvec_irqrestore(lruvec, flags);
 
-	mem_cgroup_uncharge_list(&pages_to_free);
-	free_unref_page_list(&pages_to_free);
+	__folios_put_refs_complete(&ctx);
 }
 EXPORT_SYMBOL(release_pages);