
[bpf-next,v5,2/7] mm, bpf: Introduce free_pages_nolock()

Message ID 20250115021746.34691-3-alexei.starovoitov@gmail.com (mailing list archive)
State New
Series bpf, mm: Introduce try_alloc_pages()

Commit Message

Alexei Starovoitov Jan. 15, 2025, 2:17 a.m. UTC
From: Alexei Starovoitov <ast@kernel.org>

Introduce free_pages_nolock() that can free pages without taking locks.
It relies on trylock and can be called from any context.
Since spin_trylock() cannot be used in RT from hard IRQ or NMI
it uses a lockless linked list to stash the pages, which will be freed
by a subsequent free_pages() from a good context.

Do not use the llist unconditionally. BPF maps continuously
allocate/free, so we cannot unconditionally delay the freeing to
the llist. When the memory becomes free, make it available to the
kernel and BPF users right away if possible, and fall back to
the llist as a last resort.

Signed-off-by: Alexei Starovoitov <ast@kernel.org>
---
 include/linux/gfp.h      |  1 +
 include/linux/mm_types.h |  4 ++
 include/linux/mmzone.h   |  3 ++
 mm/page_alloc.c          | 79 ++++++++++++++++++++++++++++++++++++----
 4 files changed, 79 insertions(+), 8 deletions(-)
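
For orientation, a minimal usage sketch (hypothetical caller, not part of
this series; it assumes the try_alloc_pages(nid, order) helper from patch 1):

/* Hypothetical caller: pairs try_alloc_pages() from patch 1 with
 * free_pages_nolock().  Both are usable from NMI/IRQ or while holding a
 * raw_spin_lock; the allocation may simply fail and the free may defer
 * the page to the per-zone llist instead of blocking.
 */
static void *grab_scratch_page(int nid)
{
	struct page *page = try_alloc_pages(nid, 0);	/* order-0, may return NULL */

	return page ? page_address(page) : NULL;
}

static void drop_scratch_page(void *addr)
{
	if (addr)
		free_pages_nolock(virt_to_page(addr), 0);	/* never blocks */
}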

Comments

Vlastimil Babka Jan. 15, 2025, 11:47 a.m. UTC | #1
On 1/15/25 03:17, Alexei Starovoitov wrote:
> From: Alexei Starovoitov <ast@kernel.org>
> 
> Introduce free_pages_nolock() that can free pages without taking locks.
> It relies on trylock and can be called from any context.
> Since spin_trylock() cannot be used in RT from hard IRQ or NMI
> it uses lockless link list to stash the pages which will be freed
> by subsequent free_pages() from good context.
> 
> Do not use llist unconditionally. BPF maps continuously
> allocate/free, so we cannot unconditionally delay the freeing to
> llist. When the memory becomes free make it available to the
> kernel and BPF users right away if possible, and fallback to
> llist as the last resort.
> 
> Signed-off-by: Alexei Starovoitov <ast@kernel.org>

Acked-by: Vlastimil Babka <vbabka@suse.cz>

With:

> @@ -4853,6 +4905,17 @@ void __free_pages(struct page *page, unsigned int order)
>  }
>  EXPORT_SYMBOL(__free_pages);
>  
> +/*
> + * Can be called while holding raw_spin_lock or from IRQ and NMI,
> + * but only for pages that came from try_alloc_pages():
> + * order <= 3, !folio, etc

I think order > 3 is fine, as !pcp_allowed_order() case is handled too? And
what does "!folio" mean?

> + */
> +void free_pages_nolock(struct page *page, unsigned int order)
> +{
> +	if (put_page_testzero(page))
> +		__free_unref_page(page, order, FPI_TRYLOCK);

Hmm this will reach reset_page_owner() and thus stackdepot so same mental
note as for patch 1.

> +}
> +
>  void free_pages(unsigned long addr, unsigned int order)
>  {
>  	if (addr != 0) {
Alexei Starovoitov Jan. 15, 2025, 11:15 p.m. UTC | #2
On Wed, Jan 15, 2025 at 3:47 AM Vlastimil Babka <vbabka@suse.cz> wrote:
>
> On 1/15/25 03:17, Alexei Starovoitov wrote:
> > From: Alexei Starovoitov <ast@kernel.org>
> >
> > Introduce free_pages_nolock() that can free pages without taking locks.
> > It relies on trylock and can be called from any context.
> > Since spin_trylock() cannot be used in RT from hard IRQ or NMI
> > it uses lockless link list to stash the pages which will be freed
> > by subsequent free_pages() from good context.
> >
> > Do not use llist unconditionally. BPF maps continuously
> > allocate/free, so we cannot unconditionally delay the freeing to
> > llist. When the memory becomes free make it available to the
> > kernel and BPF users right away if possible, and fallback to
> > llist as the last resort.
> >
> > Signed-off-by: Alexei Starovoitov <ast@kernel.org>
>
> Acked-by: Vlastimil Babka <vbabka@suse.cz>
>
> With:
>
> > @@ -4853,6 +4905,17 @@ void __free_pages(struct page *page, unsigned int order)
> >  }
> >  EXPORT_SYMBOL(__free_pages);
> >
> > +/*
> > + * Can be called while holding raw_spin_lock or from IRQ and NMI,
> > + * but only for pages that came from try_alloc_pages():
> > + * order <= 3, !folio, etc
>
> I think order > 3 is fine, as !pcp_allowed_order() case is handled too?

try_alloc_pages() has:
        if (!pcp_allowed_order(order))
                return NULL;

to make sure it tries pcp first.
bpf has no use for order > 1. Even 3 is overkill,
but it's kinda free to support order <= 3, so why not.
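
For reference, pcp_allowed_order() in mm/page_alloc.c is approximately:

/* Approximate shape of pcp_allowed_order(): per-cpu lists only take
 * cheap orders (PAGE_ALLOC_COSTLY_ORDER is 3) plus the THP order, which
 * is where the "order <= 3" in the comment comes from.
 */
static inline bool pcp_allowed_order(unsigned int order)
{
	if (order <= PAGE_ALLOC_COSTLY_ORDER)
		return true;
#ifdef CONFIG_TRANSPARENT_HUGEPAGE
	if (order == HPAGE_PMD_ORDER)
		return true;
#endif
	return false;
}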

> And
> what does "!folio" mean?

That's what we discussed last year.
__free_pages() has all the extra if (!head) handling and
support for dropping a ref on a middle page.
!folio captures this more broadly.
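
For context, the part of __free_pages() being referred to, simplified
(alloc-tag accounting omitted):

/* Simplified from mm/page_alloc.c: the !head branch only matters for
 * non-compound order > 0 pages, and is exactly the handling that
 * free_pages_nolock() does not carry over.
 */
void __free_pages(struct page *page, unsigned int order)
{
	/* get PageHead before we drop the reference */
	int head = PageHead(page);

	if (put_page_testzero(page))
		free_unref_page(page, order);
	else if (!head)
		while (order-- > 0)
			free_unref_page(page + (1 << order), order);
}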

> > + */
> > +void free_pages_nolock(struct page *page, unsigned int order)
> > +{
> > +     if (put_page_testzero(page))
> > +             __free_unref_page(page, order, FPI_TRYLOCK);
>
> Hmm this will reach reset_page_owner() and thus stackdepot so same mental
> note as for patch 1.

save_stack() has recursion protection already, so this should be fine.
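
For reference, the guard in mm/page_owner.c::save_stack() is roughly:

/* Roughly the recursion protection in save_stack(): a per-task flag
 * short-circuits re-entry when stack depot itself needs page allocator
 * work while page_owner is recording a stack.
 */
static noinline depot_stack_handle_t save_stack(gfp_t flags)
{
	unsigned long entries[PAGE_OWNER_STACK_DEPTH];
	depot_stack_handle_t handle;
	unsigned int nr_entries;

	if (current->in_page_owner)
		return dummy_handle;
	current->in_page_owner = 1;

	nr_entries = stack_trace_save(entries, ARRAY_SIZE(entries), 2);
	handle = stack_depot_save(entries, nr_entries, flags);
	if (!handle)
		handle = failure_handle;

	current->in_page_owner = 0;
	return handle;
}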

> > +}
> > +
> >  void free_pages(unsigned long addr, unsigned int order)
> >  {
> >       if (addr != 0) {
>
Vlastimil Babka Jan. 16, 2025, 8:31 a.m. UTC | #3
On 1/16/25 00:15, Alexei Starovoitov wrote:
> On Wed, Jan 15, 2025 at 3:47 AM Vlastimil Babka <vbabka@suse.cz> wrote:
>>
>> On 1/15/25 03:17, Alexei Starovoitov wrote:
>> > From: Alexei Starovoitov <ast@kernel.org>
>> >
>> > Introduce free_pages_nolock() that can free pages without taking locks.
>> > It relies on trylock and can be called from any context.
>> > Since spin_trylock() cannot be used in RT from hard IRQ or NMI
>> > it uses lockless link list to stash the pages which will be freed
>> > by subsequent free_pages() from good context.
>> >
>> > Do not use llist unconditionally. BPF maps continuously
>> > allocate/free, so we cannot unconditionally delay the freeing to
>> > llist. When the memory becomes free make it available to the
>> > kernel and BPF users right away if possible, and fallback to
>> > llist as the last resort.
>> >
>> > Signed-off-by: Alexei Starovoitov <ast@kernel.org>
>>
>> Acked-by: Vlastimil Babka <vbabka@suse.cz>
>>
>> With:
>>
>> > @@ -4853,6 +4905,17 @@ void __free_pages(struct page *page, unsigned int order)
>> >  }
>> >  EXPORT_SYMBOL(__free_pages);
>> >
>> > +/*
>> > + * Can be called while holding raw_spin_lock or from IRQ and NMI,
>> > + * but only for pages that came from try_alloc_pages():
>> > + * order <= 3, !folio, etc
>>
>> I think order > 3 is fine, as !pcp_allowed_order() case is handled too?
> 
> try_alloc_page() has:
>         if (!pcp_allowed_order(order))
>                 return NULL;

Ah ok I missed that the comment describes what pages try_alloc_pages()
produces, not what's accepted here.

> to make sure it tries pcp first.
> bpf has no use for order > 1. Even 3 is overkill,
> but it's kinda free to support order <= 3, so why not.
> 
>> And
>> what does "!folio" mean?
> 
> That's what we discussed last year.
> __free_pages() has all the extra stuff if (!head) and
> support for dropping ref on the middle page.
> !folio captures this more broadly.

Aha! But in that case I realize we're actually wrong. It needs to be a folio
(compound page) if it's order > 0, in order to drop that tricky
!head freeing code. It's the order > 0 pages that are not compound that are
problematic and should eventually be removed from the kernel.

The solution is to add __GFP_COMP in try_alloc_pages_noprof(). This will
have no effect on order-0 pages that BPF uses.

Instead of "!folio" the comment could then say "compound".
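
A hypothetical sketch of that suggestion (the base flags below are
placeholders, not the actual mask used in patch 1):

/* Hypothetical: OR __GFP_COMP into try_alloc_pages_noprof()'s gfp mask so
 * any order > 0 page comes back compound; it is a no-op at order 0, which
 * is all BPF currently uses.  Base flags here are illustrative only.
 */
static inline gfp_t try_alloc_gfp_mask(void)
{
	return __GFP_NOWARN | __GFP_NOMEMALLOC | __GFP_COMP;
}
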
Sebastian Andrzej Siewior Jan. 17, 2025, 6:20 p.m. UTC | #4
On 2025-01-14 18:17:41 [-0800], Alexei Starovoitov wrote:
> From: Alexei Starovoitov <ast@kernel.org>
> 
> Introduce free_pages_nolock() that can free pages without taking locks.
> It relies on trylock and can be called from any context.
> Since spin_trylock() cannot be used in RT from hard IRQ or NMI
> it uses lockless link list to stash the pages which will be freed
> by subsequent free_pages() from good context.
> 
> Do not use llist unconditionally. BPF maps continuously
> allocate/free, so we cannot unconditionally delay the freeing to
> llist. When the memory becomes free make it available to the
> kernel and BPF users right away if possible, and fallback to
> llist as the last resort.
> 
> Signed-off-by: Alexei Starovoitov <ast@kernel.org>
Acked-by: Sebastian Andrzej Siewior <bigeasy@linutronix.de>

> diff --git a/mm/page_alloc.c b/mm/page_alloc.c
> index 74c2a7af1a77..a9c639e3db91 100644
> --- a/mm/page_alloc.c
> +++ b/mm/page_alloc.c
> @@ -1247,13 +1250,44 @@ static void split_large_buddy(struct zone *zone, struct page *page,
>  static void free_one_page(struct zone *zone, struct page *page,
>  			  unsigned long pfn, unsigned int order,
>  			  fpi_t fpi_flags)
>  {
> +	struct llist_head *llhead;
>  	unsigned long flags;
>  
> -	spin_lock_irqsave(&zone->lock, flags);
> +	if (!spin_trylock_irqsave(&zone->lock, flags)) {
> +		if (unlikely(fpi_flags & FPI_TRYLOCK)) {
> +			add_page_to_zone_llist(zone, page, order);
> +			return;
> +		}
> +		spin_lock_irqsave(&zone->lock, flags);
> +	}
> +
> +	/* The lock succeeded. Process deferred pages. */
> +	llhead = &zone->trylock_free_pages;
> +	if (unlikely(!llist_empty(llhead) && !(fpi_flags & FPI_TRYLOCK))) {
Thank you.

> +		struct llist_node *llnode;
> +		struct page *p, *tmp;
> +
> +		llnode = llist_del_all(llhead);
> +		llist_for_each_entry_safe(p, tmp, llnode, pcp_llist) {
> +			unsigned int p_order = p->order;
> +
> +			split_large_buddy(zone, p, page_to_pfn(p), p_order, fpi_flags);
> +			__count_vm_events(PGFREE, 1 << p_order);
> +		}
> +	}

Sebastian

Patch

diff --git a/include/linux/gfp.h b/include/linux/gfp.h
index b41bb6e01781..6eba2d80feb8 100644
--- a/include/linux/gfp.h
+++ b/include/linux/gfp.h
@@ -391,6 +391,7 @@  __meminit void *alloc_pages_exact_nid_noprof(int nid, size_t size, gfp_t gfp_mas
 	__get_free_pages((gfp_mask) | GFP_DMA, (order))
 
 extern void __free_pages(struct page *page, unsigned int order);
+extern void free_pages_nolock(struct page *page, unsigned int order);
 extern void free_pages(unsigned long addr, unsigned int order);
 
 #define __free_page(page) __free_pages((page), 0)
diff --git a/include/linux/mm_types.h b/include/linux/mm_types.h
index 7361a8f3ab68..52547b3e5fd8 100644
--- a/include/linux/mm_types.h
+++ b/include/linux/mm_types.h
@@ -99,6 +99,10 @@  struct page {
 				/* Or, free page */
 				struct list_head buddy_list;
 				struct list_head pcp_list;
+				struct {
+					struct llist_node pcp_llist;
+					unsigned int order;
+				};
 			};
 			/* See page-flags.h for PAGE_MAPPING_FLAGS */
 			struct address_space *mapping;
diff --git a/include/linux/mmzone.h b/include/linux/mmzone.h
index b36124145a16..1a854e0a9e3b 100644
--- a/include/linux/mmzone.h
+++ b/include/linux/mmzone.h
@@ -953,6 +953,9 @@  struct zone {
 	/* Primarily protects free_area */
 	spinlock_t		lock;
 
+	/* Pages to be freed when next trylock succeeds */
+	struct llist_head	trylock_free_pages;
+
 	/* Write-intensive fields used by compaction and vmstats. */
 	CACHELINE_PADDING(_pad2_);
 
diff --git a/mm/page_alloc.c b/mm/page_alloc.c
index 74c2a7af1a77..a9c639e3db91 100644
--- a/mm/page_alloc.c
+++ b/mm/page_alloc.c
@@ -88,6 +88,9 @@  typedef int __bitwise fpi_t;
  */
 #define FPI_TO_TAIL		((__force fpi_t)BIT(1))
 
+/* Free the page without taking locks. Rely on trylock only. */
+#define FPI_TRYLOCK		((__force fpi_t)BIT(2))
+
 /* prevent >1 _updater_ of zone percpu pageset ->high and ->batch fields */
 static DEFINE_MUTEX(pcp_batch_high_lock);
 #define MIN_PERCPU_PAGELIST_HIGH_FRACTION (8)
@@ -1247,13 +1250,44 @@  static void split_large_buddy(struct zone *zone, struct page *page,
 	}
 }
 
+static void add_page_to_zone_llist(struct zone *zone, struct page *page,
+				   unsigned int order)
+{
+	/* Remember the order */
+	page->order = order;
+	/* Add the page to the free list */
+	llist_add(&page->pcp_llist, &zone->trylock_free_pages);
+}
+
 static void free_one_page(struct zone *zone, struct page *page,
 			  unsigned long pfn, unsigned int order,
 			  fpi_t fpi_flags)
 {
+	struct llist_head *llhead;
 	unsigned long flags;
 
-	spin_lock_irqsave(&zone->lock, flags);
+	if (!spin_trylock_irqsave(&zone->lock, flags)) {
+		if (unlikely(fpi_flags & FPI_TRYLOCK)) {
+			add_page_to_zone_llist(zone, page, order);
+			return;
+		}
+		spin_lock_irqsave(&zone->lock, flags);
+	}
+
+	/* The lock succeeded. Process deferred pages. */
+	llhead = &zone->trylock_free_pages;
+	if (unlikely(!llist_empty(llhead) && !(fpi_flags & FPI_TRYLOCK))) {
+		struct llist_node *llnode;
+		struct page *p, *tmp;
+
+		llnode = llist_del_all(llhead);
+		llist_for_each_entry_safe(p, tmp, llnode, pcp_llist) {
+			unsigned int p_order = p->order;
+
+			split_large_buddy(zone, p, page_to_pfn(p), p_order, fpi_flags);
+			__count_vm_events(PGFREE, 1 << p_order);
+		}
+	}
 	split_large_buddy(zone, page, pfn, order, fpi_flags);
 	spin_unlock_irqrestore(&zone->lock, flags);
 
@@ -2596,7 +2630,7 @@  static int nr_pcp_high(struct per_cpu_pages *pcp, struct zone *zone,
 
 static void free_unref_page_commit(struct zone *zone, struct per_cpu_pages *pcp,
 				   struct page *page, int migratetype,
-				   unsigned int order)
+				   unsigned int order, fpi_t fpi_flags)
 {
 	int high, batch;
 	int pindex;
@@ -2631,6 +2665,14 @@  static void free_unref_page_commit(struct zone *zone, struct per_cpu_pages *pcp,
 	}
 	if (pcp->free_count < (batch << CONFIG_PCP_BATCH_SCALE_MAX))
 		pcp->free_count += (1 << order);
+
+	if (unlikely(fpi_flags & FPI_TRYLOCK)) {
+		/*
+		 * Do not attempt to take a zone lock. Let pcp->count get
+		 * over high mark temporarily.
+		 */
+		return;
+	}
 	high = nr_pcp_high(pcp, zone, batch, free_high);
 	if (pcp->count >= high) {
 		free_pcppages_bulk(zone, nr_pcp_free(pcp, batch, high, free_high),
@@ -2645,7 +2687,8 @@  static void free_unref_page_commit(struct zone *zone, struct per_cpu_pages *pcp,
 /*
  * Free a pcp page
  */
-void free_unref_page(struct page *page, unsigned int order)
+static void __free_unref_page(struct page *page, unsigned int order,
+			      fpi_t fpi_flags)
 {
 	unsigned long __maybe_unused UP_flags;
 	struct per_cpu_pages *pcp;
@@ -2654,7 +2697,7 @@  void free_unref_page(struct page *page, unsigned int order)
 	int migratetype;
 
 	if (!pcp_allowed_order(order)) {
-		__free_pages_ok(page, order, FPI_NONE);
+		__free_pages_ok(page, order, fpi_flags);
 		return;
 	}
 
@@ -2671,24 +2714,33 @@  void free_unref_page(struct page *page, unsigned int order)
 	migratetype = get_pfnblock_migratetype(page, pfn);
 	if (unlikely(migratetype >= MIGRATE_PCPTYPES)) {
 		if (unlikely(is_migrate_isolate(migratetype))) {
-			free_one_page(page_zone(page), page, pfn, order, FPI_NONE);
+			free_one_page(page_zone(page), page, pfn, order, fpi_flags);
 			return;
 		}
 		migratetype = MIGRATE_MOVABLE;
 	}
 
 	zone = page_zone(page);
+	if (IS_ENABLED(CONFIG_PREEMPT_RT) && (in_nmi() || in_hardirq())) {
+		add_page_to_zone_llist(zone, page, order);
+		return;
+	}
 	pcp_trylock_prepare(UP_flags);
 	pcp = pcp_spin_trylock(zone->per_cpu_pageset);
 	if (pcp) {
-		free_unref_page_commit(zone, pcp, page, migratetype, order);
+		free_unref_page_commit(zone, pcp, page, migratetype, order, fpi_flags);
 		pcp_spin_unlock(pcp);
 	} else {
-		free_one_page(zone, page, pfn, order, FPI_NONE);
+		free_one_page(zone, page, pfn, order, fpi_flags);
 	}
 	pcp_trylock_finish(UP_flags);
 }
 
+void free_unref_page(struct page *page, unsigned int order)
+{
+	__free_unref_page(page, order, FPI_NONE);
+}
+
 /*
  * Free a batch of folios
  */
@@ -2777,7 +2829,7 @@  void free_unref_folios(struct folio_batch *folios)
 
 		trace_mm_page_free_batched(&folio->page);
 		free_unref_page_commit(zone, pcp, &folio->page, migratetype,
-				order);
+				       order, FPI_NONE);
 	}
 
 	if (pcp) {
@@ -4853,6 +4905,17 @@  void __free_pages(struct page *page, unsigned int order)
 }
 EXPORT_SYMBOL(__free_pages);
 
+/*
+ * Can be called while holding raw_spin_lock or from IRQ and NMI,
+ * but only for pages that came from try_alloc_pages():
+ * order <= 3, !folio, etc
+ */
+void free_pages_nolock(struct page *page, unsigned int order)
+{
+	if (put_page_testzero(page))
+		__free_unref_page(page, order, FPI_TRYLOCK);
+}
+
 void free_pages(unsigned long addr, unsigned int order)
 {
 	if (addr != 0) {