
[GIT,PULL] percpu fixes for v5.14-rc1

Message ID: YOWld9O5CZpzOUKA@google.com

Pull-request

git://git.kernel.org/pub/scm/linux/kernel/git/dennis/percpu.git for-5.14-fixes

Commit Message

Dennis Zhou July 7, 2021, 1 p.m. UTC
Hi Linus,

This is just a single change to fix percpu depopulation. The
depopulation path reused code written specifically for the free path,
which relied on vmalloc to do the tlb flush lazily. Since we are now
modifying the backing pages during the lifetime of a chunk, we need to
flush the tlb explicitly as well.

Guenter Roeck reported this issue on mips in [1]. I believe we have
simply been lucky so far given the much larger chunk sizes on x86 and,
consequently, less churning of this memory.

[1] https://lore.kernel.org/lkml/20210702191140.GA3166599@roeck-us.net/
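
For context, percpu-vm.c already provides the flush helper; the
depopulate path just never called it and instead deferred to vmalloc's
lazy flushing. It looks roughly like this (sketch for illustration,
not part of this diff; see percpu-vm.c for the real thing):

  /* sketch: flush the unmapped range across all units of the chunk */
  static void pcpu_post_unmap_tlb_flush(struct pcpu_chunk *chunk,
                                        int page_start, int page_end)
  {
          flush_tlb_kernel_range(
                  pcpu_chunk_addr(chunk, pcpu_low_unit_cpu, page_start),
                  pcpu_chunk_addr(chunk, pcpu_high_unit_cpu, page_end));
  }

The patch below wires this up in pcpu_reclaim_populated() and adds a
no-op stub for the percpu-km.c backend, which backs chunks with
directly mapped pages and therefore has nothing to flush.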

Thanks,
Dennis

The following changes since commit d6b63b5b7d7f363c6a54421533791e9849adf2e0:

  Merge tag 'sound-5.14-rc1' of git://git.kernel.org/pub/scm/linux/kernel/git/tiwai/sound (2021-07-02 15:25:23 -0700)

are available in the Git repository at:

  git://git.kernel.org/pub/scm/linux/kernel/git/dennis/percpu.git for-5.14-fixes

for you to fetch changes up to 93274f1dd6b0a615b299beddf99871fe81f91275:

  percpu: flush tlb in pcpu_reclaim_populated() (2021-07-04 18:30:17 +0000)

----------------------------------------------------------------
Dennis Zhou (1):
      percpu: flush tlb in pcpu_reclaim_populated()

 mm/percpu-km.c |  6 ++++++
 mm/percpu-vm.c |  5 +++--
 mm/percpu.c    | 32 ++++++++++++++++++++++++++------
 3 files changed, 35 insertions(+), 8 deletions(-)

Comments

Linus Torvalds July 7, 2021, 6:40 p.m. UTC | #1
On Wed, Jul 7, 2021 at 6:00 AM Dennis Zhou <dennis@kernel.org> wrote:
>
> This is just a single change to fix percpu depopulation. The
> depopulation path reused code written specifically for the free path,
> which relied on vmalloc to do the tlb flush lazily. Since we are now
> modifying the backing pages during the lifetime of a chunk, we need to
> flush the tlb explicitly as well.

I pulled this, but I ended up unpulling after looking at the fix.

The fix may be perfectly correct, but I'm looking at that
pcpu_reclaim_populated() function, and I want somebody to explain to
me why it's ok to drop and re-take the 'pcpu_lock' and just continue.

Because whatever it was protecting is now not protected any more.

It *looks* like it's intended to protect the pcpu_chunk_lists[]
content, and some other functions that do this look ok. So for
example, pcpu_balance_free() at least removes the 'chunk' from the
pcpu_chunk_lists[] before it drops the lock and then works on the
chunk contents.

But pcpu_reclaim_populated() seems to *leave* chunk on the
pcpu_chunk_lists[], drop the lock, and then continue to use 'chunk'.

That odd "release lock and continue to use the data it's supposed to
protect" seems to be pre-existing, but

 (a) this is the code that caused problems to begin with

and

 (b) it seems to now happen even more.

So maybe this code is right. But it looks very odd to me, and I'd like
to get more explanations of _why_ it would be ok before I pull this
fix, since there seems to be a deeper underlying problem in the code
that this tries to fix.

                 Linus
Dennis Zhou July 7, 2021, 9:06 p.m. UTC | #2
On Wed, Jul 07, 2021 at 11:40:45AM -0700, Linus Torvalds wrote:
> On Wed, Jul 7, 2021 at 6:00 AM Dennis Zhou <dennis@kernel.org> wrote:
> >
> > This is just a single change to fix percpu depopulation. The
> > depopulation path reused code written specifically for the free path,
> > which relied on vmalloc to do the tlb flush lazily. Since we are now
> > modifying the backing pages during the lifetime of a chunk, we need to
> > flush the tlb explicitly as well.
> 
> I pulled this, but I ended up unpulling after looking at the fix.
> 
> The fix may be perfectly correct, but I'm looking at that
> pcpu_reclaim_populated() function, and I want somebody to explain to
> me why it's ok to drop and re-take the 'pcpu_lock' and just continue.
> 
> Because whatever it was protecting is now not protected any more.
> 
> It *looks* like it's intended to protect the pcpu_chunk_lists[]
> content, and some other functions that do this look ok. So for
> example, pcpu_balance_free() at least removes the 'chunk' from the
> pcpu_chunk_lists[] before it drops the lock and then works on the
> chunk contents.
> 
> But pcpu_reclaim_populated() seems to *leave* chunk on the
> pcpu_chunk_lists[], drop the lock, and then continue to use 'chunk'.
> 
> That odd "release lock and continue to use the data it's supposed to
> protect" seems to be pre-existing, but
> 
>  (a) this is the code that caused problems to begin with
> 
> and
> 
>  (b) it seems to now happen even more.
> 
> So maybe this code is right. But it looks very odd to me, and I'd like
> to get more explanations of _why_ it would be ok before I pull this
> fix, since there seems to be a deeper underlying problem in the code
> that this tries to fix.
> 

Yeah I agree. I think I've inadvertently made this more complex than it
needs to be and intend to clean it up.

Percpu has a mutex lock and a spinlock. The mutex lock is responsible
for the lifetime of a chunk and the correctness of the backing pages,
chunk->populated[]. The spinlock protects the other members of the chunk
as well as pcpu_chunk_lists[].
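
In sketch form (illustrative only, not verbatim kernel code):

  mutex_lock(&pcpu_alloc_mutex);  /* chunk lifetime, chunk->populated[] */
  spin_lock_irq(&pcpu_lock);      /* pcpu_chunk_lists[], other chunk members */
  ...
  spin_unlock_irq(&pcpu_lock);
  mutex_unlock(&pcpu_alloc_mutex);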

The pcpu_balance_workfn() is used to serialize the freeing of chunks and
the maintenance of a floating number of pages to satisfy atomic
allocations. This is where depopulation is being bolted on. It uses the
serialization of:
  mutex_lock(&pcpu_alloc_mutex);
  pcpu_balance_free()
  pcpu_reclaim_populated()
  pcpu_balance_populated()
  mutex_unlock(&pcpu_alloc_mutex);

all while holding the mutex lock, which guarantees that the chunk we are
modifying will not be freed. It's this that makes it okay to drop the
spinlock, pick it back up, and just continue.
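
Concretely, the drop/re-take in pcpu_reclaim_populated() is roughly
this (sketch, simplified from the patch below):

  /*
   * pcpu_alloc_mutex is held for all of pcpu_balance_workfn(), so
   * 'chunk' cannot be freed while the spinlock is dropped.
   */
  spin_unlock_irq(&pcpu_lock);
  pcpu_depopulate_chunk(chunk, i + 1, end + 1);   /* may sleep */
  cond_resched();
  spin_lock_irq(&pcpu_lock);
  pcpu_chunk_depopulated(chunk, i + 1, end + 1);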

Depopulation adds complexity to the lifecycle of a chunk because others
may be freeing percpu allocations back to it. So, unlike in
pcpu_balance_free(), taking the chunk off of pcpu_chunk_lists[] doesn't
guarantee someone else isn't accessing the chunk while freeing an object
back.

To handle this, a notion of isolated chunks is introduced, and these
chunks become effectively hidden from the allocation path. We also hold
the mutex lock throughout this process, which prevents the allocation
path from simultaneously adding backing pages to a chunk.
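
For reference, isolation itself looks roughly like this (sketch of
pcpu_isolate_chunk() in mm/percpu.c; illustrative):

  static void pcpu_isolate_chunk(struct pcpu_chunk *chunk)
  {
          lockdep_assert_held(&pcpu_lock);

          if (!chunk->isolated) {
                  chunk->isolated = true;
                  /* these pages no longer back atomic allocations */
                  pcpu_nr_empty_pop_pages -= chunk->nr_empty_pop_pages;
          }
          /* off the normal slots: pcpu_alloc() can no longer find it */
          list_move(&chunk->list,
                    &pcpu_chunk_lists[pcpu_to_depopulate_slot]);
  }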

I hope this explanation makes sense.

Thanks,
Dennis
pr-tracker-bot@kernel.org July 10, 2021, 5:35 p.m. UTC | #3
The pull request you sent on Wed, 7 Jul 2021 13:00:39 +0000:

> git://git.kernel.org/pub/scm/linux/kernel/git/dennis/percpu.git for-5.14-fixes

has been merged into torvalds/linux.git:
https://git.kernel.org/torvalds/c/20d5e570aee77afa44849dc652ff256290ea978e

Thank you!

Patch

diff --git a/mm/percpu-km.c b/mm/percpu-km.c
index c9d529dc7651..fe31aa19db81 100644
--- a/mm/percpu-km.c
+++ b/mm/percpu-km.c
@@ -32,6 +32,12 @@ 
 
 #include <linux/log2.h>
 
+static void pcpu_post_unmap_tlb_flush(struct pcpu_chunk *chunk,
+				      int page_start, int page_end)
+{
+	/* nothing */
+}
+
 static int pcpu_populate_chunk(struct pcpu_chunk *chunk,
 			       int page_start, int page_end, gfp_t gfp)
 {
diff --git a/mm/percpu-vm.c b/mm/percpu-vm.c
index ee5d89fcd66f..2054c9213c43 100644
--- a/mm/percpu-vm.c
+++ b/mm/percpu-vm.c
@@ -303,6 +303,9 @@  static int pcpu_populate_chunk(struct pcpu_chunk *chunk,
  * For each cpu, depopulate and unmap pages [@page_start,@page_end)
  * from @chunk.
  *
+ * Caller is required to call pcpu_post_unmap_tlb_flush() if not returning the
+ * region back to vmalloc() which will lazily flush the tlb.
+ *
  * CONTEXT:
  * pcpu_alloc_mutex.
  */
@@ -324,8 +327,6 @@  static void pcpu_depopulate_chunk(struct pcpu_chunk *chunk,
 
 	pcpu_unmap_pages(chunk, pages, page_start, page_end);
 
-	/* no need to flush tlb, vmalloc will handle it lazily */
-
 	pcpu_free_pages(chunk, pages, page_start, page_end);
 }
 
diff --git a/mm/percpu.c b/mm/percpu.c
index b4cebeca4c0c..7f2e0151c4e2 100644
--- a/mm/percpu.c
+++ b/mm/percpu.c
@@ -1572,6 +1572,7 @@  static void pcpu_chunk_depopulated(struct pcpu_chunk *chunk,
  *
  * pcpu_populate_chunk		- populate the specified range of a chunk
  * pcpu_depopulate_chunk	- depopulate the specified range of a chunk
+ * pcpu_post_unmap_tlb_flush	- flush tlb for the specified range of a chunk
  * pcpu_create_chunk		- create a new chunk
  * pcpu_destroy_chunk		- destroy a chunk, always preceded by full depop
  * pcpu_addr_to_page		- translate address to physical address
@@ -1581,6 +1582,8 @@  static int pcpu_populate_chunk(struct pcpu_chunk *chunk,
 			       int page_start, int page_end, gfp_t gfp);
 static void pcpu_depopulate_chunk(struct pcpu_chunk *chunk,
 				  int page_start, int page_end);
+static void pcpu_post_unmap_tlb_flush(struct pcpu_chunk *chunk,
+				      int page_start, int page_end);
 static struct pcpu_chunk *pcpu_create_chunk(gfp_t gfp);
 static void pcpu_destroy_chunk(struct pcpu_chunk *chunk);
 static struct page *pcpu_addr_to_page(void *addr);
@@ -2137,11 +2140,12 @@  static void pcpu_reclaim_populated(void)
 {
 	struct pcpu_chunk *chunk;
 	struct pcpu_block_md *block;
+	int freed_page_start, freed_page_end;
 	int i, end;
+	bool reintegrate;
 
 	lockdep_assert_held(&pcpu_lock);
 
-restart:
 	/*
 	 * Once a chunk is isolated to the to_depopulate list, the chunk is no
 	 * longer discoverable to allocations whom may populate pages.  The only
@@ -2157,6 +2161,9 @@  static void pcpu_reclaim_populated(void)
 		 * Scan chunk's pages in the reverse order to keep populated
 		 * pages close to the beginning of the chunk.
 		 */
+		freed_page_start = chunk->nr_pages;
+		freed_page_end = 0;
+		reintegrate = false;
 		for (i = chunk->nr_pages - 1, end = -1; i >= 0; i--) {
 			/* no more work to do */
 			if (chunk->nr_empty_pop_pages == 0)
@@ -2164,8 +2171,8 @@  static void pcpu_reclaim_populated(void)
 
 			/* reintegrate chunk to prevent atomic alloc failures */
 			if (pcpu_nr_empty_pop_pages < PCPU_EMPTY_POP_PAGES_HIGH) {
-				pcpu_reintegrate_chunk(chunk);
-				goto restart;
+				reintegrate = true;
+				goto end_chunk;
 			}
 
 			/*
@@ -2194,16 +2201,29 @@  static void pcpu_reclaim_populated(void)
 			spin_lock_irq(&pcpu_lock);
 
 			pcpu_chunk_depopulated(chunk, i + 1, end + 1);
+			freed_page_start = min(freed_page_start, i + 1);
+			freed_page_end = max(freed_page_end, end + 1);
 
 			/* reset the range and continue */
 			end = -1;
 		}
 
-		if (chunk->free_bytes == pcpu_unit_size)
+end_chunk:
+		/* batch tlb flush per chunk to amortize cost */
+		if (freed_page_start < freed_page_end) {
+			spin_unlock_irq(&pcpu_lock);
+			pcpu_post_unmap_tlb_flush(chunk,
+						  freed_page_start,
+						  freed_page_end);
+			cond_resched();
+			spin_lock_irq(&pcpu_lock);
+		}
+
+		if (reintegrate || chunk->free_bytes == pcpu_unit_size)
 			pcpu_reintegrate_chunk(chunk);
 		else
-			list_move(&chunk->list,
-				  &pcpu_chunk_lists[pcpu_sidelined_slot]);
+			list_move_tail(&chunk->list,
+				       &pcpu_chunk_lists[pcpu_sidelined_slot]);
 	}
 }