| Message ID | YOWld9O5CZpzOUKA@google.com |
|---|---|
| State | New |
| Series | [GIT,PULL] percpu fixes for v5.14-rc1 |
On Wed, Jul 7, 2021 at 6:00 AM Dennis Zhou <dennis@kernel.org> wrote:
>
> This is just a single change to fix percpu depopulation. The code relied
> on depopulation code written specifically for the free path and relied
> on vmalloc to do the tlb flush lazily. As we're modifying the backing
> pages during the lifetime of a chunk, we need to also flush the tlb
> accordingly.

I pulled this, but I ended up unpulling after looking at the fix.

The fix may be perfectly correct, but I'm looking at that
pcpu_reclaim_populated() function, and I want somebody to explain to
me why it's ok to drop and re-take the 'pcpu_lock' and just continue.

Because whatever it was protecting is now not protected any more.

It *looks* like it's intended to protect the pcpu_chunk_lists[]
content, and some other functions that do this look ok. So for
example, pcpu_balance_free() at least removes the 'chunk' from the
pcpu_chunk_lists[] before it drops the lock and then works on the
chunk contents.

But pcpu_reclaim_populated() seems to *leave* chunk on the
pcpu_chunk_lists[], drop the lock, and then continue to use 'chunk'.

That odd "release lock and continue to use the data it's supposed to
protect" seems to be pre-existing, but

 (a) this is the code that caused problems to begin with

and

 (b) it seems to now happen even more.

So maybe this code is right. But it looks very odd to me, and I'd like
to get more explanations of _why_ it would be ok before I pull this
fix, since there seems to be a deeper underlying problem in the code
that this tries to fix.

           Linus
On Wed, Jul 07, 2021 at 11:40:45AM -0700, Linus Torvalds wrote:
> On Wed, Jul 7, 2021 at 6:00 AM Dennis Zhou <dennis@kernel.org> wrote:
> >
> > This is just a single change to fix percpu depopulation. The code relied
> > on depopulation code written specifically for the free path and relied
> > on vmalloc to do the tlb flush lazily. As we're modifying the backing
> > pages during the lifetime of a chunk, we need to also flush the tlb
> > accordingly.
>
> I pulled this, but I ended up unpulling after looking at the fix.
>
> The fix may be perfectly correct, but I'm looking at that
> pcpu_reclaim_populated() function, and I want somebody to explain to
> me why it's ok to drop and re-take the 'pcpu_lock' and just continue.
>
> Because whatever it was protecting is now not protected any more.
>
> It *looks* like it's intended to protect the pcpu_chunk_lists[]
> content, and some other functions that do this look ok. So for
> example, pcpu_balance_free() at least removes the 'chunk' from the
> pcpu_chunk_lists[] before it drops the lock and then works on the
> chunk contents.
>
> But pcpu_reclaim_populated() seems to *leave* chunk on the
> pcpu_chunk_lists[], drop the lock, and then continue to use 'chunk'.
>
> That odd "release lock and continue to use the data it's supposed to
> protect" seems to be pre-existing, but
>
>  (a) this is the code that caused problems to begin with
>
> and
>
>  (b) it seems to now happen even more.
>
> So maybe this code is right. But it looks very odd to me, and I'd like
> to get more explanations of _why_ it would be ok before I pull this
> fix, since there seems to be a deeper underlying problem in the code
> that this tries to fix.
>

Yeah I agree. I think I've inadvertently made this more complex than it
needs to be and intend to clean it up.

Percpu has a mutex lock and spinlock. The mutex lock is responsible for
the lifetime of a chunk and the correctness of the backing pages,
chunk->populated[]. The spinlock protects the other members of the chunk
as well as pcpu_chunk_lists[].

The pcpu_balance_workfn() is used to serialize the freeing of chunks and
maintenance of a floating # of pages to suffice atomic allocations. This
is where depopulation is being bolted onto. It uses the serialization of:

    mutex_lock(&pcpu_alloc_mutex);
    pcpu_balance_free()
    pcpu_reclaim_populated()
    pcpu_balance_populated()
    mutex_unlock(&pcpu_alloc_mutex);

while holding the mutex lock to know that the chunk we are modifying will
not be freed. It's this that makes it okay to just pick up the lock and
continue.

Depopulation adds complexity to the lifecycle of a chunk through others
freeing back percpu allocations. So, unlike pcpu_balance_free(), taking
the chunk off of pcpu_chunk_lists[] doesn't guarantee someone else isn't
accessing the chunk because they are freeing an object back. To handle
this, a notion of isolated chunks is introduced and these chunks become
effectively hidden to the allocation path. We are also holding the mutex
lock throughout this process which prevents the allocation path from
simultaneously adding backing pages to a chunk.

I hope this explanation makes sense.

Thanks,
Dennis
The pull request you sent on Wed, 7 Jul 2021 13:00:39 +0000:
> git://git.kernel.org/pub/scm/linux/kernel/git/dennis/percpu.git for-5.14-fixes
has been merged into torvalds/linux.git:
https://git.kernel.org/torvalds/c/20d5e570aee77afa44849dc652ff256290ea978e
Thank you!
```diff
diff --git a/mm/percpu-km.c b/mm/percpu-km.c
index c9d529dc7651..fe31aa19db81 100644
--- a/mm/percpu-km.c
+++ b/mm/percpu-km.c
@@ -32,6 +32,12 @@
 
 #include <linux/log2.h>
 
+static void pcpu_post_unmap_tlb_flush(struct pcpu_chunk *chunk,
+				      int page_start, int page_end)
+{
+	/* nothing */
+}
+
 static int pcpu_populate_chunk(struct pcpu_chunk *chunk,
 			       int page_start, int page_end, gfp_t gfp)
 {
diff --git a/mm/percpu-vm.c b/mm/percpu-vm.c
index ee5d89fcd66f..2054c9213c43 100644
--- a/mm/percpu-vm.c
+++ b/mm/percpu-vm.c
@@ -303,6 +303,9 @@ static int pcpu_populate_chunk(struct pcpu_chunk *chunk,
  * For each cpu, depopulate and unmap pages [@page_start,@page_end)
  * from @chunk.
  *
+ * Caller is required to call pcpu_post_unmap_tlb_flush() if not returning the
+ * region back to vmalloc() which will lazily flush the tlb.
+ *
  * CONTEXT:
  * pcpu_alloc_mutex.
  */
@@ -324,8 +327,6 @@ static void pcpu_depopulate_chunk(struct pcpu_chunk *chunk,
 
 	pcpu_unmap_pages(chunk, pages, page_start, page_end);
 
-	/* no need to flush tlb, vmalloc will handle it lazily */
-
 	pcpu_free_pages(chunk, pages, page_start, page_end);
 }
 
diff --git a/mm/percpu.c b/mm/percpu.c
index b4cebeca4c0c..7f2e0151c4e2 100644
--- a/mm/percpu.c
+++ b/mm/percpu.c
@@ -1572,6 +1572,7 @@ static void pcpu_chunk_depopulated(struct pcpu_chunk *chunk,
 *
 * pcpu_populate_chunk		- populate the specified range of a chunk
 * pcpu_depopulate_chunk	- depopulate the specified range of a chunk
+ * pcpu_post_unmap_tlb_flush	- flush tlb for the specified range of a chunk
 * pcpu_create_chunk		- create a new chunk
 * pcpu_destroy_chunk		- destroy a chunk, always preceded by full depop
 * pcpu_addr_to_page		- translate address to physical address
@@ -1581,6 +1582,8 @@ static int pcpu_populate_chunk(struct pcpu_chunk *chunk,
 			       int page_start, int page_end, gfp_t gfp);
 static void pcpu_depopulate_chunk(struct pcpu_chunk *chunk,
 				  int page_start, int page_end);
+static void pcpu_post_unmap_tlb_flush(struct pcpu_chunk *chunk,
+				      int page_start, int page_end);
 static struct pcpu_chunk *pcpu_create_chunk(gfp_t gfp);
 static void pcpu_destroy_chunk(struct pcpu_chunk *chunk);
 static struct page *pcpu_addr_to_page(void *addr);
@@ -2137,11 +2140,12 @@ static void pcpu_reclaim_populated(void)
 {
 	struct pcpu_chunk *chunk;
 	struct pcpu_block_md *block;
+	int freed_page_start, freed_page_end;
 	int i, end;
+	bool reintegrate;
 
 	lockdep_assert_held(&pcpu_lock);
 
-restart:
 	/*
 	 * Once a chunk is isolated to the to_depopulate list, the chunk is no
 	 * longer discoverable to allocations whom may populate pages.  The only
@@ -2157,6 +2161,9 @@ static void pcpu_reclaim_populated(void)
 		 * Scan chunk's pages in the reverse order to keep populated
 		 * pages close to the beginning of the chunk.
 		 */
+		freed_page_start = chunk->nr_pages;
+		freed_page_end = 0;
+		reintegrate = false;
 		for (i = chunk->nr_pages - 1, end = -1; i >= 0; i--) {
 			/* no more work to do */
 			if (chunk->nr_empty_pop_pages == 0)
@@ -2164,8 +2171,8 @@ static void pcpu_reclaim_populated(void)
 
 			/* reintegrate chunk to prevent atomic alloc failures */
 			if (pcpu_nr_empty_pop_pages < PCPU_EMPTY_POP_PAGES_HIGH) {
-				pcpu_reintegrate_chunk(chunk);
-				goto restart;
+				reintegrate = true;
+				goto end_chunk;
 			}
 
 			/*
@@ -2194,16 +2201,29 @@ static void pcpu_reclaim_populated(void)
 			spin_lock_irq(&pcpu_lock);
 
 			pcpu_chunk_depopulated(chunk, i + 1, end + 1);
+			freed_page_start = min(freed_page_start, i + 1);
+			freed_page_end = max(freed_page_end, end + 1);
 
 			/* reset the range and continue */
 			end = -1;
 		}
 
-		if (chunk->free_bytes == pcpu_unit_size)
+end_chunk:
+		/* batch tlb flush per chunk to amortize cost */
+		if (freed_page_start < freed_page_end) {
+			spin_unlock_irq(&pcpu_lock);
+			pcpu_post_unmap_tlb_flush(chunk,
+						  freed_page_start,
+						  freed_page_end);
+			cond_resched();
+			spin_lock_irq(&pcpu_lock);
+		}
+
+		if (reintegrate || chunk->free_bytes == pcpu_unit_size)
 			pcpu_reintegrate_chunk(chunk);
 		else
-			list_move(&chunk->list,
-				  &pcpu_chunk_lists[pcpu_sidelined_slot]);
+			list_move_tail(&chunk->list,
+				       &pcpu_chunk_lists[pcpu_sidelined_slot]);
 	}
 }
```