| Message ID | 20250119103205.2172432-1-mjguzik@gmail.com (mailing list archive) |
|---|---|
| State | New |
| Series | [RESEND] fs: avoid mmap sem relocks when coredumping with many missing pages |
On Sun, Jan 19, 2025 at 11:32:05AM +0100, Mateusz Guzik wrote:
> Dumping processes with large allocated and mostly not-faulted areas is
> very slow.
>
> [...]

Seems like a good improvement to me. Let's get it tested.
On Sun, 19 Jan 2025 11:32:05 +0100, Mateusz Guzik wrote:
> Dumping processes with large allocated and mostly not-faulted areas is
> very slow.
>
> Borrowing a test case from Tavian Barnes:
>
> int main(void) {
>         char *mem = mmap(NULL, 1ULL << 40, PROT_READ | PROT_WRITE,
>                 MAP_ANONYMOUS | MAP_NORESERVE | MAP_PRIVATE, -1, 0);
>         printf("%p %m\n", mem);
>         if (mem != MAP_FAILED) {
>                 mem[0] = 1;
>         }
>         abort();
> }
>
> [...]

Applied to the vfs-6.15.misc branch of the vfs/vfs.git tree.
Patches in the vfs-6.15.misc branch should appear in linux-next soon.

Please report any outstanding bugs that were missed during review in a
new review to the original patch series allowing us to drop it.

It's encouraged to provide Acked-bys and Reviewed-bys even though the
patch has now been applied. If possible patch trailers will be updated.

Note that commit hashes shown below are subject to change due to rebase,
trailer updates or similar. If in doubt, please check the listed branch.

tree:   https://git.kernel.org/pub/scm/linux/kernel/git/vfs/vfs.git
branch: vfs-6.15.misc

[1/1] fs: avoid mmap sem relocks when coredumping with many missing pages
      https://git.kernel.org/vfs/vfs/c/dbdd2935ed48
On Sun 19-01-25 11:32:05, Mateusz Guzik wrote:
> Dumping processes with large allocated and mostly not-faulted areas is
> very slow.
>
> Borrowing a test case from Tavian Barnes:
>
> int main(void) {
>         char *mem = mmap(NULL, 1ULL << 40, PROT_READ | PROT_WRITE,
>                 MAP_ANONYMOUS | MAP_NORESERVE | MAP_PRIVATE, -1, 0);
>         printf("%p %m\n", mem);
>         if (mem != MAP_FAILED) {
>                 mem[0] = 1;
>         }
>         abort();
> }
>
> That's 1TB of almost completely not-populated area.
>
> On my test box it takes 13-14 seconds to dump.
>
> [...]
>
> Almost 29% of the time is spent relocking the mmap semaphore between
> calls to get_dump_page() which find nothing.
>
> Whacking that results in times of 10 seconds (down from 13-14).
>
> While here make the thing killable.
>
> The real problem is the page-sized iteration and the real fix would
> patch it up instead. It is left as an exercise for the mm-familiar
> reader.
>
> Signed-off-by: Mateusz Guzik <mjguzik@gmail.com>

The patch looks good to me. Feel free to add:

Reviewed-by: Jan Kara <jack@suse.cz>

BTW: I don't see how we could fundamentally move away from page-sized
iteration because core dumping is "by definition" walking page tables and
gathering pages there. But it could certainly be much more efficient if
implemented properly (e.g. in the example above we'd see that most of PGD
level tables are not even allocated so we could be skipping 1GB ranges of
address space in one step).

								Honza
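To make Jan's suggestion concrete, here is a hypothetical sketch (not part of the applied patch and not posted in this thread) of skipping address ranges whose top-level page-table entries were never allocated. It assumes the caller holds the mmap read lock, the helper name skip_unpopulated_pgds() is made up, and a complete version would also have to descend into the p4d/pud/pmd levels and handle huge mappings; pgd_offset(), pgd_none() and pgd_addr_end() are existing kernel helpers.

/*
 * Hypothetical sketch: advance 'addr' past address space whose top-level
 * (PGD) page-table entries were never allocated, so a mostly-unpopulated
 * 1TB mapping is crossed in a handful of steps instead of one
 * get_dump_page() probe per 4K page. Caller holds the mmap read lock.
 */
static unsigned long skip_unpopulated_pgds(struct mm_struct *mm,
					   unsigned long addr,
					   unsigned long end)
{
	while (addr < end) {
		pgd_t *pgd = pgd_offset(mm, addr);

		/* Something may be mapped in this PGD range: stop here. */
		if (!pgd_none(*pgd))
			break;

		/* Jump to the next PGD boundary, or to 'end' if it comes first. */
		addr = pgd_addr_end(addr, end);
	}
	return addr;
}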
On Mon, Jan 20, 2025 at 5:00 PM Jan Kara <jack@suse.cz> wrote:
> BTW: I don't see how we could fundamentally move away from page-sized
> iteration because core dumping is "by definition" walking page tables and
> gathering pages there. But it could certainly be much more efficient if
> implemented properly (e.g. in the example above we'd see that most of PGD
> level tables are not even allocated so we could be skipping 1GB ranges of
> address space in one step).
>

I was thinking find first allocated page starting at X, then fill in the
gap from the last one as required by the format. Rinse & repeat until the
entire vma is covered.

Surely finding the next page which is known to be there is going to be
cheaper than specifically checking if a given page is there.
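A hypothetical shape of the iteration Mateusz describes (again, not the applied patch): find the next present page, account for the hole before it with dump_skip(), dump that page, and repeat. find_next_present_page() and dump_one_page() are invented stand-ins for a real page-table walk (e.g. built on something like the PGD-skipping sketch above) and for the existing dump_emit_page()/put_page() sequence; dump_skip() is the real fs/coredump.c helper.

/*
 * Hypothetical sketch of the "find next present page" iteration proposed
 * above. find_next_present_page() and dump_one_page() are made-up names.
 */
static int dump_user_range_sparse(struct coredump_params *cprm,
				  unsigned long start, unsigned long end)
{
	unsigned long addr = start;

	while (addr < end) {
		/* Next faulted-in page at or above 'addr', or 'end' if none. */
		unsigned long next = find_next_present_page(current->mm, addr, end);

		/* The core format still has to account for the hole. */
		if (next > addr)
			dump_skip(cprm, next - addr);
		if (next >= end)
			break;

		/* Copy out the single present page, as dump_user_range() does today. */
		if (!dump_one_page(cprm, next))
			return 0;
		addr = next + PAGE_SIZE;
	}
	return 1;
}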
On 19.01.25 11:32, Mateusz Guzik wrote:
> Dumping processes with large allocated and mostly not-faulted areas is
> very slow.
>
> [...]
>
>  arch/arm64/kernel/elfcore.c |  3 ++-
>  fs/coredump.c               | 38 +++++++++++++++++++++++++++++++------
>  include/linux/mm.h          |  2 +-
>  mm/gup.c                    |  5 ++---
>  4 files changed, 37 insertions(+), 11 deletions(-)
>

MM side LGTM

Acked-by: David Hildenbrand <david@redhat.com>
diff --git a/arch/arm64/kernel/elfcore.c b/arch/arm64/kernel/elfcore.c
index 2e94d20c4ac7..b735f4c2fe5e 100644
--- a/arch/arm64/kernel/elfcore.c
+++ b/arch/arm64/kernel/elfcore.c
@@ -27,9 +27,10 @@ static int mte_dump_tag_range(struct coredump_params *cprm,
 	int ret = 1;
 	unsigned long addr;
 	void *tags = NULL;
+	int locked = 0;
 
 	for (addr = start; addr < start + len; addr += PAGE_SIZE) {
-		struct page *page = get_dump_page(addr);
+		struct page *page = get_dump_page(addr, &locked);
 
 		/*
 		 * get_dump_page() returns NULL when encountering an empty
diff --git a/fs/coredump.c b/fs/coredump.c
index d48edb37bc35..84cf76f0d5b6 100644
--- a/fs/coredump.c
+++ b/fs/coredump.c
@@ -925,14 +925,23 @@ int dump_user_range(struct coredump_params *cprm, unsigned long start,
 {
 	unsigned long addr;
 	struct page *dump_page;
+	int locked, ret;
 
 	dump_page = dump_page_alloc();
 	if (!dump_page)
 		return 0;
 
+	ret = 0;
+	locked = 0;
 	for (addr = start; addr < start + len; addr += PAGE_SIZE) {
 		struct page *page;
 
+		if (!locked) {
+			if (mmap_read_lock_killable(current->mm))
+				goto out;
+			locked = 1;
+		}
+
 		/*
 		 * To avoid having to allocate page tables for virtual address
 		 * ranges that have never been used yet, and also to make it
@@ -940,21 +949,38 @@ int dump_user_range(struct coredump_params *cprm, unsigned long start,
 		 * NULL when encountering an empty page table entry that would
 		 * otherwise have been filled with the zero page.
 		 */
-		page = get_dump_page(addr);
+		page = get_dump_page(addr, &locked);
 		if (page) {
+			if (locked) {
+				mmap_read_unlock(current->mm);
+				locked = 0;
+			}
 			int stop = !dump_emit_page(cprm, dump_page_copy(page, dump_page));
 			put_page(page);
-			if (stop) {
-				dump_page_free(dump_page);
-				return 0;
-			}
+			if (stop)
+				goto out;
 		} else {
 			dump_skip(cprm, PAGE_SIZE);
 		}
+
+		if (dump_interrupted())
+			goto out;
+
+		if (!need_resched())
+			continue;
+		if (locked) {
+			mmap_read_unlock(current->mm);
+			locked = 0;
+		}
 		cond_resched();
 	}
+	ret = 1;
+out:
+	if (locked)
+		mmap_read_unlock(current->mm);
+
 	dump_page_free(dump_page);
-	return 1;
+	return ret;
 }
 #endif
 
diff --git a/include/linux/mm.h b/include/linux/mm.h
index 75c9b4f46897..7df0d9200d8c 100644
--- a/include/linux/mm.h
+++ b/include/linux/mm.h
@@ -2633,7 +2633,7 @@ int __account_locked_vm(struct mm_struct *mm, unsigned long pages, bool inc,
 			struct task_struct *task, bool bypass_rlim);
 
 struct kvec;
-struct page *get_dump_page(unsigned long addr);
+struct page *get_dump_page(unsigned long addr, int *locked);
 
 bool folio_mark_dirty(struct folio *folio);
 bool folio_mark_dirty_lock(struct folio *folio);
diff --git a/mm/gup.c b/mm/gup.c
index 2304175636df..f3be2aa43543 100644
--- a/mm/gup.c
+++ b/mm/gup.c
@@ -2266,13 +2266,12 @@ EXPORT_SYMBOL(fault_in_readable);
  * Called without mmap_lock (takes and releases the mmap_lock by itself).
  */
 #ifdef CONFIG_ELF_CORE
-struct page *get_dump_page(unsigned long addr)
+struct page *get_dump_page(unsigned long addr, int *locked)
 {
 	struct page *page;
-	int locked = 0;
 	int ret;
 
-	ret = __get_user_pages_locked(current->mm, addr, 1, &page, &locked,
+	ret = __get_user_pages_locked(current->mm, addr, 1, &page, locked,
 				      FOLL_FORCE | FOLL_DUMP | FOLL_GET);
 	return (ret == 1) ? page : NULL;
 }
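For reference, the calling convention the new get_dump_page() signature implies, condensed from the hunks above into a minimal sketch: the caller takes the mmap read lock, passes its state via 'locked', and re-checks the flag afterwards because the GUP machinery may have dropped the lock. dump_one() is a made-up stand-in for the dump_emit_page()/dump_page_copy() step; everything else is used as in the patch.

/*
 * Minimal sketch (not the patch itself) of the locking pattern introduced
 * above: hold the mmap read lock across the probe, drop it before doing
 * I/O on the page, and only unlock if get_dump_page() left it held.
 */
static int probe_and_dump_page(struct coredump_params *cprm, unsigned long addr)
{
	struct page *page;
	int locked = 0;
	int ret = 1;

	if (mmap_read_lock_killable(current->mm))
		return 0;
	locked = 1;

	page = get_dump_page(addr, &locked);
	if (page) {
		/* Don't hold the lock while writing the page out. */
		if (locked) {
			mmap_read_unlock(current->mm);
			locked = 0;
		}
		ret = dump_one(cprm, page);	/* hypothetical emit step */
		put_page(page);
	} else {
		dump_skip(cprm, PAGE_SIZE);	/* hole: nothing was faulted in */
	}

	if (locked)
		mmap_read_unlock(current->mm);
	return ret;
}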
Dumping processes with large allocated and mostly not-faulted areas is
very slow.

Borrowing a test case from Tavian Barnes:

int main(void) {
        char *mem = mmap(NULL, 1ULL << 40, PROT_READ | PROT_WRITE,
                MAP_ANONYMOUS | MAP_NORESERVE | MAP_PRIVATE, -1, 0);
        printf("%p %m\n", mem);
        if (mem != MAP_FAILED) {
                mem[0] = 1;
        }
        abort();
}

That's 1TB of almost completely not-populated area.

On my test box it takes 13-14 seconds to dump.

The profile shows:

  - 99.89% 0.00% a.out
       entry_SYSCALL_64_after_hwframe
       do_syscall_64
       syscall_exit_to_user_mode
       arch_do_signal_or_restart
     - get_signal
        - 99.89% do_coredump
           - 99.88% elf_core_dump
              - dump_user_range
                 - 98.12% get_dump_page
                    - 64.19% __get_user_pages
                       - 40.92% gup_vma_lookup
                          - find_vma
                             - mt_find
                         4.21% __rcu_read_lock
                         1.33% __rcu_read_unlock
                       - 3.14% check_vma_flags
                            0.68% vma_is_secretmem
                         0.61% __cond_resched
                         0.60% vma_pgtable_walk_end
                         0.59% vma_pgtable_walk_begin
                         0.58% no_page_table
                    - 15.13% down_read_killable
                         0.69% __cond_resched
                      13.84% up_read
                      0.58% __cond_resched

Almost 29% of the time is spent relocking the mmap semaphore between
calls to get_dump_page() which find nothing.

Whacking that results in times of 10 seconds (down from 13-14).

While here make the thing killable.

The real problem is the page-sized iteration and the real fix would
patch it up instead. It is left as an exercise for the mm-familiar
reader.

Signed-off-by: Mateusz Guzik <mjguzik@gmail.com>
---

Minimally tested, very plausible I missed something.

Sent again because the previous thing has myself in To -- I failed to
fix up the oneliner suggested by lore.kernel.org. It seems the original
got lost.

 arch/arm64/kernel/elfcore.c |  3 ++-
 fs/coredump.c               | 38 +++++++++++++++++++++++++++++++------
 include/linux/mm.h          |  2 +-
 mm/gup.c                    |  5 ++---
 4 files changed, 37 insertions(+), 11 deletions(-)