
[RESEND] fs: avoid mmap sem relocks when coredumping with many missing pages

Message ID: 20250119103205.2172432-1-mjguzik@gmail.com (mailing list archive)
State: New
Series: [RESEND] fs: avoid mmap sem relocks when coredumping with many missing pages

Commit Message

Mateusz Guzik Jan. 19, 2025, 10:32 a.m. UTC
Dumping processes with large allocated and mostly not-faulted areas is
very slow.

Borrowing a test case from Tavian Barnes:

#include <stdio.h>
#include <stdlib.h>
#include <sys/mman.h>

int main(void) {
    char *mem = mmap(NULL, 1ULL << 40, PROT_READ | PROT_WRITE,
            MAP_ANONYMOUS | MAP_NORESERVE | MAP_PRIVATE, -1, 0);
    printf("%p %m\n", mem);
    if (mem != MAP_FAILED) {
            mem[0] = 1;
    }
    abort();
}

That's 1TB of almost completely not-populated area.

On my test box it takes 13-14 seconds to dump.
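
For scale, assuming the usual 4 KiB page size, dump_user_range() probes
that mapping one page at a time:

    1 TiB / 4 KiB = 2^40 / 2^12 = 2^28 = 268,435,456 calls to get_dump_page()

(Reproducing this requires core dumps to be enabled, e.g. with
ulimit -c unlimited.)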

The profile shows:
-   99.89%     0.00%  a.out
     entry_SYSCALL_64_after_hwframe
     do_syscall_64
     syscall_exit_to_user_mode
     arch_do_signal_or_restart
   - get_signal
      - 99.89% do_coredump
         - 99.88% elf_core_dump
            - dump_user_range
               - 98.12% get_dump_page
                  - 64.19% __get_user_pages
                     - 40.92% gup_vma_lookup
                        - find_vma
                           - mt_find
                                4.21% __rcu_read_lock
                                1.33% __rcu_read_unlock
                     - 3.14% check_vma_flags
                          0.68% vma_is_secretmem
                       0.61% __cond_resched
                       0.60% vma_pgtable_walk_end
                       0.59% vma_pgtable_walk_begin
                       0.58% no_page_table
                  - 15.13% down_read_killable
                       0.69% __cond_resched
                    13.84% up_read
                 0.58% __cond_resched

Almost 29% of the time is spent relocking the mmap semaphore between
calls to get_dump_page() which find nothing.
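
A simplified sketch of what each of those per-page calls amounts to (not
the literal mm/gup.c code; lookup_dump_page() is a made-up stand-in for
the vma lookup and page table walk):

	/* pre-patch pattern: one lock/unlock cycle per page probed */
	struct page *get_dump_page_prepatch(unsigned long addr)
	{
		struct page *page;

		if (mmap_read_lock_killable(current->mm))	/* down_read_killable */
			return NULL;
		/* stand-in for gup_vma_lookup() + the page table walk;
		 * returns NULL for never-faulted addresses */
		page = lookup_dump_page(current->mm, addr);
		mmap_read_unlock(current->mm);			/* up_read */
		return page;
	}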

Whacking that brings the dump time down to about 10 seconds (from 13-14).

While here make the thing killable.

The real problem is the page-sized iteration and the real fix would
patch it up instead. It is left as an exercise for the mm-familiar
reader.

Signed-off-by: Mateusz Guzik <mjguzik@gmail.com>
---

Minimally tested, very plausible I missed something.

Sent again because the previous submission had myself in To -- I failed
to fix up the one-liner suggested by lore.kernel.org. It seems the
original got lost.

 arch/arm64/kernel/elfcore.c |  3 ++-
 fs/coredump.c               | 38 +++++++++++++++++++++++++++++++------
 include/linux/mm.h          |  2 +-
 mm/gup.c                    |  5 ++---
 4 files changed, 37 insertions(+), 11 deletions(-)

Comments

Christian Brauner Jan. 20, 2025, 2:48 p.m. UTC | #1
On Sun, Jan 19, 2025 at 11:32:05AM +0100, Mateusz Guzik wrote:
> Dumping processes with large allocated and mostly not-faulted areas is
> very slow.
> 
> Borrowing a test case from Tavian Barnes:
> 
> int main(void) {
>     char *mem = mmap(NULL, 1ULL << 40, PROT_READ | PROT_WRITE,
>             MAP_ANONYMOUS | MAP_NORESERVE | MAP_PRIVATE, -1, 0);
>     printf("%p %m\n", mem);
>     if (mem != MAP_FAILED) {
>             mem[0] = 1;
>     }
>     abort();
> }
> 
> That's 1TB of almost completely not-populated area.
> 
> On my test box it takes 13-14 seconds to dump.
> 
> The profile shows:
> -   99.89%     0.00%  a.out
>      entry_SYSCALL_64_after_hwframe
>      do_syscall_64
>      syscall_exit_to_user_mode
>      arch_do_signal_or_restart
>    - get_signal
>       - 99.89% do_coredump
>          - 99.88% elf_core_dump
>             - dump_user_range
>                - 98.12% get_dump_page
>                   - 64.19% __get_user_pages
>                      - 40.92% gup_vma_lookup
>                         - find_vma
>                            - mt_find
>                                 4.21% __rcu_read_lock
>                                 1.33% __rcu_read_unlock
>                      - 3.14% check_vma_flags
>                           0.68% vma_is_secretmem
>                        0.61% __cond_resched
>                        0.60% vma_pgtable_walk_end
>                        0.59% vma_pgtable_walk_begin
>                        0.58% no_page_table
>                   - 15.13% down_read_killable
>                        0.69% __cond_resched
>                     13.84% up_read
>                  0.58% __cond_resched
> 
> Almost 29% of the time is spent relocking the mmap semaphore between
> calls to get_dump_page() which find nothing.
> 
> Whacking that results in times of 10 seconds (down from 13-14).
> 
> While here make the thing killable.
> 
> The real problem is the page-sized iteration and the real fix would
> patch it up instead. It is left as an exercise for the mm-familiar
> reader.
> 
> Signed-off-by: Mateusz Guzik <mjguzik@gmail.com>
> ---

Seems like a good improvement to me.
Let's get it tested.
Christian Brauner Jan. 20, 2025, 2:51 p.m. UTC | #2
On Sun, 19 Jan 2025 11:32:05 +0100, Mateusz Guzik wrote:
> Dumping processes with large allocated and mostly not-faulted areas is
> very slow.
> 
> Borrowing a test case from Tavian Barnes:
> 
> int main(void) {
>     char *mem = mmap(NULL, 1ULL << 40, PROT_READ | PROT_WRITE,
>             MAP_ANONYMOUS | MAP_NORESERVE | MAP_PRIVATE, -1, 0);
>     printf("%p %m\n", mem);
>     if (mem != MAP_FAILED) {
>             mem[0] = 1;
>     }
>     abort();
> }
> 
> [...]

Applied to the vfs-6.15.misc branch of the vfs/vfs.git tree.
Patches in the vfs-6.15.misc branch should appear in linux-next soon.

Please report any outstanding bugs that were missed during review in a
new review to the original patch series allowing us to drop it.

It's encouraged to provide Acked-bys and Reviewed-bys even though the
patch has now been applied. If possible patch trailers will be updated.

Note that commit hashes shown below are subject to change due to rebase,
trailer updates or similar. If in doubt, please check the listed branch.

tree:   https://git.kernel.org/pub/scm/linux/kernel/git/vfs/vfs.git
branch: vfs-6.15.misc

[1/1] fs: avoid mmap sem relocks when coredumping with many missing pages
      https://git.kernel.org/vfs/vfs/c/dbdd2935ed48
Jan Kara Jan. 20, 2025, 4 p.m. UTC | #3
On Sun 19-01-25 11:32:05, Mateusz Guzik wrote:
> Dumping processes with large allocated and mostly not-faulted areas is
> very slow.
> 
> Borrowing a test case from Tavian Barnes:
> 
> int main(void) {
>     char *mem = mmap(NULL, 1ULL << 40, PROT_READ | PROT_WRITE,
>             MAP_ANONYMOUS | MAP_NORESERVE | MAP_PRIVATE, -1, 0);
>     printf("%p %m\n", mem);
>     if (mem != MAP_FAILED) {
>             mem[0] = 1;
>     }
>     abort();
> }
> 
> That's 1TB of almost completely not-populated area.
> 
> On my test box it takes 13-14 seconds to dump.
> 
> The profile shows:
> -   99.89%     0.00%  a.out
>      entry_SYSCALL_64_after_hwframe
>      do_syscall_64
>      syscall_exit_to_user_mode
>      arch_do_signal_or_restart
>    - get_signal
>       - 99.89% do_coredump
>          - 99.88% elf_core_dump
>             - dump_user_range
>                - 98.12% get_dump_page
>                   - 64.19% __get_user_pages
>                      - 40.92% gup_vma_lookup
>                         - find_vma
>                            - mt_find
>                                 4.21% __rcu_read_lock
>                                 1.33% __rcu_read_unlock
>                      - 3.14% check_vma_flags
>                           0.68% vma_is_secretmem
>                        0.61% __cond_resched
>                        0.60% vma_pgtable_walk_end
>                        0.59% vma_pgtable_walk_begin
>                        0.58% no_page_table
>                   - 15.13% down_read_killable
>                        0.69% __cond_resched
>                     13.84% up_read
>                  0.58% __cond_resched
> 
> Almost 29% of the time is spent relocking the mmap semaphore between
> calls to get_dump_page() which find nothing.
> 
> Whacking that results in times of 10 seconds (down from 13-14).
> 
> While here make the thing killable.
> 
> The real problem is the page-sized iteration and the real fix would
> patch it up instead. It is left as an exercise for the mm-familiar
> reader.
> 
> Signed-off-by: Mateusz Guzik <mjguzik@gmail.com>

The patch looks good to me. Feel free to add:

Reviewed-by: Jan Kara <jack@suse.cz>

BTW: I don't see how we could fundamentally move away from page-sized
iteration because core dumping is "by definition" walking page tables and
gathering pages there. But it could certainly be much more efficient if
implemented properly (e.g. in the example above we'd see that most of PGD
level tables are not even allocated so we could be skipping 1GB ranges of
address space in one step).

								Honza
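
(For scale, with 4 KiB pages a single empty upper-level entry covering
1 GiB spans 262,144 pages, so the 1 TiB mapping from the test case could
in principle be dismissed with about 1024 such checks instead of 2^28
per-page get_dump_page() calls.)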

> ---
> 
> Minimally tested, very plausible I missed something.
> 
> sent again because the previous thing has myself in To -- i failed to
> fix up the oneliner suggested by lore.kernel.org. it seem the original
> got lost.
> 
>  arch/arm64/kernel/elfcore.c |  3 ++-
>  fs/coredump.c               | 38 +++++++++++++++++++++++++++++++------
>  include/linux/mm.h          |  2 +-
>  mm/gup.c                    |  5 ++---
>  4 files changed, 37 insertions(+), 11 deletions(-)
> 
> diff --git a/arch/arm64/kernel/elfcore.c b/arch/arm64/kernel/elfcore.c
> index 2e94d20c4ac7..b735f4c2fe5e 100644
> --- a/arch/arm64/kernel/elfcore.c
> +++ b/arch/arm64/kernel/elfcore.c
> @@ -27,9 +27,10 @@ static int mte_dump_tag_range(struct coredump_params *cprm,
>  	int ret = 1;
>  	unsigned long addr;
>  	void *tags = NULL;
> +	int locked = 0;
>  
>  	for (addr = start; addr < start + len; addr += PAGE_SIZE) {
> -		struct page *page = get_dump_page(addr);
> +		struct page *page = get_dump_page(addr, &locked);
>  
>  		/*
>  		 * get_dump_page() returns NULL when encountering an empty
> diff --git a/fs/coredump.c b/fs/coredump.c
> index d48edb37bc35..84cf76f0d5b6 100644
> --- a/fs/coredump.c
> +++ b/fs/coredump.c
> @@ -925,14 +925,23 @@ int dump_user_range(struct coredump_params *cprm, unsigned long start,
>  {
>  	unsigned long addr;
>  	struct page *dump_page;
> +	int locked, ret;
>  
>  	dump_page = dump_page_alloc();
>  	if (!dump_page)
>  		return 0;
>  
> +	ret = 0;
> +	locked = 0;
>  	for (addr = start; addr < start + len; addr += PAGE_SIZE) {
>  		struct page *page;
>  
> +		if (!locked) {
> +			if (mmap_read_lock_killable(current->mm))
> +				goto out;
> +			locked = 1;
> +		}
> +
>  		/*
>  		 * To avoid having to allocate page tables for virtual address
>  		 * ranges that have never been used yet, and also to make it
> @@ -940,21 +949,38 @@ int dump_user_range(struct coredump_params *cprm, unsigned long start,
>  		 * NULL when encountering an empty page table entry that would
>  		 * otherwise have been filled with the zero page.
>  		 */
> -		page = get_dump_page(addr);
> +		page = get_dump_page(addr, &locked);
>  		if (page) {
> +			if (locked) {
> +				mmap_read_unlock(current->mm);
> +				locked = 0;
> +			}
>  			int stop = !dump_emit_page(cprm, dump_page_copy(page, dump_page));
>  			put_page(page);
> -			if (stop) {
> -				dump_page_free(dump_page);
> -				return 0;
> -			}
> +			if (stop)
> +				goto out;
>  		} else {
>  			dump_skip(cprm, PAGE_SIZE);
>  		}
> +
> +		if (dump_interrupted())
> +			goto out;
> +
> +		if (!need_resched())
> +			continue;
> +		if (locked) {
> +			mmap_read_unlock(current->mm);
> +			locked = 0;
> +		}
>  		cond_resched();
>  	}
> +	ret = 1;
> +out:
> +	if (locked)
> +		mmap_read_unlock(current->mm);
> +
>  	dump_page_free(dump_page);
> -	return 1;
> +	return ret;
>  }
>  #endif
>  
> diff --git a/include/linux/mm.h b/include/linux/mm.h
> index 75c9b4f46897..7df0d9200d8c 100644
> --- a/include/linux/mm.h
> +++ b/include/linux/mm.h
> @@ -2633,7 +2633,7 @@ int __account_locked_vm(struct mm_struct *mm, unsigned long pages, bool inc,
>  			struct task_struct *task, bool bypass_rlim);
>  
>  struct kvec;
> -struct page *get_dump_page(unsigned long addr);
> +struct page *get_dump_page(unsigned long addr, int *locked);
>  
>  bool folio_mark_dirty(struct folio *folio);
>  bool folio_mark_dirty_lock(struct folio *folio);
> diff --git a/mm/gup.c b/mm/gup.c
> index 2304175636df..f3be2aa43543 100644
> --- a/mm/gup.c
> +++ b/mm/gup.c
> @@ -2266,13 +2266,12 @@ EXPORT_SYMBOL(fault_in_readable);
>   * Called without mmap_lock (takes and releases the mmap_lock by itself).
>   */
>  #ifdef CONFIG_ELF_CORE
> -struct page *get_dump_page(unsigned long addr)
> +struct page *get_dump_page(unsigned long addr, int *locked)
>  {
>  	struct page *page;
> -	int locked = 0;
>  	int ret;
>  
> -	ret = __get_user_pages_locked(current->mm, addr, 1, &page, &locked,
> +	ret = __get_user_pages_locked(current->mm, addr, 1, &page, locked,
>  				      FOLL_FORCE | FOLL_DUMP | FOLL_GET);
>  	return (ret == 1) ? page : NULL;
>  }
> -- 
> 2.43.0
>
Mateusz Guzik Jan. 20, 2025, 4:59 p.m. UTC | #4
On Mon, Jan 20, 2025 at 5:00 PM Jan Kara <jack@suse.cz> wrote:
> BTW: I don't see how we could fundamentally move away from page-sized
> iteration because core dumping is "by definition" walking page tables and
> gathering pages there. But it could certainly be much more efficient if
> implemented properly (e.g. in the example above we'd see that most of PGD
> level tables are not even allocated so we could be skipping 1GB ranges of
> address space in one step).
>

I was thinking: find the first allocated page starting at X, then fill in
the gap from the last one as required by the format. Rinse & repeat
until the entire vma is covered.

Surely finding the next page that is known to be there is going to be
cheaper than specifically checking whether each page in turn is there.
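
A minimal sketch of that shape, written against the variables of
dump_user_range(); find_next_present_addr() is a purely hypothetical
helper that would walk the page tables and skip over empty upper-level
entries (not an existing kernel API), and error/lock handling is omitted:

	for (addr = start; addr < start + len; ) {
		/* hypothetical: next address backed by a present page, or the end */
		unsigned long next = find_next_present_addr(current->mm, addr,
							    start + len);

		if (next > addr)
			dump_skip(cprm, next - addr);	/* emit the whole hole at once */
		if (next >= start + len)
			break;

		/* 'next' is expected to be populated; dump it as the loop does today */
		page = get_dump_page(next, &locked);
		if (page) {
			int stop = !dump_emit_page(cprm, dump_page_copy(page, dump_page));

			put_page(page);
			if (stop)
				break;
		} else {
			dump_skip(cprm, PAGE_SIZE);	/* raced away; treat as a hole */
		}
		addr = next + PAGE_SIZE;
	}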
David Hildenbrand Jan. 20, 2025, 6:59 p.m. UTC | #5
On 19.01.25 11:32, Mateusz Guzik wrote:
> Dumping processes with large allocated and mostly not-faulted areas is
> very slow.
> 
> Borrowing a test case from Tavian Barnes:
> 
> int main(void) {
>      char *mem = mmap(NULL, 1ULL << 40, PROT_READ | PROT_WRITE,
>              MAP_ANONYMOUS | MAP_NORESERVE | MAP_PRIVATE, -1, 0);
>      printf("%p %m\n", mem);
>      if (mem != MAP_FAILED) {
>              mem[0] = 1;
>      }
>      abort();
> }
> 
> That's 1TB of almost completely not-populated area.
> 
> On my test box it takes 13-14 seconds to dump.
> 
> The profile shows:
> -   99.89%     0.00%  a.out
>       entry_SYSCALL_64_after_hwframe
>       do_syscall_64
>       syscall_exit_to_user_mode
>       arch_do_signal_or_restart
>     - get_signal
>        - 99.89% do_coredump
>           - 99.88% elf_core_dump
>              - dump_user_range
>                 - 98.12% get_dump_page
>                    - 64.19% __get_user_pages
>                       - 40.92% gup_vma_lookup
>                          - find_vma
>                             - mt_find
>                                  4.21% __rcu_read_lock
>                                  1.33% __rcu_read_unlock
>                       - 3.14% check_vma_flags
>                            0.68% vma_is_secretmem
>                         0.61% __cond_resched
>                         0.60% vma_pgtable_walk_end
>                         0.59% vma_pgtable_walk_begin
>                         0.58% no_page_table
>                    - 15.13% down_read_killable
>                         0.69% __cond_resched
>                      13.84% up_read
>                   0.58% __cond_resched
> 
> Almost 29% of the time is spent relocking the mmap semaphore between
> calls to get_dump_page() which find nothing.
> 
> Whacking that results in times of 10 seconds (down from 13-14).
> 
> While here make the thing killable.
> 
> The real problem is the page-sized iteration and the real fix would
> patch it up instead. It is left as an exercise for the mm-familiar
> reader.
> 
> Signed-off-by: Mateusz Guzik <mjguzik@gmail.com>
> ---
> 
> Minimally tested, very plausible I missed something.
> 
> sent again because the previous thing has myself in To -- i failed to
> fix up the oneliner suggested by lore.kernel.org. it seem the original
> got lost.
> 
>   arch/arm64/kernel/elfcore.c |  3 ++-
>   fs/coredump.c               | 38 +++++++++++++++++++++++++++++++------
>   include/linux/mm.h          |  2 +-
>   mm/gup.c                    |  5 ++---
>   4 files changed, 37 insertions(+), 11 deletions(-)
> 

MM side LGTM

Acked-by: David Hildenbrand <david@redhat.com>

Patch

diff --git a/arch/arm64/kernel/elfcore.c b/arch/arm64/kernel/elfcore.c
index 2e94d20c4ac7..b735f4c2fe5e 100644
--- a/arch/arm64/kernel/elfcore.c
+++ b/arch/arm64/kernel/elfcore.c
@@ -27,9 +27,10 @@  static int mte_dump_tag_range(struct coredump_params *cprm,
 	int ret = 1;
 	unsigned long addr;
 	void *tags = NULL;
+	int locked = 0;
 
 	for (addr = start; addr < start + len; addr += PAGE_SIZE) {
-		struct page *page = get_dump_page(addr);
+		struct page *page = get_dump_page(addr, &locked);
 
 		/*
 		 * get_dump_page() returns NULL when encountering an empty
diff --git a/fs/coredump.c b/fs/coredump.c
index d48edb37bc35..84cf76f0d5b6 100644
--- a/fs/coredump.c
+++ b/fs/coredump.c
@@ -925,14 +925,23 @@  int dump_user_range(struct coredump_params *cprm, unsigned long start,
 {
 	unsigned long addr;
 	struct page *dump_page;
+	int locked, ret;
 
 	dump_page = dump_page_alloc();
 	if (!dump_page)
 		return 0;
 
+	ret = 0;
+	locked = 0;
 	for (addr = start; addr < start + len; addr += PAGE_SIZE) {
 		struct page *page;
 
+		if (!locked) {
+			if (mmap_read_lock_killable(current->mm))
+				goto out;
+			locked = 1;
+		}
+
 		/*
 		 * To avoid having to allocate page tables for virtual address
 		 * ranges that have never been used yet, and also to make it
@@ -940,21 +949,38 @@  int dump_user_range(struct coredump_params *cprm, unsigned long start,
 		 * NULL when encountering an empty page table entry that would
 		 * otherwise have been filled with the zero page.
 		 */
-		page = get_dump_page(addr);
+		page = get_dump_page(addr, &locked);
 		if (page) {
+			if (locked) {
+				mmap_read_unlock(current->mm);
+				locked = 0;
+			}
 			int stop = !dump_emit_page(cprm, dump_page_copy(page, dump_page));
 			put_page(page);
-			if (stop) {
-				dump_page_free(dump_page);
-				return 0;
-			}
+			if (stop)
+				goto out;
 		} else {
 			dump_skip(cprm, PAGE_SIZE);
 		}
+
+		if (dump_interrupted())
+			goto out;
+
+		if (!need_resched())
+			continue;
+		if (locked) {
+			mmap_read_unlock(current->mm);
+			locked = 0;
+		}
 		cond_resched();
 	}
+	ret = 1;
+out:
+	if (locked)
+		mmap_read_unlock(current->mm);
+
 	dump_page_free(dump_page);
-	return 1;
+	return ret;
 }
 #endif
 
diff --git a/include/linux/mm.h b/include/linux/mm.h
index 75c9b4f46897..7df0d9200d8c 100644
--- a/include/linux/mm.h
+++ b/include/linux/mm.h
@@ -2633,7 +2633,7 @@  int __account_locked_vm(struct mm_struct *mm, unsigned long pages, bool inc,
 			struct task_struct *task, bool bypass_rlim);
 
 struct kvec;
-struct page *get_dump_page(unsigned long addr);
+struct page *get_dump_page(unsigned long addr, int *locked);
 
 bool folio_mark_dirty(struct folio *folio);
 bool folio_mark_dirty_lock(struct folio *folio);
diff --git a/mm/gup.c b/mm/gup.c
index 2304175636df..f3be2aa43543 100644
--- a/mm/gup.c
+++ b/mm/gup.c
@@ -2266,13 +2266,12 @@  EXPORT_SYMBOL(fault_in_readable);
  * Called without mmap_lock (takes and releases the mmap_lock by itself).
  */
 #ifdef CONFIG_ELF_CORE
-struct page *get_dump_page(unsigned long addr)
+struct page *get_dump_page(unsigned long addr, int *locked)
 {
 	struct page *page;
-	int locked = 0;
 	int ret;
 
-	ret = __get_user_pages_locked(current->mm, addr, 1, &page, &locked,
+	ret = __get_user_pages_locked(current->mm, addr, 1, &page, locked,
 				      FOLL_FORCE | FOLL_DUMP | FOLL_GET);
 	return (ret == 1) ? page : NULL;
 }
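
For reference, the caller-side contract implied by the new get_dump_page()
signature, as inferred from the two call sites above (a sketch, not formal
documentation; dump_one_page() is a made-up wrapper):

	static struct page *dump_one_page(unsigned long addr)
	{
		int locked = 0;
		struct page *page;

		/* retained-lock mode, as in dump_user_range(): take mmap_lock up front */
		if (mmap_read_lock_killable(current->mm))
			return NULL;			/* killed while waiting for the lock */
		locked = 1;

		page = get_dump_page(addr, &locked);	/* may drop the lock and clear 'locked' */

		if (locked)				/* ... so re-check before unlocking */
			mmap_read_unlock(current->mm);

		/* passing locked == 0 without taking the lock (as the arm64 caller does)
		 * keeps the old behaviour: gup locks and unlocks internally per call */
		return page;
	}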