
coredump: allow interrupting dumps of large anonymous regions

Message ID 049f0da40ed76d94c419f83dd42deb413d6afb44.1737000287.git.tavianator@tavianator.com (mailing list archive)
State New
Series: coredump: allow interrupting dumps of large anonymous regions

Commit Message

Tavian Barnes Jan. 16, 2025, 4:05 a.m. UTC
dump_user_range() supports sparse core dumps by skipping anonymous pages
which have not been modified.  If get_dump_page() returns NULL, the page
is skipped rather than written to the core dump with dump_emit_page().

Sadly, dump_emit_page() contains the only check for dump_interrupted(),
so when dumping a very large sparse region, the core dump becomes
effectively uninterruptible.  This can be observed with the following
test program:

    #include <stdlib.h>
    #include <stdio.h>
    #include <sys/mman.h>

    int main(void) {
        char *mem = mmap(NULL, 1ULL << 40, PROT_READ | PROT_WRITE,
                MAP_ANONYMOUS | MAP_NORESERVE | MAP_PRIVATE, -1, 0);
        printf("%p %m\n", mem);
        if (mem != MAP_FAILED) {
                mem[0] = 1;
        }
        abort();
    }

The program allocates 1 TiB of anonymous memory, touches one page of it,
and aborts.  During the core dump, SIGKILL has no effect.  It takes
about 30 seconds to finish the dump, burning 100% CPU.

This issue naturally arises with things like Address Sanitizer, which
allocate a large sparse region of virtual address space for their shadow
memory.

Fix it by checking dump_interrupted() explicitly in dump_user_range().

Signed-off-by: Tavian Barnes <tavianator@tavianator.com>
---
 fs/coredump.c | 4 ++++
 1 file changed, 4 insertions(+)
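
For context, here is a simplified sketch of the pre-patch dump_user_range()
loop (illustrative only; the real code also copies through a bounce page and
drops page references), showing why SIGKILL is only noticed when a present
page is actually written:

int dump_user_range(struct coredump_params *cprm, unsigned long start,
		    unsigned long len)
{
	unsigned long addr;

	for (addr = start; addr < start + len; addr += PAGE_SIZE) {
		struct page *page = get_dump_page(addr);

		if (page) {
			/* dump_emit_page() contains the only
			 * dump_interrupted() check on this path... */
			if (!dump_emit_page(cprm, page))
				return 0;
		} else {
			/* ...so a long run of absent pages never notices
			 * a pending SIGKILL. */
			dump_skip(cprm, PAGE_SIZE);
		}
		cond_resched();
	}
	return 1;
}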

Comments

Mateusz Guzik Jan. 16, 2025, 7:46 a.m. UTC | #1
On Wed, Jan 15, 2025 at 11:05:38PM -0500, Tavian Barnes wrote:
> dump_user_range() supports sparse core dumps by skipping anonymous pages
> which have not been modified.  If get_dump_page() returns NULL, the page
> is skipped rather than written to the core dump with dump_emit_page().
> 
> Sadly, dump_emit_page() contains the only check for dump_interrupted(),
> so when dumping a very large sparse region, the core dump becomes
> effectively uninterruptible.  This can be observed with the following
> test program:
> 
>     #include <stdlib.h>
>     #include <stdio.h>
>     #include <sys/mman.h>
> 
>     int main(void) {
>         char *mem = mmap(NULL, 1ULL << 40, PROT_READ | PROT_WRITE,
>                 MAP_ANONYMOUS | MAP_NORESERVE | MAP_PRIVATE, -1, 0);
>         printf("%p %m\n", mem);
>         if (mem != MAP_FAILED) {
>                 mem[0] = 1;
>         }
>         abort();
>     }
> 
> The program allocates 1 TiB of anonymous memory, touches one page of it,
> and aborts.  During the core dump, SIGKILL has no effect.  It takes
> about 30 seconds to finish the dump, burning 100% CPU.
> 

While the patch makes sense to me, this should not be taking anywhere
near this much time and plausibly after unscrewing it will stop being a
factor.

So I had a look with a profiler:
-   99.89%     0.00%  a.out
     entry_SYSCALL_64_after_hwframe                      
     do_syscall_64                                       
     syscall_exit_to_user_mode                           
     arch_do_signal_or_restart                           
   - get_signal                                          
      - 99.89% do_coredump                               
         - 99.88% elf_core_dump                          
            - dump_user_range                            
               - 98.12% get_dump_page                    
                  - 64.19% __get_user_pages              
                     - 40.92% gup_vma_lookup             
                        - find_vma                       
                           - mt_find                     
                                4.21% __rcu_read_lock    
                                1.33% __rcu_read_unlock  
                     - 3.14% check_vma_flags             
                          0.68% vma_is_secretmem         
                       0.61% __cond_resched              
                       0.60% vma_pgtable_walk_end        
                       0.59% vma_pgtable_walk_begin      
                       0.58% no_page_table               
                  - 15.13% down_read_killable            
                       0.69% __cond_resched              
                    13.84% up_read                       
                 0.58% __cond_resched                    


Almost 29% of time is spent relocking the mmap semaphore in
__get_user_pages. This most likely can operate locklessly in the fast
path. Even if somehow not, chances are the lock can be held across
multiple calls.

mt_find spends most of its time issuing a rep stos of 48 bytes (would
be faster to rep mov 6 times instead). This is the compiler being nasty,
I'll maybe look into it.

However, I strongly suspect the current iteration method is just slow
due to repeat mt_find calls and The Right Approach(tm) would make this
entire thing finish within milliseconds by iterating the maple tree
instead, but then the mm folk would have to be consulted on how to
approach this and it may be time consuming to implement.

Sorting out relocking should be an easily achievable & measurable win
(no interest on my end).
Jan Kara Jan. 16, 2025, 9:56 a.m. UTC | #2
On Thu 16-01-25 08:46:48, Mateusz Guzik wrote:
> On Wed, Jan 15, 2025 at 11:05:38PM -0500, Tavian Barnes wrote:
> > dump_user_range() supports sparse core dumps by skipping anonymous pages
> > which have not been modified.  If get_dump_page() returns NULL, the page
> > is skipped rather than written to the core dump with dump_emit_page().
> > 
> > Sadly, dump_emit_page() contains the only check for dump_interrupted(),
> > so when dumping a very large sparse region, the core dump becomes
> > effectively uninterruptible.  This can be observed with the following
> > test program:
> > 
> >     #include <stdlib.h>
> >     #include <stdio.h>
> >     #include <sys/mman.h>
> > 
> >     int main(void) {
> >         char *mem = mmap(NULL, 1ULL << 40, PROT_READ | PROT_WRITE,
> >                 MAP_ANONYMOUS | MAP_NORESERVE | MAP_PRIVATE, -1, 0);
> >         printf("%p %m\n", mem);
> >         if (mem != MAP_FAILED) {
> >                 mem[0] = 1;
> >         }
> >         abort();
> >     }
> > 
> > The program allocates 1 TiB of anonymous memory, touches one page of it,
> > and aborts.  During the core dump, SIGKILL has no effect.  It takes
> > about 30 seconds to finish the dump, burning 100% CPU.
> > 
> 
> While the patch makes sense to me, this should not be taking anywhere
> near this much time and plausibly after unscrewing it will stop being a
> factor.
> 
> So I had a look with a profiler:
> -   99.89%     0.00%  a.out
>      entry_SYSCALL_64_after_hwframe                      
>      do_syscall_64                                       
>      syscall_exit_to_user_mode                           
>      arch_do_signal_or_restart                           
>    - get_signal                                          
>       - 99.89% do_coredump                               
>          - 99.88% elf_core_dump                          
>             - dump_user_range                            
>                - 98.12% get_dump_page                    
>                   - 64.19% __get_user_pages              
>                      - 40.92% gup_vma_lookup             
>                         - find_vma                       
>                            - mt_find                     
>                                 4.21% __rcu_read_lock    
>                                 1.33% __rcu_read_unlock  
>                      - 3.14% check_vma_flags             
>                           0.68% vma_is_secretmem         
>                        0.61% __cond_resched              
>                        0.60% vma_pgtable_walk_end        
>                        0.59% vma_pgtable_walk_begin      
>                        0.58% no_page_table               
>                   - 15.13% down_read_killable            
>                        0.69% __cond_resched              
>                     13.84% up_read                       
>                  0.58% __cond_resched                    
> 
> 
> Almost 29% of time is spent relocking the mmap semaphore in
> __get_user_pages. This most likely can operate locklessly in the fast
> path. Even if somehow not, chances are the lock can be held across
> multiple calls.
> 
> mt_find spends most of its time issuing a rep stos of 48 bytes (would
> be faster to rep mov 6 times instead). This is the compiler being nasty,
> I'll maybe look into it.
> 
> However, I strongly suspect the current iteration method is just slow
> due to repeat mt_find calls and The Right Approach(tm) would make this
> entire thing finish within milliseconds by iterating the maple tree
> instead, but then the mm folk would have to be consulted on how to
> approach this and it may be time consuming to implement.
> 
> Sorting out relocking should be an easily achievable & measurable win
> (no interest on my end).

As much as I agree the code is dumb, doing what you suggest with mmap_sem
isn't going to be easy. You cannot call dump_emit_page() with mmap_sem held
as that will cause lock inversion between mmap_sem and whatever filesystem
locks we have to take. So the fix would have to involve processing larger
batches of address space at once (which should also somewhat amortize the
__get_user_pages() setup costs). Not that hard to do but I wanted to spell
it out in case someone wants to pick up this todo item :)

								Honza
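
A rough sketch of the batching Jan describes, under stated assumptions:
gather a run of pages with the mmap lock taken once, drop the lock, then
write the batch out so dump_emit_page() never takes filesystem locks under
mmap_lock. get_dump_page_locked() is a hypothetical helper standing in for a
GUP lookup done with the lock already held; it is not an existing kernel
function.

#define DUMP_BATCH 32

static int dump_user_range_batched(struct coredump_params *cprm,
				   unsigned long start, unsigned long len)
{
	struct mm_struct *mm = current->mm;
	unsigned long addr = start, end = start + len;

	while (addr < end) {
		struct page *pages[DUMP_BATCH];
		int i, n = 0;

		if (mmap_read_lock_killable(mm))
			return 0;
		while (addr < end && n < DUMP_BATCH) {
			pages[n++] = get_dump_page_locked(mm, addr); /* hypothetical */
			addr += PAGE_SIZE;
		}
		mmap_read_unlock(mm);

		/* Emit the batch with no mm locks held, so dump_emit_page()
		 * can safely take filesystem locks. */
		for (i = 0; i < n; i++) {
			if (pages[i]) {
				int stop = !dump_emit_page(cprm, pages[i]);

				put_page(pages[i]);
				if (stop)
					return 0;
			} else {
				dump_skip(cprm, PAGE_SIZE);
			}
		}
		if (dump_interrupted())
			return 0;
		cond_resched();
	}
	return 1;
}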
Jan Kara Jan. 16, 2025, 10:05 a.m. UTC | #3
On Wed 15-01-25 23:05:38, Tavian Barnes wrote:
> dump_user_range() supports sparse core dumps by skipping anonymous pages
> which have not been modified.  If get_dump_page() returns NULL, the page
> is skipped rather than written to the core dump with dump_emit_page().
> 
> Sadly, dump_emit_page() contains the only check for dump_interrupted(),
> so when dumping a very large sparse region, the core dump becomes
> effectively uninterruptible.  This can be observed with the following
> test program:
> 
>     #include <stdlib.h>
>     #include <stdio.h>
>     #include <sys/mman.h>
> 
>     int main(void) {
>         char *mem = mmap(NULL, 1ULL << 40, PROT_READ | PROT_WRITE,
>                 MAP_ANONYMOUS | MAP_NORESERVE | MAP_PRIVATE, -1, 0);
>         printf("%p %m\n", mem);
>         if (mem != MAP_FAILED) {
>                 mem[0] = 1;
>         }
>         abort();
>     }
> 
> The program allocates 1 TiB of anonymous memory, touches one page of it,
> and aborts.  During the core dump, SIGKILL has no effect.  It takes
> about 30 seconds to finish the dump, burning 100% CPU.
> 
> This issue naturally arises with things like Address Sanitizer, which
> allocate a large sparse region of virtual address space for their shadow
> memory.
> 
> Fix it by checking dump_interrupted() explicitly in dump_user_range().
> 
> Signed-off-by: Tavian Barnes <tavianator@tavianator.com>

Thanks for the patch! The idea looks good to me as a quick fix, one
suggestion for improvement below:

> diff --git a/fs/coredump.c b/fs/coredump.c
> index d48edb37bc35..fd29d3f15f1e 100644
> --- a/fs/coredump.c
> +++ b/fs/coredump.c
> @@ -950,6 +950,10 @@ int dump_user_range(struct coredump_params *cprm, unsigned long start,
>  			}
>  		} else {
>  			dump_skip(cprm, PAGE_SIZE);
> +			if (dump_interrupted()) {
> +				dump_page_free(dump_page);
> +				return 0;
> +			}

So rather than doing the check here, I'd do it before cond_resched() below
and remove the check from dump_emit_page(). That way we have the
interruption handling all in one place.

>  		}
>  		cond_resched();
>  	}

Bonus points for unifying the exit paths from the loop (perhaps as a
separate cleanup patch) like:

		if (page)
			ret = dump_emit_page(...)
		else
			dump_skip(...)
		if (dump_interrupted())
			ret = 0;
		if (!ret)
			break;
		cond_resched();
	}
	dump_page_free(dump_page);
	return ret;

								Honza
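
Putting both suggestions together might end up looking roughly like the
sketch below (untested; it assumes dump_emit_page() no longer checks
dump_interrupted() itself, and reuses the existing dump_page_copy() and
dump_page_free() bounce-page helpers):

	int ret = 1;

	for (addr = start; addr < start + len; addr += PAGE_SIZE) {
		struct page *page = get_dump_page(addr);

		if (page) {
			ret = dump_emit_page(cprm, dump_page_copy(page, dump_page));
			put_page(page);
		} else {
			dump_skip(cprm, PAGE_SIZE);
		}
		/* Interruption handling in one place, next to cond_resched(). */
		if (dump_interrupted())
			ret = 0;
		if (!ret)
			break;
		cond_resched();
	}
	dump_page_free(dump_page);
	return ret;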
Mateusz Guzik Jan. 16, 2025, 12:04 p.m. UTC | #4
On Thu, Jan 16, 2025 at 10:56 AM Jan Kara <jack@suse.cz> wrote:
>
> On Thu 16-01-25 08:46:48, Mateusz Guzik wrote:
> > On Wed, Jan 15, 2025 at 11:05:38PM -0500, Tavian Barnes wrote:
> > > dump_user_range() supports sparse core dumps by skipping anonymous pages
> > > which have not been modified.  If get_dump_page() returns NULL, the page
> > > is skipped rather than written to the core dump with dump_emit_page().
> > >
> > > Sadly, dump_emit_page() contains the only check for dump_interrupted(),
> > > so when dumping a very large sparse region, the core dump becomes
> > > effectively uninterruptible.  This can be observed with the following
> > > test program:
> > >
> > >     #include <stdlib.h>
> > >     #include <stdio.h>
> > >     #include <sys/mman.h>
> > >
> > >     int main(void) {
> > >         char *mem = mmap(NULL, 1ULL << 40, PROT_READ | PROT_WRITE,
> > >                 MAP_ANONYMOUS | MAP_NORESERVE | MAP_PRIVATE, -1, 0);
> > >         printf("%p %m\n", mem);
> > >         if (mem != MAP_FAILED) {
> > >                 mem[0] = 1;
> > >         }
> > >         abort();
> > >     }
> > >
> > > The program allocates 1 TiB of anonymous memory, touches one page of it,
> > > and aborts.  During the core dump, SIGKILL has no effect.  It takes
> > > about 30 seconds to finish the dump, burning 100% CPU.
> > >
> >
> > While the patch makes sense to me, this should not be taking anywhere
> > near this much time and plausibly after unscrewing it will stop being a
> > factor.
> >
> > So I had a look with a profiler:
> > -   99.89%     0.00%  a.out
> >      entry_SYSCALL_64_after_hwframe
> >      do_syscall_64
> >      syscall_exit_to_user_mode
> >      arch_do_signal_or_restart
> >    - get_signal
> >       - 99.89% do_coredump
> >          - 99.88% elf_core_dump
> >             - dump_user_range
> >                - 98.12% get_dump_page
> >                   - 64.19% __get_user_pages
> >                      - 40.92% gup_vma_lookup
> >                         - find_vma
> >                            - mt_find
> >                                 4.21% __rcu_read_lock
> >                                 1.33% __rcu_read_unlock
> >                      - 3.14% check_vma_flags
> >                           0.68% vma_is_secretmem
> >                        0.61% __cond_resched
> >                        0.60% vma_pgtable_walk_end
> >                        0.59% vma_pgtable_walk_begin
> >                        0.58% no_page_table
> >                   - 15.13% down_read_killable
> >                        0.69% __cond_resched
> >                     13.84% up_read
> >                  0.58% __cond_resched
> >
> >
> > Almost 29% of time is spent relocking the mmap semaphore in
> > __get_user_pages. This most likely can operate locklessly in the fast
> > path. Even if somehow not, chances are the lock can be held across
> > multiple calls.
> >
> > mt_find spends most of its time issuing a rep stos of 48 bytes (would
> > be faster to rep mov 6 times instead). This is the compiler being nasty,
> > I'll maybe look into it.
> >
> > However, I strongly suspect the current iteration method is just slow
> > due to repeat mt_find calls and The Right Approach(tm) would make this
> > entire thing finish within milliseconds by iterating the maple tree
> > instead, but then the mm folk would have to be consulted on how to
> > approach this and it may be time consuming to implement.
> >
> > Sorting out relocking should be an easily achievable & measurable win
> > (no interest on my end).
>
> As much as I agree the code is dumb, doing what you suggest with mmap_sem
> isn't going to be easy. You cannot call dump_emit_page() with mmap_sem held
> as that will cause lock inversion between mmap_sem and whatever filesystem
> locks we have to take. So the fix would have to involve processing larger
> batches of address space at once (which should also somewhat amortize the
> __get_user_pages() setup costs). Not that hard to do but I wanted to spell
> it out in case someone wants to pick up this todo item :)
>

Is the lock really needed to begin with?

Suppose it is.

In this context there are next to no pages found, but there is a
gazillion relocks as the entire VA is being walked.

Bare minimum patch which would already significantly help would start
with the lock held and only relock if there is a page to dump, should
be very easy to add.

I however vote for someone mm-savvy to point out an easy way (if any)
to just iterate pages which are there instead.
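
A hedged sketch of the bare-minimum relocking change suggested above: enter
the loop with mmap_lock held for read and only drop/retake it when a page
actually has to be written. As before, get_dump_page_locked() is a
hypothetical stand-in for a lookup that requires the caller to already hold
the lock:

	if (mmap_read_lock_killable(mm))
		return 0;
	for (addr = start; addr < start + len; addr += PAGE_SIZE) {
		struct page *page = get_dump_page_locked(mm, addr); /* hypothetical */

		if (!page) {
			/* Common sparse case: skip without dropping the lock. */
			dump_skip(cprm, PAGE_SIZE);
			if (dump_interrupted()) {
				mmap_read_unlock(mm);
				return 0;
			}
			cond_resched();
			continue;
		}
		/* Rare case: drop the lock so dump_emit_page() can take
		 * filesystem locks, then retake it for the next lookup. */
		mmap_read_unlock(mm);
		if (!dump_emit_page(cprm, page)) {
			put_page(page);
			return 0;
		}
		put_page(page);
		if (mmap_read_lock_killable(mm))
			return 0;
	}
	mmap_read_unlock(mm);
	return 1;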
Tavian Barnes Jan. 16, 2025, 5:04 p.m. UTC | #5
On Thu, Jan 16, 2025 at 01:04:42PM +0100, Mateusz Guzik wrote:
> On Thu, Jan 16, 2025 at 10:56 AM Jan Kara <jack@suse.cz> wrote:
> >
> > On Thu 16-01-25 08:46:48, Mateusz Guzik wrote:
> > > On Wed, Jan 15, 2025 at 11:05:38PM -0500, Tavian Barnes wrote:
> > > > dump_user_range() supports sparse core dumps by skipping anonymous pages
> > > > which have not been modified.  If get_dump_page() returns NULL, the page
> > > > is skipped rather than written to the core dump with dump_emit_page().
> > > >
> > > > Sadly, dump_emit_page() contains the only check for dump_interrupted(),
> > > > so when dumping a very large sparse region, the core dump becomes
> > > > effectively uninterruptible.  This can be observed with the following
> > > > test program:
> > > >
> > > >     #include <stdlib.h>
> > > >     #include <stdio.h>
> > > >     #include <sys/mman.h>
> > > >
> > > >     int main(void) {
> > > >         char *mem = mmap(NULL, 1ULL << 40, PROT_READ | PROT_WRITE,
> > > >                 MAP_ANONYMOUS | MAP_NORESERVE | MAP_PRIVATE, -1, 0);
> > > >         printf("%p %m\n", mem);
> > > >         if (mem != MAP_FAILED) {
> > > >                 mem[0] = 1;
> > > >         }
> > > >         abort();
> > > >     }
> > > >
> > > > The program allocates 1 TiB of anonymous memory, touches one page of it,
> > > > and aborts.  During the core dump, SIGKILL has no effect.  It takes
> > > > about 30 seconds to finish the dump, burning 100% CPU.
> > > >
> > >
> > > While the patch makes sense to me, this should not be taking anywhere
> > > near this much time and plausibly after unscrewing it will stop being a
> > > factor.
> > >
> > > So I had a look with a profiler:
> > > -   99.89%     0.00%  a.out
> > >      entry_SYSCALL_64_after_hwframe
> > >      do_syscall_64
> > >      syscall_exit_to_user_mode
> > >      arch_do_signal_or_restart
> > >    - get_signal
> > >       - 99.89% do_coredump
> > >          - 99.88% elf_core_dump
> > >             - dump_user_range
> > >                - 98.12% get_dump_page
> > >                   - 64.19% __get_user_pages
> > >                      - 40.92% gup_vma_lookup
> > >                         - find_vma
> > >                            - mt_find
> > >                                 4.21% __rcu_read_lock
> > >                                 1.33% __rcu_read_unlock
> > >                      - 3.14% check_vma_flags
> > >                           0.68% vma_is_secretmem
> > >                        0.61% __cond_resched
> > >                        0.60% vma_pgtable_walk_end
> > >                        0.59% vma_pgtable_walk_begin
> > >                        0.58% no_page_table
> > >                   - 15.13% down_read_killable
> > >                        0.69% __cond_resched
> > >                     13.84% up_read
> > >                  0.58% __cond_resched
> > >
> > >
> > > Almost 29% of time is spent relocking the mmap semaphore in
> > > __get_user_pages. This most likely can operate locklessly in the fast
> > > path. Even if somehow not, chances are the lock can be held across
> > > multiple calls.
> > >
> > > mt_find spends most of its time issuing a rep stos of 48 bytes (would
> > > be faster to rep mov 6 times instead). This is the compiler being nasty,
> > > I'll maybe look into it.
> > >
> > > However, I strongly suspect the current iteration method is just slow
> > > due to repeat mt_find calls and The Right Approach(tm) would make this
> > > entire thing finish within milliseconds by iterating the maple tree
> > > instead, but then the mm folk would have to be consulted on how to
> > > approach this and it may be time consuming to implement.
> > >
> > > Sorting out relocking should be an easily achievable & measurable win
> > > (no interest on my end).
> >
> > As much as I agree the code is dumb, doing what you suggest with mmap_sem
> > isn't going to be easy. You cannot call dump_emit_page() with mmap_sem held
> > as that will cause lock inversion between mmap_sem and whatever filesystem
> > locks we have to take. So the fix would have to involve processing larger
> > batches of address space at once (which should also somewhat amortize the
> > __get_user_pages() setup costs). Not that hard to do but I wanted to spell
> > it out in case someone wants to pick up this todo item :)
> >
> 
> Is the lock really needed to begin with?
> 
> Suppose it is.
> 
> In this context there are next to no pages found, but there is a
> gazillion relocks as the entire VA is being walked.

Do I understand correctly that all the relocks are to look up the VMA
associated with each address, one page at a time?  That's especially
wasteful as dump_user_range() is called separately for each VMA, so it's
going to find the same VMA every time anyway.

> Bare minimum patch which would already significantly help would start
> with the lock held and only relock if there is a page to dump, should
> be very easy to add.

That seems like a good idea.

> I however vote for someone mm-savvy to point out an easy way (if any)
> to just iterate pages which are there instead.

It seems like some of the <linux/pagewalk.h> APIs might be relevant?
Not sure which one has the right semantics.  Can we just use
folio_walk_start()?

I guess the main complexity is every time we find a page, we have to
stop the walk, unlock mmap_sem, call dump_emit_page(), and restart the
walk from the next address.  Maybe an mm expert can weigh in.

> -- 
> Mateusz Guzik <mjguzik gmail.com>
>
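
If folio_walk does fit, the per-address lookup might look something like the
fragment below. The API details here (the folio_walk_start()/folio_walk_end()
signatures, the fw.page field, and the rule that the folio comes back with
its page table lock held, so a reference must be taken before sleeping) are
one reading of <linux/pagewalk.h> and should be double-checked by someone
mm-savvy:

	struct folio_walk fw;
	struct folio *folio;
	struct page *page = NULL;

	/* Called with the mmap lock held for read; vma is the VMA being
	 * dumped and addr the current page-aligned address. */
	folio = folio_walk_start(&fw, vma, addr, 0);
	if (folio) {
		page = fw.page;
		get_page(page);
		folio_walk_end(&fw, vma);
	}
	/* Then: unlock, dump_emit_page(page) or dump_skip(), relock --
	 * the same drop/retake dance as in the relocking sketch above. */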
Mateusz Guzik Jan. 19, 2025, 10:28 a.m. UTC | #6
On Thu, Jan 16, 2025 at 6:04 PM Tavian Barnes <tavianator@tavianator.com> wrote:
>
> On Thu, Jan 16, 2025 at 01:04:42PM +0100, Mateusz Guzik wrote:
> > In this context there are next to no pages found, but there is a
> > gazillion relocks as the entire VA is being walked.
>
> Do I understand correctly that all the relocks are to look up the VMA
> associated with each address, one page at a time?  That's especially
> wasteful as dump_user_range() is called separately for each VMA, so it's
> going to find the same VMA every time anyway.
>

It is indeed a loop over VMAs, and then over the entire range in
PAGE_SIZE'd steps.

> > I however vote for someone mm-savvy to point out an easy way (if any)
> > to just iterate pages which are there instead.
>
> It seems like some of the <linux/pagewalk.h> APIs might be relevant?
> Not sure which one has the right semantics.  Can we just use
> folio_walk_start()?
>
> I guess the main complexity is every time we find a page, we have to
> stop the walk, unlock mmap_sem, call dump_emit_page(), and restart the
> walk from the next address.  Maybe an mm expert can weigh in.
>

I don't know the way; based on my epsilon understanding of the area I
*suspect* walking the maple tree would do it.

Patch

diff --git a/fs/coredump.c b/fs/coredump.c
index d48edb37bc35..fd29d3f15f1e 100644
--- a/fs/coredump.c
+++ b/fs/coredump.c
@@ -950,6 +950,10 @@  int dump_user_range(struct coredump_params *cprm, unsigned long start,
 			}
 		} else {
 			dump_skip(cprm, PAGE_SIZE);
+			if (dump_interrupted()) {
+				dump_page_free(dump_page);
+				return 0;
+			}
 		}
 		cond_resched();
 	}