Message ID: 87sg1p4h0g.fsf_-_@disp2133 (mailing list archive)
State: New, archived
Series: coredump: Limit what can interrupt coredumps
On Thu, Jun 10, 2021 at 1:11 PM Eric W. Biederman <ebiederm@xmission.com> wrote:
>
> Make the coredump code more robust by explicitly testing for all of
> the wakeup conditions the coredump code supports. This prevents
> new wakeup conditions from breaking the coredump code, as well
> as fixing the current issue.

Thanks, applied. And lots of thanks to Olivier who kept debugging the odd coredump behavior he saw.

Linus
On Thu, 2021-06-10 at 15:11 -0500, Eric W. Biederman wrote:
>
> Olivier Langlois has been struggling with coredumps being incompletely
> written in processes using io_uring.
>
> Olivier Langlois <olivier@trillion01.com> writes:
> > io_uring is a big user of task_work and any event that io_uring made a
> > task waiting for that occurs during the core dump generation will
> > generate a TIF_NOTIFY_SIGNAL.
> >
> > Here are the detailed steps of the problem:
> > 1. io_uring calls vfs_poll() to install a task to a file wait queue
> >    with io_async_wake() as the wakeup function cb from
> >    io_arm_poll_handler()
> > 2. wakeup function ends up calling task_work_add() with TWA_SIGNAL
> > 3. task_work_add() sets the TIF_NOTIFY_SIGNAL bit by calling
> >    set_notify_signal()
>
> The coredump code deliberately supports being interrupted by SIGKILL,
> and depends upon prepare_signal to filter out all other signals. Now
> that signal_pending includes wake ups for TIF_NOTIFY_SIGNAL this hack
> in dump_emit by the coredump code no longer works.
>
> Make the coredump code more robust by explicitly testing for all of
> the wakeup conditions the coredump code supports. This prevents
> new wakeup conditions from breaking the coredump code, as well
> as fixing the current issue.
>
> The filesystem code that the coredump code uses already limits
> itself to only aborting on fatal_signal_pending. So it should
> not develop surprising wake-up reasons either.
>
> v2: Don't remove the now unnecessary code in prepare_signal.
>
> Cc: stable@vger.kernel.org
> Fixes: 12db8b690010 ("entry: Add support for TIF_NOTIFY_SIGNAL")
> Reported-by: Olivier Langlois <olivier@trillion01.com>
> Signed-off-by: "Eric W. Biederman" <ebiederm@xmission.com>
> ---
>  fs/coredump.c | 2 +-
>  1 file changed, 1 insertion(+), 1 deletion(-)
>
> diff --git a/fs/coredump.c b/fs/coredump.c
> index 2868e3e171ae..c3d8fc14b993 100644
> --- a/fs/coredump.c
> +++ b/fs/coredump.c
> @@ -519,7 +519,7 @@ static bool dump_interrupted(void)
>  	 * but then we need to teach dump_write() to restart and clear
>  	 * TIF_SIGPENDING.
>  	 */
> -	return signal_pending(current);
> +	return fatal_signal_pending(current) || freezing(current);
>  }
>
>  static void wait_for_dump_helpers(struct file *file)

Tested-by: Olivier Langlois <olivier@trillion01.com>
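The mechanism behind the breakage is visible in the signal helpers that commit 12db8b690010 introduced. Roughly as in the 5.13-era <linux/sched/signal.h> (reproduced from memory here, so verify against your tree), signal_pending() now also trips on TIF_NOTIFY_SIGNAL, while fatal_signal_pending() only reacts to a queued fatal signal:

static inline int task_sigpending(struct task_struct *p)
{
	return unlikely(test_tsk_thread_flag(p, TIF_SIGPENDING));
}

static inline int signal_pending(struct task_struct *p)
{
	/*
	 * TIF_NOTIFY_SIGNAL isn't really a signal, but it requires the same
	 * behavior in terms of ensuring that we break out of wait loops
	 * so that notify signal callbacks can be processed.
	 */
	if (unlikely(test_tsk_thread_flag(p, TIF_NOTIFY_SIGNAL)))
		return 1;
	return task_sigpending(p);
}

static inline int fatal_signal_pending(struct task_struct *p)
{
	return task_sigpending(p) && __fatal_signal_pending(p);
}

With dump_interrupted() switched to fatal_signal_pending(), io_uring's task_work notifications no longer abort the dump loop, which is what the patch above intends.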
On 6/12/21 8:36 AM, Olivier Langlois wrote:
> On Thu, 2021-06-10 at 15:11 -0500, Eric W. Biederman wrote:
> [...]
>
> Tested-by: Olivier Langlois <olivier@trillion01.com>

Thanks Olivier and Eric for taking care of this. I've been mostly offline for more than a week, back at it next week.
Eric, et al, sorry for the delay, I didn't read emails for several days.

On 06/10, Eric W. Biederman wrote:
>
> v2: Don't remove the now unnecessary code in prepare_signal.

No, that code is still needed. Otherwise any fatal signal will be turned into SIGKILL.

> --- a/fs/coredump.c
> +++ b/fs/coredump.c
> @@ -519,7 +519,7 @@ static bool dump_interrupted(void)
>  	 * but then we need to teach dump_write() to restart and clear
>  	 * TIF_SIGPENDING.
>  	 */
> -	return signal_pending(current);
> +	return fatal_signal_pending(current) || freezing(current);
>  }

Well yes, this is what the comment says.

But note that there is another reason why dump_interrupted() returns true if signal_pending(): it assumes that __dump_emit()->__kernel_write() may fail anyway if signal_pending() is true. Say, pipe_write(), or iirc nfs, perhaps something else...

That is why zap_threads() clears TIF_SIGPENDING. Perhaps it should clear TIF_NOTIFY_SIGNAL as well, and we should change io_uring to not abuse the dumping threads?

Or perhaps we should change __dump_emit() to clear signal_pending() and restart __kernel_write() if it fails or returns a short write.

Otherwise the change above doesn't look like a full fix to me.

Oleg.
Oleg Nesterov <oleg@redhat.com> writes:

> On 06/10, Eric W. Biederman wrote:
>>
>> v2: Don't remove the now unnecessary code in prepare_signal.
>
> No, that code is still needed. Otherwise any fatal signal will be
> turned into SIGKILL.
>
>> -	return signal_pending(current);
>> +	return fatal_signal_pending(current) || freezing(current);
>
> Well yes, this is what the comment says.
>
> But note that there is another reason why dump_interrupted() returns
> true if signal_pending(): it assumes that __dump_emit()->__kernel_write()
> may fail anyway if signal_pending() is true. Say, pipe_write(), or iirc
> nfs, perhaps something else...

The pipe_write case is a good case to consider. In general, filesystems are only allowed to stop if fatal_signal_pending. It is an ancient unix/posix thing.

> That is why zap_threads() clears TIF_SIGPENDING. Perhaps it should clear
> TIF_NOTIFY_SIGNAL as well, and we should change io_uring to not abuse
> the dumping threads?

I would very much like some clarity on TIF_NOTIFY_SIGNAL. At the very least it would be nice if it could get renamed TIF_NOTIFY_TASK_WORK. I don't know what it has to do with signals.

> Or perhaps we should change __dump_emit() to clear signal_pending() and
> restart __kernel_write() if it fails or returns a short write.

Given that the code needs to handle pipes, that seems a reasonable thing to do.

Note: all of the blocking of new signals in prepare_signal is still in place, so only a SIGKILL can come in and set TIF_SIGPENDING. So I would say that the current fix is correct (and safe to backport), but it still requires some magic in prepare_signal until we do more.

I don't understand the logic of adding work to non-io_uring threads with task_work_add well enough to understand why that happens in the first place. There are a lot of silly corners here. Yes, please let's keep on digging.

Eric
On 06/14, Eric W. Biederman wrote:
>
> I would very much like some clarity on TIF_NOTIFY_SIGNAL. At the very
> least it would be nice if it could get renamed TIF_NOTIFY_TASK_WORK.

No, no, no ;) I think that, for example, the freezer should be changed to use set_notify_signal() rather than fake_signal_wake_up(). Livepatch. And probably it will have more users.

> I don't understand the logic of adding work to non-io_uring threads
> with task_work_add well enough to understand why that happens in the
> first place.

Same here.

Oleg.
Oleg Nesterov <oleg@redhat.com> writes:

>> -	return signal_pending(current);
>> +	return fatal_signal_pending(current) || freezing(current);
>
> Well yes, this is what the comment says.
>
> But note that there is another reason why dump_interrupted() returns
> true if signal_pending(): it assumes that __dump_emit()->__kernel_write()
> may fail anyway if signal_pending() is true. Say, pipe_write(), or iirc
> nfs, perhaps something else...
>
> That is why zap_threads() clears TIF_SIGPENDING. Perhaps it should clear
> TIF_NOTIFY_SIGNAL as well, and we should change io_uring to not abuse
> the dumping threads?
>
> Or perhaps we should change __dump_emit() to clear signal_pending() and
> restart __kernel_write() if it fails or returns a short write.
>
> Otherwise the change above doesn't look like a full fix to me.

Agreed. The coredump to a pipe will still be short. That needs something additional.

The problem Olivier Langlois <olivier@trillion01.com> reported was core dumps coming up short because TIF_NOTIFY_SIGNAL was being set during a core dump. We can see this with pipe_write returning -ERESTARTSYS on a full pipe if signal_pending, which includes TIF_NOTIFY_SIGNAL, is true.

Looking further: if the thread that is core dumping initiated any io_uring work, then io_ring_exit_work will use task_work_add to request that thread clean up its io_uring state.

Perhaps we can put a big comment in dump_emit and, if we get back -ERESTARTSYS, run tracehook_notify_signal. I am not seeing any locks held at that point in the coredump, so it should be safe. The coredump is run inside of file_start_write, which is the only potential complication.

The code flow is complicated, but it looks like the entire point of the exercise is to call io_uring_del_task_file on the originating thread. I suppose that keeps the locking of the xarray in io_uring_task simple.

Hmm. All of this comes from io_uring_release. How do we get to io_uring_release? The coredump should be catching everything in exit_mm before exit_files?

Confused and hopeful someone can explain to me what is going on, and perhaps simplify it.

Eric
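To make the restart idea concrete, here is a rough sketch (not a tested patch) of a __dump_emit() that retries on -ERESTARTSYS along the lines Oleg and Eric discuss, assuming the 5.13-era fs/coredump.c structure and tracehook_notify_signal() from <linux/tracehook.h>:

static int __dump_emit(struct coredump_params *cprm, const void *addr, int nr)
{
	struct file *file = cprm->file;
	loff_t pos = file->f_pos;
	ssize_t n;

	if (cprm->written + nr > cprm->limit)
		return 0;
retry:
	if (dump_interrupted())
		return 0;
	n = __kernel_write(file, addr, nr, &pos);
	if (n == -ERESTARTSYS) {
		/*
		 * A TIF_NOTIFY_SIGNAL wakeup (e.g. io_uring task_work)
		 * aborted pipe_write(). Run the queued task_work, which
		 * also clears TIF_NOTIFY_SIGNAL, then retry the write
		 * instead of truncating the dump.
		 */
		tracehook_notify_signal();
		goto retry;
	}
	if (n != nr)
		return 0;
	file->f_pos = pos;
	cprm->written += n;
	cprm->pos += n;
	return 1;
}

A complete fix would also have to deal with short writes, as Oleg notes, since a write can be interrupted after transferring part of the buffer.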
On Tue, 2021-06-15 at 17:08 -0500, Eric W. Biederman wrote:
> Agreed. The coredump to a pipe will still be short. That needs
> something additional.
>
> The problem Olivier Langlois <olivier@trillion01.com> reported was
> core dumps coming up short because TIF_NOTIFY_SIGNAL was being
> set during a core dump.
>
> We can see this with pipe_write returning -ERESTARTSYS
> on a full pipe if signal_pending, which includes TIF_NOTIFY_SIGNAL,
> is true.

Eric,

I redid my test, but this time, instead of dumping directly into a file, I let the coredump be piped to the systemd coredump module, and the coredump generation isn't working as expected when piping.

So your code review conclusions are correct.
Olivier Langlois <olivier@trillion01.com> writes:

> I redid my test, but this time, instead of dumping directly into a file,
> I let the coredump be piped to the systemd coredump module, and the
> coredump generation isn't working as expected when piping.
>
> So your code review conclusions are correct.

Thank you for confirming that. Do you know how your test program is using io_uring?

I have been trying to put the pieces together on what io_uring is doing that stops the coredump. The fact that it takes a little while before it kills the coredump is a little puzzling. The code looks like all of the io_uring operations should have been canceled before the coredump starts.

Thanks,
Eric
On Wed, 2021-06-16 at 15:00 -0500, Eric W. Biederman wrote:
> Thank you for confirming that. Do you know how your test program is
> using io_uring?
>
> I have been trying to put the pieces together on what io_uring is doing
> that stops the coredump. The fact that it takes a little while before
> it kills the coredump is a little puzzling. The code looks like all of
> the io_uring operations should have been canceled before the coredump
> starts.

With a very simple setup, I guess that this could easily be reproducible. Make a TCP connection with a server that is streaming non-stop data and enter a loop where you keep initiating async IORING_OP_READ operations on your TCP fd. Once you have that, manually sending a SIGSEGV is a surefire way to stumble into the problem. This is how I am testing the patches.

IRL, it is possible to call io_uring_enter() to submit operations and return from the syscall without waiting on all events to have completed. Once the process is back in userspace, if it stumbles into a bug that triggers a coredump, any remaining pending I/O operations can set TIF_NOTIFY_SIGNAL while the coredump is generated.

I have read the part of your previous email where you share the result of your ongoing investigation. I didn't comment, as the definitive references in io_uring matters are Jens and Pavel, but I am going to share my opinion on the matter. I think that you did put the finger on the code cleaning up the io_uring instance in regards to pending operations. It seems to be io_uring_release(), which is probably called from exit_files(), which happens to be after the call to exit_mm().

At first, I did entertain the idea of considering whether it could be possible to duplicate some of the operations performed by io_uring_release() related to the infamous TIF_NOTIFY_SIGNAL setting into io_uring_files_cancel(), which is called before exit_mm(). But the idea is useless, as it is not the other threads of the group that are causing the TIF_NOTIFY_SIGNAL problem. It is the thread calling do_coredump(), which is done by the signal handling code even before that thread enters do_exit() and starts to be cleaned up. That thread, when it enters do_coredump(), is still fully loaded and operational in terms of io_uring functionality.

I guess that this "cancel all pending io_uring operations" hook would have to be called from do_coredump() or from get_signal(), but if that is the way to go, I feel that this is a change major enough that I wouldn't dare going there without the blessing of the maintainers in cause...
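A minimal userspace sketch of the reproducer Olivier describes, assuming liburing; the server address and port are hypothetical placeholders, and error handling is omitted for brevity:

#include <arpa/inet.h>
#include <liburing.h>
#include <netinet/in.h>
#include <signal.h>
#include <sys/socket.h>

int main(void)
{
	struct sockaddr_in addr = {
		.sin_family = AF_INET,
		.sin_port   = htons(9000),	/* hypothetical streaming server */
	};
	struct io_uring ring;
	static char buf[64 * 1024];
	int fd, i;

	inet_pton(AF_INET, "127.0.0.1", &addr.sin_addr);
	fd = socket(AF_INET, SOCK_STREAM, 0);
	connect(fd, (struct sockaddr *)&addr, sizeof(addr));

	io_uring_queue_init(8, &ring, 0);

	/* Keep async reads in flight so a completion can race the dump. */
	for (i = 0; i < 8; i++) {
		struct io_uring_sqe *sqe = io_uring_get_sqe(&ring);
		io_uring_prep_read(sqe, fd, buf, sizeof(buf), 0);
	}
	io_uring_submit(&ring);

	raise(SIGSEGV);		/* dump core while the reads are pending */
	return 0;
}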
On Tue, 2021-06-15 at 17:08 -0500, Eric W. Biederman wrote:
> [...]
>
> Hmm. All of this comes from io_uring_release.
> How do we get to io_uring_release? The coredump should
> be catching everything in exit_mm before exit_files?
>
> Confused and hopeful someone can explain to me what is going on,
> and perhaps simplify it.
>
> Eric

Hi all,

I didn't forget about this remaining issue and I have kept thinking about it on and off.

I did try the following on 5.12.19:

diff --git a/fs/coredump.c b/fs/coredump.c
index 07afb5ddb1c4..614fe7a54c1a 100644
--- a/fs/coredump.c
+++ b/fs/coredump.c
@@ -41,6 +41,7 @@
 #include <linux/fs.h>
 #include <linux/path.h>
 #include <linux/timekeeping.h>
+#include <linux/io_uring.h>
 
 #include <linux/uaccess.h>
 #include <asm/mmu_context.h>
@@ -625,6 +626,8 @@ void do_coredump(const kernel_siginfo_t *siginfo)
 		need_suid_safe = true;
 	}
 
+	io_uring_files_cancel(current->files);
+
 	retval = coredump_wait(siginfo->si_signo, &core_state);
 	if (retval < 0)
 		goto fail_creds;
--
2.32.0

With my current understanding, io_uring_files_cancel is supposed to cancel everything that might set TIF_NOTIFY_SIGNAL.

I must report that in my testing with generating a core dump through a pipe with the modif above, I still get truncated core dumps.

systemd is having a weird error:
[ 2577.870742] systemd-coredump[4056]: Failed to get COMM: No such process

and nothing is captured. So I have replaced it with a very simple shell script:

$ cat /proc/sys/kernel/core_pattern
|/home/lano1106/bin/pipe_core.sh %e %p

~/bin $ cat pipe_core.sh
#!/bin/sh

cat > /home/lano1106/core/core.$1.$2

BFD: warning: /home/lano1106/core/core.test.10886 is truncated: expected core file size >= 24129536, found: 61440

I conclude from my attempt that maybe io_uring_files_cancel is not 100% cleaning everything that it should clean.
On 8/5/21 9:06 AM, Olivier Langlois wrote:
> [...]
>
> I conclude from my attempt that maybe io_uring_files_cancel is not 100%
> cleaning everything that it should clean.

I just ran into this problem also - coredumps from an io_uring program to a pipe are truncated. But I am using kernel 5.10.57, which does NOT have commit 12db8b690010 ("entry: Add support for TIF_NOTIFY_SIGNAL") or commit 06af8679449d ("coredump: Limit what can interrupt coredumps"). Kernel 5.4 works though, so I bisected the problem to commit f38c7e3abfba ("io_uring: ensure async buffered read-retry is setup properly") in kernel 5.9. Note that my io_uring program uses only async buffered reads, which may be why this particular commit makes a difference to my program.

My io_uring program is a multi-purpose long-running program with many threads. Most threads don't use io_uring but a few of them do. Normally, my core dumps are piped to a program so that they can be compressed before being written to disk, but I can also test writing the core dumps directly to disk. This is what I have found:

*) Unpatched 5.10.57: if a thread that doesn't use io_uring triggers a coredump, the core file is written correctly, whether it is written to disk or piped to a program, even if another thread is using io_uring at the same time.

*) Unpatched 5.10.57: if a thread that uses io_uring triggers a coredump, the core file is truncated, whether written directly to disk or piped to a program.

*) 5.10.57+backport 06af8679449d: if a thread that uses io_uring triggers a coredump, and the core is written directly to disk, then it is written correctly.

*) 5.10.57+backport 06af8679449d: if a thread that uses io_uring triggers a coredump, and the core is piped to a program, then it is truncated.

*) 5.10.57+revert f38c7e3abfba: core dumps are written correctly, whether written directly to disk or piped to a program.

Tony Battersby
Cybernetics
On Tue, 2021-08-10 at 17:48 -0400, Tony Battersby wrote:
> [...]
>
> *) 5.10.57+revert f38c7e3abfba: core dumps are written correctly,
> whether written directly to disk or piped to a program.
>
> Tony Battersby
> Cybernetics

Tony, these are super interesting details. I'm leaving for a few days, so I will not be able to look into it until I am back, but here is my interpretation of your findings:

f38c7e3abfba makes it more likely that your task ends up in an fd read wait queue. Previously, the io_uring req queuing was failing and returning EAGAIN; now it ends up using io_uring fast poll.

When the core dump gets written through a pipe, pipe_write must block waiting for some event. If the task gets woken up by the io_uring wait queue entry instead, it must somehow make pipe_write fail.

So the problem must be a mix of TIF_NOTIFY_SIGNAL and the fact that io_uring wait queue entries aren't cleaned up while doing the core dump. I have a new modif to try out. I'll hopefully be able to submit a patch to fix that once I come back (I cannot do it now or else I'll never leave ;-))
On 8/10/21 3:48 PM, Tony Battersby wrote:
> [...]
>
> *) 5.10.57+revert f38c7e3abfba: core dumps are written correctly,
> whether written directly to disk or piped to a program.

That is very interesting. Like Olivier mentioned, it's not that actual commit, but rather the change of behavior implemented by it. Before that commit, we'd hit the async workers more often, whereas after it we do the correct retry method, where the retry is driven by the wakeup when the page is unlocked. This is purely speculation, but perhaps the fact that the process potentially changes state mid dump is why the dump ends up being truncated?

I'd love to dive into this and try and figure it out. Absent a test case, at least the above gives me an idea of what to try out. I'll see if it makes it easier for me to create a case that does result in a truncated core dump.
On 8/11/21 9:55 PM, Jens Axboe wrote:
> That is very interesting. Like Olivier mentioned, it's not that actual
> commit, but rather the change of behavior implemented by it.
> [...]
> I'd love to dive into this and try and figure it out. Absent a test
> case, at least the above gives me an idea of what to try out. I'll see
> if it makes it easier for me to create a case that does result in a
> truncated core dump.

If it helps, a "good" coredump from my program is about 350 MB compressed down to about 7 MB by bzip2. A truncated coredump varies in size from about 60 KB to about 2 MB before compression. The program that receives the coredump uses bzip2 to compress the data before writing it to disk.

Tony
On Wed, 2021-08-11 at 19:55 -0600, Jens Axboe wrote:
> [...]
>
> That is very interesting. Like Olivier mentioned, it's not that actual
> commit, but rather the change of behavior implemented by it.
> [...]
> I'd love to dive into this and try and figure it out. Absent a test
> case, at least the above gives me an idea of what to try out. I'll see
> if it makes it easier for me to create a case that does result in a
> truncated core dump.

Jens,

When I first encountered the issue, the very first thing that I tried was to create a simple test program that would synthesize the problem. After a few time-consuming failed attempts, I gave up on the idea and simply settled on my prod program, which showcases the problem systematically every time that I kill the process with a SEGV signal. In a nutshell, all the program does is issue read operations with io_uring on a TCP socket on which there is a constant data stream.

Now that I have a better understanding of what is going on, I think that one way to reproduce the problem consistently could be along these lines (see the sketch following this message):

1. Create a pipe
2. fork a child
3. Initiate a read operation on the pipe with io_uring from the child
4. Let the parent kill its child with a core dump generating signal.
5. Write something in the pipe from the parent so that the io_uring read operation completes while the core dump is generated.

I guess that I'll end up doing that if I cannot fix the issue with my current setup, but here is what I have attempted so far:

1. Call io_uring_files_cancel from do_coredump
2. Same as #1 but also make sure that TIF_NOTIFY_SIGNAL is cleared on returning from io_uring_files_cancel

Those attempts didn't work, but lurking in the io_uring dev mailing list is starting to pay off. I thought that I had reached the bottom of the rabbit hole in my journey of understanding io_uring, but the recent patch set sent by Hao Xu

https://lore.kernel.org/io-uring/90fce498-968e-6812-7b6a-fdf8520ea8d9@kernel.dk/T/#t

made me realize that I still haven't assimilated all the small io_uring nuances...

Here is my feedback. From a casual io_uring code reader's point of view, it is not 100% obvious what the difference is between io_uring_files_cancel and io_uring_task_cancel. It seems like io_uring_files_cancel is cancelling polls only if they have the REQ_F_INFLIGHT flag set. I have no idea what an inflight request means and why someone would want to call io_uring_files_cancel over io_uring_task_cancel. I guess that if I were to meditate on the question for a few hours, I would at some point have some illumination strike me, but I believe that it could be a good idea to document those concepts in the code to help casual readers...

Bottom line, I now understand that io_uring_files_cancel does not cancel all the requests. Therefore, without fully understanding what I am doing, I am going to replace my call to io_uring_files_cancel from do_coredump with io_uring_task_cancel and see if this finally fixes the issue for good.

What I am trying to do is to cancel pending io_uring requests to make sure that TIF_NOTIFY_SIGNAL isn't set while the core dump is generated. Maybe another solution would simply be to modify __dump_emit to make it resilient to TIF_NOTIFY_SIGNAL, as Eric W. Biederman originally suggested. Or maybe do both...

Not sure which approach is best. If someone has an opinion, I would be curious to hear it.

Greetings,
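One possible shape of the pipe-based reproducer sketched in the steps above, again assuming liburing; the usleep() is a crude stand-in for real parent/child synchronization, and the child needs a nonzero core rlimit (ulimit -c unlimited) for a dump to be produced:

#include <liburing.h>
#include <signal.h>
#include <sys/types.h>
#include <sys/wait.h>
#include <unistd.h>

int main(void)
{
	int pipefd[2];
	pid_t child;

	pipe(pipefd);				/* 1. create a pipe */

	child = fork();				/* 2. fork a child */
	if (child == 0) {
		struct io_uring ring;
		struct io_uring_sqe *sqe;
		static char buf[4096];

		/* 3. child arms an async read on the pipe */
		io_uring_queue_init(4, &ring, 0);
		sqe = io_uring_get_sqe(&ring);
		io_uring_prep_read(sqe, pipefd[0], buf, sizeof(buf), 0);
		io_uring_submit(&ring);
		pause();			/* wait to be killed */
	}

	usleep(100 * 1000);			/* crude: let the child submit */
	kill(child, SIGSEGV);			/* 4. trigger the core dump */
	write(pipefd[1], "x", 1);		/* 5. complete the read mid-dump */
	waitpid(child, NULL, 0);
	return 0;
}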
On 8/15/21 9:42 PM, Olivier Langlois wrote:
> [...]
>
> Here is my feedback. From a casual io_uring code reader's point of view,
> it is not 100% obvious what the difference is between
> io_uring_files_cancel and io_uring_task_cancel

As you mentioned, io_uring_task_cancel() cancels and waits for all requests submitted by current task, used in exec() and SQPOLL because of potential races. io_uring_task_cancel() cancels only selected ones and io_uring_files_cancel() cancels and waits only some specific requests that we absolutely have to, e.g. in 5.15 it'll be only requests referencing the ring itself. It's used on normal task exit. io_uring_task_cancel() cancels and waits all requests submitted by current task, used on exec() because of races. As you mentioned

> It seems like io_uring_files_cancel is cancelling polls only if they
> have the REQ_F_INFLIGHT flag set.
>
> I have no idea what an inflight request means and why someone would
> want to call io_uring_files_cancel over io_uring_task_cancel.
> [...]
On 8/16/21 2:02 PM, Pavel Begunkov wrote:
> On 8/15/21 9:42 PM, Olivier Langlois wrote:
> [...]
>
> As you mentioned, io_uring_task_cancel() cancels and waits for all
> requests submitted by current task, used in exec() and SQPOLL because
> of potential races.

Apologies for this draft rumbling...

As you mentioned, io_uring_task_cancel() cancels and waits for all requests submitted by the current task; it is used in exec() and SQPOLL because of potential races. io_uring_files_cancel() cancels only selected ones, e.g. in 5.15 it will be only requests operating on io_uring, and is used during normal exit. Agree that the names may be not too descriptive.

>> It seems like io_uring_files_cancel is cancelling polls only if they
>> have the REQ_F_INFLIGHT flag set.
>>
>> I have no idea what an inflight request means and why someone would
>> want to call io_uring_files_cancel over io_uring_task_cancel.
>> [...]
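For reference, the two entry points being compared look roughly like this in the 5.12-era <linux/io_uring.h> (reproduced from memory, with comments paraphrasing the distinction Pavel draws, so verify against your tree):

#if defined(CONFIG_IO_URING)
void __io_uring_task_cancel(void);
void __io_uring_files_cancel(struct files_struct *files);

/* Cancel and wait for *all* requests of the current task (exec, SQPOLL). */
static inline void io_uring_task_cancel(void)
{
	if (current->io_uring)
		__io_uring_task_cancel();
}

/* Cancel only the REQ_F_INFLIGHT subset that must not outlive the files
 * (normal task exit). */
static inline void io_uring_files_cancel(struct files_struct *files)
{
	if (current->io_uring)
		__io_uring_files_cancel(files);
}
#endif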
On 8/15/21 2:42 PM, Olivier Langlois wrote:
> On Wed, 2021-08-11 at 19:55 -0600, Jens Axboe wrote:
>> On 8/10/21 3:48 PM, Tony Battersby wrote:
>>> On 8/5/21 9:06 AM, Olivier Langlois wrote:
>>>>
>>>> Hi all,
>>>>
>>>> I didn't forget about this remaining issue and I have kept thinking
>>>> about it on and off.
>>>>
>>>> I tried the following on 5.12.19:
>>>>
>>>> diff --git a/fs/coredump.c b/fs/coredump.c
>>>> index 07afb5ddb1c4..614fe7a54c1a 100644
>>>> --- a/fs/coredump.c
>>>> +++ b/fs/coredump.c
>>>> @@ -41,6 +41,7 @@
>>>>  #include <linux/fs.h>
>>>>  #include <linux/path.h>
>>>>  #include <linux/timekeeping.h>
>>>> +#include <linux/io_uring.h>
>>>>
>>>>  #include <linux/uaccess.h>
>>>>  #include <asm/mmu_context.h>
>>>> @@ -625,6 +626,8 @@ void do_coredump(const kernel_siginfo_t *siginfo)
>>>>  		need_suid_safe = true;
>>>>  	}
>>>>
>>>> +	io_uring_files_cancel(current->files);
>>>> +
>>>>  	retval = coredump_wait(siginfo->si_signo, &core_state);
>>>>  	if (retval < 0)
>>>>  		goto fail_creds;
>>>> --
>>>> 2.32.0
>>>>
>>>> With my current understanding, io_uring_files_cancel is supposed to
>>>> cancel everything that might set TIF_NOTIFY_SIGNAL.
>>>>
>>>> I must report that in my testing, generating a core dump through a
>>>> pipe with the modification above, I still get truncated core dumps.
>>>>
>>>> systemd is reporting a weird error:
>>>> [ 2577.870742] systemd-coredump[4056]: Failed to get COMM: No such
>>>> process
>>>>
>>>> and nothing is captured, so I have replaced it with a very simple
>>>> shell script:
>>>> $ cat /proc/sys/kernel/core_pattern
>>>> |/home/lano1106/bin/pipe_core.sh %e %p
>>>>
>>>> ~/bin $ cat pipe_core.sh
>>>> #!/bin/sh
>>>>
>>>> cat > /home/lano1106/core/core.$1.$2
>>>>
>>>> BFD: warning: /home/lano1106/core/core.test.10886 is truncated:
>>>> expected core file size >= 24129536, found: 61440
>>>>
>>>> I conclude from my attempt that maybe io_uring_files_cancel is not
>>>> 100% cleaning everything that it should clean.
>>>>
>>> I just ran into this problem also - coredumps from an io_uring program
>>> to a pipe are truncated. But I am using kernel 5.10.57, which does NOT
>>> have commit 12db8b690010 ("entry: Add support for TIF_NOTIFY_SIGNAL")
>>> or commit 06af8679449d ("coredump: Limit what can interrupt
>>> coredumps"). Kernel 5.4 works though, so I bisected the problem to
>>> commit f38c7e3abfba ("io_uring: ensure async buffered read-retry is
>>> setup properly") in kernel 5.9. Note that my io_uring program uses
>>> only async buffered reads, which may be why this particular commit
>>> makes a difference to my program.
>>>
>>> My io_uring program is a multi-purpose long-running program with many
>>> threads. Most threads don't use io_uring but a few of them do.
>>> Normally, my core dumps are piped to a program so that they can be
>>> compressed before being written to disk, but I can also test writing
>>> the core dumps directly to disk. This is what I have found:
>>>
>>> *) Unpatched 5.10.57: if a thread that doesn't use io_uring triggers a
>>> coredump, the core file is written correctly, whether it is written to
>>> disk or piped to a program, even if another thread is using io_uring
>>> at the same time.
>>>
>>> *) Unpatched 5.10.57: if a thread that uses io_uring triggers a
>>> coredump, the core file is truncated, whether written directly to disk
>>> or piped to a program.
>>>
>>> *) 5.10.57+backport 06af8679449d: if a thread that uses io_uring
>>> triggers a coredump, and the core is written directly to disk, then it
>>> is written correctly.
>>>
>>> *) 5.10.57+backport 06af8679449d: if a thread that uses io_uring
>>> triggers a coredump, and the core is piped to a program, then it is
>>> truncated.
>>>
>>> *) 5.10.57+revert f38c7e3abfba: core dumps are written correctly,
>>> whether written directly to disk or piped to a program.
>>
>> That is very interesting. Like Olivier mentioned, it's not the actual
>> commit, but rather the change of behavior implemented by it. Before
>> that commit, we'd hit the async workers more often, whereas after it we
>> do the correct retry method where it's driven by the wakeup when the
>> page is unlocked. This is purely speculation, but perhaps the fact that
>> the process potentially changes state mid dump is why the dump ends up
>> being truncated?
>>
>> I'd love to dive into this and try and figure it out. Absent a test
>> case, at least the above gives me an idea of what to try out. I'll see
>> if it makes it easier for me to create a case that does result in a
>> truncated core dump.
>>
> Jens,
>
> When I first encountered the issue, the very first thing that I tried
> was to create a simple test program that would synthesize the problem.
>
> After a few time-consuming failed attempts, I gave up on that idea and
> simply settled on my prod program, which showcases the problem
> systematically every time I kill the process with a SEGV signal.
>
> In a nutshell, all the program does is issue read operations with
> io_uring on a TCP socket on which there is a constant data stream.
>
> Now that I have a better understanding of what is going on, I think
> that one way to reproduce the problem consistently could be along these
> lines:
>
> 1. Create a pipe.
> 2. Fork a child.
> 3. Initiate a read operation on the pipe with io_uring from the child.
> 4. Let the parent kill its child with a core dump generating signal.
> 5. Write something into the pipe from the parent so that the io_uring
> read operation completes while the core dump is generated.
>
> I guess that I'll end up doing that if I cannot fix the issue with my
> current setup, but here is what I have attempted so far:
>
> 1. Call io_uring_files_cancel from do_coredump.
> 2. Same as #1, but also make sure that TIF_NOTIFY_SIGNAL is cleared on
> returning from io_uring_files_cancel.
>
> Those attempts didn't work, but lurking on the io_uring dev mailing
> list is starting to pay off. I thought that I had reached the bottom of
> the rabbit hole in my journey of understanding io_uring, but the recent
> patch set sent by Hao Xu
>
> https://lore.kernel.org/io-uring/90fce498-968e-6812-7b6a-fdf8520ea8d9@kernel.dk/T/#t
>
> made me realize that I still haven't assimilated all the small io_uring
> nuances...
>
> Here is my feedback. From my casual io_uring code reader point of view,
> it is not 100% obvious what the difference is between
> io_uring_files_cancel and io_uring_task_cancel.
>
> It seems like io_uring_files_cancel cancels polls only if they have the
> REQ_F_INFLIGHT flag set.
>
> I have no idea what an in-flight request means and why someone would
> want to call io_uring_files_cancel over io_uring_task_cancel.
>
> I guess that if I were to meditate on the question for a few hours,
> some illumination would strike me at some point, but I believe that it
> could be a good idea to document those concepts in the code to help
> casual readers...
>
> Bottom line, I now understand that io_uring_files_cancel does not
> cancel all the requests. Therefore, without fully understanding what I
> am doing, I am going to replace my call to io_uring_files_cancel in
> do_coredump with io_uring_task_cancel and see if this finally fixes the
> issue for good.
>
> What I am trying to do is cancel pending io_uring requests to make sure
> that TIF_NOTIFY_SIGNAL isn't set while the core dump is generated.
>
> Maybe another solution would simply be to modify __dump_emit to make it
> resilient to TIF_NOTIFY_SIGNAL, as Eric W. Biederman originally
> suggested.
>
> Or maybe do both...
>
> Not sure which approach is best. If someone has an opinion, I would be
> curious to hear it.

It does indeed sound like it's TIF_NOTIFY_SIGNAL that will trigger some
signal_pending() and cause an interruption of the core dump. Just out of
curiosity, what is your /proc/sys/kernel/core_pattern set to? If it's
set to some piped process, can you try and set it to 'core' and see if
that eliminates the truncation of the core dumps for your case?
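[The five-step recipe Olivier sketches above maps onto a small liburing
program. What follows is an untested, hypothetical sketch of such a
reproducer; the buffer size, sleep, and queue depth are arbitrary, and
since the race between the dump and the pipe write is timing-dependent,
it may need tuning to trigger reliably.]

/*
 * Hypothetical reproducer: the child arms an io_uring read on a pipe,
 * the parent SIGSEGVs it, then writes to the pipe so the completion
 * (and its TIF_NOTIFY_SIGNAL task_work) races with the core dump.
 * Build with: gcc repro.c -luring
 */
#include <liburing.h>
#include <signal.h>
#include <stdlib.h>
#include <sys/wait.h>
#include <unistd.h>

int main(void)
{
	int pipefd[2];
	char buf[64];
	pid_t pid;

	if (pipe(pipefd) < 0)
		exit(1);
	pid = fork();
	if (pid == 0) {
		struct io_uring ring;
		struct io_uring_sqe *sqe;
		struct io_uring_cqe *cqe;

		/* Step 3: child arms a read on the pipe with io_uring */
		io_uring_queue_init(8, &ring, 0);
		sqe = io_uring_get_sqe(&ring);
		io_uring_prep_read(sqe, pipefd[0], buf, sizeof(buf), 0);
		io_uring_submit(&ring);
		/* Block with the read in flight; the SIGSEGV from the
		 * parent should arrive before this completes. */
		io_uring_wait_cqe(&ring, &cqe);
		pause();
	}
	sleep(1);
	/* Step 4: force a core dump with the io_uring read pending */
	kill(pid, SIGSEGV);
	/* Step 5: complete the read while the dump is being written */
	write(pipefd[1], "x", 1);
	waitpid(pid, NULL, 0);
	return 0;
}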
On 8/17/21 12:15 PM, Jens Axboe wrote:
> On 8/15/21 2:42 PM, Olivier Langlois wrote:
>> [...]
>> Not sure which approach is best. If someone has an opinion, I would be
>> curious to hear it.
>
> It does indeed sound like it's TIF_NOTIFY_SIGNAL that will trigger some
> signal_pending() and cause an interruption of the core dump. Just out of
> curiosity, what is your /proc/sys/kernel/core_pattern set to? If it's
> set to some piped process, can you try and set it to 'core' and see if
> that eliminates the truncation of the core dumps for your case?

And assuming that works, then I suspect this one would fix your issue
even with a piped core dump:

diff --git a/fs/coredump.c b/fs/coredump.c
index 07afb5ddb1c4..852737a9ccbf 100644
--- a/fs/coredump.c
+++ b/fs/coredump.c
@@ -41,6 +41,7 @@
 #include <linux/fs.h>
 #include <linux/path.h>
 #include <linux/timekeeping.h>
+#include <linux/io_uring.h>
 
 #include <linux/uaccess.h>
 #include <asm/mmu_context.h>
@@ -603,6 +604,7 @@ void do_coredump(const kernel_siginfo_t *siginfo)
 	};
 
 	audit_core_dumps(siginfo->si_signo);
+	io_uring_task_cancel();
 
 	binfmt = mm->binfmt;
 	if (!binfmt || !binfmt->core_dump)
On 8/17/21 2:24 PM, Jens Axboe wrote:
> On 8/17/21 12:15 PM, Jens Axboe wrote:
>> [...]
>> It does indeed sound like it's TIF_NOTIFY_SIGNAL that will trigger some
>> signal_pending() and cause an interruption of the core dump.
> And assuming that works, then I suspect this one would fix your issue
> even with a piped core dump:
>
> [io_uring_task_cancel() patch snipped]

FYI, I tested kernel 5.10.59 + backport 06af8679449d + the patch above
with my io_uring program. The coredump locked up even when writing the
core file directly to disk; the zombie process could not be killed with
"kill -9". Unfortunately I can't test with newer kernels without
spending some time on it, and I am too busy with other stuff right now.

My io_uring program does async buffered reads
(io_uring_prep_read()/io_uring_prep_readv()) from a raw disk partition
(no filesystem). One thread submits I/Os while another thread calls
io_uring_wait_cqe() and processes the completions. To trigger the
coredump, I added an intentional abort() in the thread that submits I/Os
after running for a second.

Tony Battersby
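[Tony's setup above is likewise easy to approximate. Below is a
hypothetical, untested sketch of such a test program; the device path,
queue depth, and one-second deadline are placeholders, and the single
ring shared between the submitter and the reaper mirrors the two-thread
arrangement he describes.]

/*
 * Sketch of the described test: one thread submits buffered reads from
 * a raw partition, the main thread reaps completions, and the submitter
 * abort()s after ~1 second to force a core dump mid-I/O.
 */
#include <liburing.h>
#include <fcntl.h>
#include <pthread.h>
#include <stdlib.h>
#include <time.h>

static struct io_uring ring;
static int fd;
static char buf[32][4096];

static void *submitter(void *arg)
{
	unsigned long long off = 0;
	time_t start = time(NULL);
	int i = 0;

	for (;;) {
		struct io_uring_sqe *sqe = io_uring_get_sqe(&ring);

		if (sqe) {	/* NULL when the SQ ring is full */
			io_uring_prep_read(sqe, fd, buf[i++ % 32], 4096, off);
			io_uring_submit(&ring);
			off += 4096;
		}
		if (time(NULL) - start >= 1)
			abort();	/* trigger the core dump */
	}
	return NULL;
}

int main(void)
{
	pthread_t t;

	fd = open("/dev/sdX1", O_RDONLY);	/* placeholder partition */
	if (fd < 0)
		exit(1);
	io_uring_queue_init(32, &ring, 0);
	pthread_create(&t, NULL, submitter, NULL);
	for (;;) {
		struct io_uring_cqe *cqe;

		/* reap completions until the abort() kills us */
		if (io_uring_wait_cqe(&ring, &cqe) == 0)
			io_uring_cqe_seen(&ring, cqe);
	}
	return 0;
}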
On 8/17/21 1:29 PM, Tony Battersby wrote:
> On 8/17/21 2:24 PM, Jens Axboe wrote:
>> [...]
> FYI, I tested kernel 5.10.59 + backport 06af8679449d + the patch above
> with my io_uring program. The coredump locked up even when writing the
> core file directly to disk; the zombie process could not be killed with
> "kill -9". Unfortunately I can't test with newer kernels without
> spending some time on it, and I am too busy with other stuff right now.

That sounds like 5.10-stable is missing some of the cancelation
backports, and your setup makes the cancelation stall because of that.
Need to go over the 5.11/5.12/5.13 fixes and ensure that we've got
everything we need for those stable versions, particularly 5.10.

> My io_uring program does async buffered reads
> (io_uring_prep_read()/io_uring_prep_readv()) from a raw disk partition
> (no filesystem). One thread submits I/Os while another thread calls
> io_uring_wait_cqe() and processes the completions. To trigger the
> coredump, I added an intentional abort() in the thread that submits
> I/Os after running for a second.

OK, so that one is also using task_work for the retry based async
buffered reads, so it makes sense.

Maybe a temporary work-around is to use 06af8679449d and eliminate the
pipe based coredump?
On 8/17/21 1:59 PM, Jens Axboe wrote:
> On 8/17/21 1:29 PM, Tony Battersby wrote:
>> [...]
>> FYI, I tested kernel 5.10.59 + backport 06af8679449d + the patch above
>> with my io_uring program. The coredump locked up even when writing the
>> core file directly to disk; the zombie process could not be killed
>> with "kill -9".
>
> OK, so that one is also using task_work for the retry based async
> buffered reads, so it makes sense.
>
> Maybe a temporary work-around is to use 06af8679449d and eliminate the
> pipe based coredump?

Another approach - don't allow TWA_SIGNAL task_work to get queued if
PF_SIGNALED has been set on the task. This is similar to how we reject
task_work_add() on process exit, and the callers must be able to handle
that already.

Can you test this one on top of your 5.10-stable?

diff --git a/fs/coredump.c b/fs/coredump.c
index 07afb5ddb1c4..ca7c1ee44ada 100644
--- a/fs/coredump.c
+++ b/fs/coredump.c
@@ -602,6 +602,14 @@ void do_coredump(const kernel_siginfo_t *siginfo)
 		.mm_flags = mm->flags,
 	};
 
+	/*
+	 * task_work_add() will refuse to add work after PF_SIGNALED has
+	 * been set, ensure that we flush any pending TIF_NOTIFY_SIGNAL work
+	 * if any was queued before that.
+	 */
+	if (test_thread_flag(TIF_NOTIFY_SIGNAL))
+		tracehook_notify_signal();
+
 	audit_core_dumps(siginfo->si_signo);
 
 	binfmt = mm->binfmt;
diff --git a/kernel/task_work.c b/kernel/task_work.c
index 1698fbe6f0e1..1ab28904adc4 100644
--- a/kernel/task_work.c
+++ b/kernel/task_work.c
@@ -41,6 +41,12 @@ int task_work_add(struct task_struct *task, struct callback_head *work,
 		head = READ_ONCE(task->task_works);
 		if (unlikely(head == &work_exited))
 			return -ESRCH;
+		/*
+		 * TIF_NOTIFY_SIGNAL notifications will interfere with
+		 * a core dump in progress, reject them.
+		 */
+		if ((task->flags & PF_SIGNALED) && notify == TWA_SIGNAL)
+			return -ESRCH;
 		work->next = head;
 	} while (cmpxchg(&task->task_works, head, work) != head);
On 8/17/21 5:28 PM, Jens Axboe wrote:
> Another approach - don't allow TWA_SIGNAL task_work to get queued if
> PF_SIGNALED has been set on the task. This is similar to how we reject
> task_work_add() on process exit, and the callers must be able to handle
> that already.
>
> Can you test this one on top of your 5.10-stable?
>
> [PF_SIGNALED/TIF_NOTIFY_SIGNAL patch snipped]

Doesn't compile. 5.10 doesn't have TIF_NOTIFY_SIGNAL.

Tony Battersby
On 8/17/21 3:39 PM, Tony Battersby wrote:
> On 8/17/21 5:28 PM, Jens Axboe wrote:
>> Can you test this one on top of your 5.10-stable?
>>
>> [...]
>
> Doesn't compile. 5.10 doesn't have TIF_NOTIFY_SIGNAL.

Oh right... Here's one hacked up for the 5.10 TWA_SIGNAL setup. Totally
untested...

diff --git a/fs/coredump.c b/fs/coredump.c
index c6acfc694f65..9e899ce67589 100644
--- a/fs/coredump.c
+++ b/fs/coredump.c
@@ -603,6 +603,19 @@ void do_coredump(const kernel_siginfo_t *siginfo)
 		.mm_flags = mm->flags,
 	};
 
+	/*
+	 * task_work_add() will refuse to add work after PF_SIGNALED has
+	 * been set, ensure that we flush any pending TWA_SIGNAL work
+	 * if any was queued before that.
+	 */
+	if (signal_pending(current) && (current->jobctl & JOBCTL_TASK_WORK)) {
+		task_work_run();
+		spin_lock_irq(&current->sighand->siglock);
+		current->jobctl &= ~JOBCTL_TASK_WORK;
+		recalc_sigpending();
+		spin_unlock_irq(&current->sighand->siglock);
+	}
+
 	audit_core_dumps(siginfo->si_signo);
 
 	binfmt = mm->binfmt;
diff --git a/kernel/task_work.c b/kernel/task_work.c
index 8d6e1217c451..93b3f262eb4a 100644
--- a/kernel/task_work.c
+++ b/kernel/task_work.c
@@ -39,6 +39,12 @@ int task_work_add(struct task_struct *task, struct callback_head *work,
 		head = READ_ONCE(task->task_works);
 		if (unlikely(head == &work_exited))
 			return -ESRCH;
+		/*
+		 * TWA_SIGNAL notifications will interfere with
+		 * a core dump in progress, reject them.
+		 */
+		if ((task->flags & PF_SIGNALED) && notify == TWA_SIGNAL)
+			return -ESRCH;
 		work->next = head;
 	} while (cmpxchg(&task->task_works, head, work) != head);
On 8/17/21 3:28 PM, Jens Axboe wrote:
> On 8/17/21 1:59 PM, Jens Axboe wrote:
>> [...]
>
> Another approach - don't allow TWA_SIGNAL task_work to get queued if
> PF_SIGNALED has been set on the task. This is similar to how we reject
> task_work_add() on process exit, and the callers must be able to handle
> that already.
>
> [PF_SIGNALED/TIF_NOTIFY_SIGNAL patch snipped]

Olivier, I sent a 5.10 version for Nathan, any chance you can test this
one for the current kernels? Basically this one should work for 5.11+,
and the later 5.10 version is just for 5.10. I'm going to send it out
separately for review.

I do think this is the right solution, barring a tweak maybe on testing
notify == TWA_SIGNAL first before digging into the task struct. But the
principle is sound, and it'll work for other users of TWA_SIGNAL as
well. None right now as far as I can tell, but the live patching is
switching to TIF_NOTIFY_SIGNAL as well, which would also cause issues
with coredumps potentially.
> Olivier, I sent a 5.10 version for Nathan, any chance you can test this
^^^^^^
Tony of course, my apologies.
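For illustration, a minimal userspace sketch of the five-step recipe Olivier outlines earlier in this exchange could look like the program below. It is untested, the file and variable names are mine, and it assumes liburing is installed. The child parks an io_uring read on a pipe, the parent triggers the core dump with SIGSEGV, and only then writes to the pipe so the completion (and its TWA_SIGNAL task_work) fires while the dump is being written.

/*
 * coredump_repro.c - hypothetical reproducer sketch, not taken from
 * the thread. Build with: gcc coredump_repro.c -luring -o coredump_repro
 */
#include <liburing.h>
#include <signal.h>
#include <stdio.h>
#include <sys/wait.h>
#include <unistd.h>

int main(void)
{
	int fds[2];
	pid_t pid;

	if (pipe(fds) < 0) {			/* step 1 */
		perror("pipe");
		return 1;
	}

	pid = fork();				/* step 2 */
	if (pid == 0) {				/* child: step 3 */
		struct io_uring ring;
		struct io_uring_cqe *cqe;
		struct io_uring_sqe *sqe;
		char buf[64];

		if (io_uring_queue_init(8, &ring, 0) < 0)
			return 1;
		sqe = io_uring_get_sqe(&ring);
		io_uring_prep_read(sqe, fds[0], buf, sizeof(buf), 0);
		io_uring_submit(&ring);
		io_uring_wait_cqe(&ring, &cqe);	/* parked here when killed */
		return 0;
	}

	sleep(1);			/* give the child time to arm the read */
	kill(pid, SIGSEGV);		/* step 4: start the core dump */
	write(fds[1], "x", 1);		/* step 5: complete the read mid-dump */
	waitpid(pid, NULL, 0);
	return 0;
}

On the unpatched kernels discussed here, the child's core file should come out truncated when the write races with the dump; with the fixes applied it should be complete.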
On 8/17/21 6:05 PM, Jens Axboe wrote: > On 8/17/21 3:39 PM, Tony Battersby wrote: >> On 8/17/21 5:28 PM, Jens Axboe wrote: >>> Another approach - don't allow TWA_SIGNAL task_work to get queued if >>> PF_SIGNALED has been set on the task. This is similar to how we reject >>> task_work_add() on process exit, and the callers must be able to handle >>> that already. >>> >>> Can you test this one on top of your 5.10-stable? >>> >>> >>> diff --git a/fs/coredump.c b/fs/coredump.c >>> index 07afb5ddb1c4..ca7c1ee44ada 100644 >>> --- a/fs/coredump.c >>> +++ b/fs/coredump.c >>> @@ -602,6 +602,14 @@ void do_coredump(const kernel_siginfo_t *siginfo) >>> .mm_flags = mm->flags, >>> }; >>> >>> + /* >>> + * task_work_add() will refuse to add work after PF_SIGNALED has >>> + * been set, ensure that we flush any pending TIF_NOTIFY_SIGNAL work >>> + * if any was queued before that. >>> + */ >>> + if (test_thread_flag(TIF_NOTIFY_SIGNAL)) >>> + tracehook_notify_signal(); >>> + >>> audit_core_dumps(siginfo->si_signo); >>> >>> binfmt = mm->binfmt; >>> diff --git a/kernel/task_work.c b/kernel/task_work.c >>> index 1698fbe6f0e1..1ab28904adc4 100644 >>> --- a/kernel/task_work.c >>> +++ b/kernel/task_work.c >>> @@ -41,6 +41,12 @@ int task_work_add(struct task_struct *task, struct callback_head *work, >>> head = READ_ONCE(task->task_works); >>> if (unlikely(head == &work_exited)) >>> return -ESRCH; >>> + /* >>> + * TIF_NOTIFY_SIGNAL notifications will interfere with >>> + * a core dump in progress, reject them. >>> + */ >>> + if ((task->flags & PF_SIGNALED) && notify == TWA_SIGNAL) >>> + return -ESRCH; >>> work->next = head; >>> } while (cmpxchg(&task->task_works, head, work) != head); >>> >>> >> Doesn't compile. 5.10 doesn't have TIF_NOTIFY_SIGNAL. > Oh right... Here's one hacked up for the 5.10 TWA_SIGNAL setup. Totally > untested... > > diff --git a/fs/coredump.c b/fs/coredump.c > index c6acfc694f65..9e899ce67589 100644 > --- a/fs/coredump.c > +++ b/fs/coredump.c > @@ -603,6 +603,19 @@ void do_coredump(const kernel_siginfo_t *siginfo) > .mm_flags = mm->flags, > }; > > + /* > + * task_work_add() will refuse to add work after PF_SIGNALED has > + * been set, ensure that we flush any pending TWA_SIGNAL work > + * if any was queued before that. > + */ > + if (signal_pending(current) && (current->jobctl & JOBCTL_TASK_WORK)) { > + task_work_run(); > + spin_lock_irq(&current->sighand->siglock); > + current->jobctl &= ~JOBCTL_TASK_WORK; > + recalc_sigpending(); > + spin_unlock_irq(&current->sighand->siglock); > + } > + > audit_core_dumps(siginfo->si_signo); > > binfmt = mm->binfmt; > diff --git a/kernel/task_work.c b/kernel/task_work.c > index 8d6e1217c451..93b3f262eb4a 100644 > --- a/kernel/task_work.c > +++ b/kernel/task_work.c > @@ -39,6 +39,12 @@ int task_work_add(struct task_struct *task, struct callback_head *work, > head = READ_ONCE(task->task_works); > if (unlikely(head == &work_exited)) > return -ESRCH; > + /* > + * TWA_SIGNAL notifications will interfere with > + * a core dump in progress, reject them. > + */ > + if ((task->flags & PF_SIGNALED) && notify == TWA_SIGNAL) > + return -ESRCH; > work->next = head; > } while (cmpxchg(&task->task_works, head, work) != head); > > Tested with 5.10.59 + backport 06af8679449d + the patch above. That fixes it for me. I tested a couple of variations to make sure. Thanks! Tested-by: Tony Battersby <tonyb@cybernetics.com>
On 8/18/21 8:37 AM, Tony Battersby wrote: > On 8/17/21 6:05 PM, Jens Axboe wrote: >> On 8/17/21 3:39 PM, Tony Battersby wrote: >>> On 8/17/21 5:28 PM, Jens Axboe wrote: >>>> Another approach - don't allow TWA_SIGNAL task_work to get queued if >>>> PF_SIGNALED has been set on the task. This is similar to how we reject >>>> task_work_add() on process exit, and the callers must be able to handle >>>> that already. >>>> >>>> Can you test this one on top of your 5.10-stable? >>>> >>>> >>>> diff --git a/fs/coredump.c b/fs/coredump.c >>>> index 07afb5ddb1c4..ca7c1ee44ada 100644 >>>> --- a/fs/coredump.c >>>> +++ b/fs/coredump.c >>>> @@ -602,6 +602,14 @@ void do_coredump(const kernel_siginfo_t *siginfo) >>>> .mm_flags = mm->flags, >>>> }; >>>> >>>> + /* >>>> + * task_work_add() will refuse to add work after PF_SIGNALED has >>>> + * been set, ensure that we flush any pending TIF_NOTIFY_SIGNAL work >>>> + * if any was queued before that. >>>> + */ >>>> + if (test_thread_flag(TIF_NOTIFY_SIGNAL)) >>>> + tracehook_notify_signal(); >>>> + >>>> audit_core_dumps(siginfo->si_signo); >>>> >>>> binfmt = mm->binfmt; >>>> diff --git a/kernel/task_work.c b/kernel/task_work.c >>>> index 1698fbe6f0e1..1ab28904adc4 100644 >>>> --- a/kernel/task_work.c >>>> +++ b/kernel/task_work.c >>>> @@ -41,6 +41,12 @@ int task_work_add(struct task_struct *task, struct callback_head *work, >>>> head = READ_ONCE(task->task_works); >>>> if (unlikely(head == &work_exited)) >>>> return -ESRCH; >>>> + /* >>>> + * TIF_NOTIFY_SIGNAL notifications will interfere with >>>> + * a core dump in progress, reject them. >>>> + */ >>>> + if ((task->flags & PF_SIGNALED) && notify == TWA_SIGNAL) >>>> + return -ESRCH; >>>> work->next = head; >>>> } while (cmpxchg(&task->task_works, head, work) != head); >>>> >>>> >>> Doesn't compile. 5.10 doesn't have TIF_NOTIFY_SIGNAL. >> Oh right... Here's one hacked up for the 5.10 TWA_SIGNAL setup. Totally >> untested... >> >> diff --git a/fs/coredump.c b/fs/coredump.c >> index c6acfc694f65..9e899ce67589 100644 >> --- a/fs/coredump.c >> +++ b/fs/coredump.c >> @@ -603,6 +603,19 @@ void do_coredump(const kernel_siginfo_t *siginfo) >> .mm_flags = mm->flags, >> }; >> >> + /* >> + * task_work_add() will refuse to add work after PF_SIGNALED has >> + * been set, ensure that we flush any pending TWA_SIGNAL work >> + * if any was queued before that. >> + */ >> + if (signal_pending(current) && (current->jobctl & JOBCTL_TASK_WORK)) { >> + task_work_run(); >> + spin_lock_irq(&current->sighand->siglock); >> + current->jobctl &= ~JOBCTL_TASK_WORK; >> + recalc_sigpending(); >> + spin_unlock_irq(&current->sighand->siglock); >> + } >> + >> audit_core_dumps(siginfo->si_signo); >> >> binfmt = mm->binfmt; >> diff --git a/kernel/task_work.c b/kernel/task_work.c >> index 8d6e1217c451..93b3f262eb4a 100644 >> --- a/kernel/task_work.c >> +++ b/kernel/task_work.c >> @@ -39,6 +39,12 @@ int task_work_add(struct task_struct *task, struct callback_head *work, >> head = READ_ONCE(task->task_works); >> if (unlikely(head == &work_exited)) >> return -ESRCH; >> + /* >> + * TWA_SIGNAL notifications will interfere with >> + * a core dump in progress, reject them. >> + */ >> + if ((task->flags & PF_SIGNALED) && notify == TWA_SIGNAL) >> + return -ESRCH; >> work->next = head; >> } while (cmpxchg(&task->task_works, head, work) != head); >> >> > Tested with 5.10.59 + backport 06af8679449d + the patch above. That > fixes it for me. I tested a couple of variations to make sure. > > Thanks!
> > Tested-by: Tony Battersby <tonyb@cybernetics.com> Great, thanks for testing! The 5.10 version is a bit uglier due to how TWA_SIGNAL used to work, but it's the most straightforward backport of the other version I sent.
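As an aside, the caller contract Jens leans on here ("the callers must be able to handle that already") is that task_work_add() already returns -ESRCH for an exiting task, and with the patch above it also does so for a core-dumping one. A purely hypothetical caller shape, not lifted from io_uring, just to show what handling that failure looks like:

#include <linux/sched.h>
#include <linux/task_work.h>

/*
 * Illustrative sketch only: queue a completion to the target task, and
 * fall back to some caller-specific path when the task can no longer
 * run task_work (exiting, or core-dumping with PF_SIGNALED set). What
 * the fallback does is up to the caller; invoking the callback directly
 * is shown purely as a placeholder.
 */
static void queue_or_fallback(struct task_struct *task,
			      struct callback_head *work)
{
	if (task_work_add(task, work, TWA_SIGNAL))
		work->func(work);
}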
On Tue, 2021-08-17 at 12:15 -0600, Jens Axboe wrote: > > It does indeed sound like it's TIF_NOTIFY_SIGNAL that will trigger > some > signal_pending() and cause an interruption of the core dump. Just out > of > curiosity, what is your /proc/sys/kernel/core_pattern set to? If it's > set to some piped process, can you try and set it to 'core' and see > if > that eliminates the truncation of the core dumps for your case? > /proc/sys/kernel/core_pattern is set to: |/home/lano1106/bin/pipe_core.sh %e %p It normally points to the systemd coredump module. I have pointed it to a simpler program for debugging purposes. When core_pattern points to a local file, core dump files are just fine. That was the whole point of commit 06af8679449d ("coredump: Limit what can interrupt coredumps"). I have been distracted by other things this week but my last attempt to fix this problem appears to be successful. I will send out a small patch set shortly...
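pipe_core.sh itself is not shown in the thread, but the mechanism is worth spelling out: when core_pattern starts with '|', the kernel spawns the named program and streams the core image to its stdin, expanding %e (executable name) and %p (pid) into arguments. A hypothetical minimal handler in C, just to illustrate where the truncation becomes visible, might be:

/*
 * core_catch.c - hypothetical stand-in for a pipe_core.sh-style
 * handler, assuming core_pattern = "|/usr/local/bin/core_catch %e %p".
 * A truncated dump simply shows up as a short stream on stdin.
 */
#include <stdio.h>

int main(int argc, char **argv)
{
	char path[256];
	char buf[65536];
	size_t n, total = 0;
	FILE *out;

	if (argc < 3)
		return 1;
	snprintf(path, sizeof(path), "/var/tmp/core.%s.%s", argv[1], argv[2]);
	out = fopen(path, "w");
	if (!out)
		return 1;
	while ((n = fread(buf, 1, sizeof(buf), stdin)) > 0)
		total += fwrite(buf, 1, n, out);
	fclose(out);
	fprintf(stderr, "core_catch: wrote %zu bytes to %s\n", total, path);
	return 0;
}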
On Tue, 2021-08-17 at 12:24 -0600, Jens Axboe wrote: > > And assuming that works, then I suspect this one would fix your issue > even with a piped core dump: > > diff --git a/fs/coredump.c b/fs/coredump.c > index 07afb5ddb1c4..852737a9ccbf 100644 > --- a/fs/coredump.c > +++ b/fs/coredump.c > @@ -41,6 +41,7 @@ > #include <linux/fs.h> > #include <linux/path.h> > #include <linux/timekeeping.h> > +#include <linux/io_uring.h> > > #include <linux/uaccess.h> > #include <asm/mmu_context.h> > @@ -603,6 +604,7 @@ void do_coredump(const kernel_siginfo_t *siginfo) > }; > > audit_core_dumps(siginfo->si_signo); > + io_uring_task_cancel(); > > binfmt = mm->binfmt; > if (!binfmt || !binfmt->core_dump) > That is what my patch does. The function call is inserted at a different place... I am not sure whether one location is better than the other, or if it matters at all, but there is an extra change required to make it work...

diff --git a/fs/coredump.c b/fs/coredump.c
index 07afb5ddb1c4..614fe7a54c1a 100644
--- a/fs/coredump.c
+++ b/fs/coredump.c
@@ -41,6 +41,7 @@
 #include <linux/fs.h>
 #include <linux/path.h>
 #include <linux/timekeeping.h>
+#include <linux/io_uring.h>
 
 #include <linux/uaccess.h>
 #include <asm/mmu_context.h>
@@ -625,6 +626,8 @@ void do_coredump(const kernel_siginfo_t *siginfo)
 		need_suid_safe = true;
 	}
 
+	io_uring_task_cancel();
+
 	retval = coredump_wait(siginfo->si_signo, &core_state);
 	if (retval < 0)
 		goto fail_creds;
On Tue, 2021-08-17 at 20:57 -0600, Jens Axboe wrote: > > Olivier, I sent a 5.10 version for Nathan, any chance you can test > this > one for the current kernels? Basically this one should work for > 5.11+, > and the later 5.10 version is just for 5.10. I'm going to send it out > separately for review. > > I do think this is the right solution, barring a tweak maybe on > testing > notify == TWA_SIGNAL first before digging into the task struct. But > the > principle is sound, and it'll work for other users of TWA_SIGNAL as > well. None right now as far as I can tell, but the live patching is > switching to TIF_NOTIFY_SIGNAL as well which will also cause issues > with > coredumps potentially. > Ok, I am going to give it a shot. This solution is probably superior to the previous attempt as it does not inject an io_uring dependency into the coredump module. The small extra change that I alluded to in my previous reply will still be relevant even if we go with your patch... I'll come back soon with the results of testing your patch and my small extra change that I keep teasing about. Greetings,
On Sat, 2021-08-21 at 06:08 -0400, Olivier Langlois wrote: > On Tue, 2021-08-17 at 20:57 -0600, Jens Axboe wrote: > > > > Olivier, I sent a 5.10 version for Nathan, any chance you can test > > this > > one for the current kernels? Basically this one should work for > > 5.11+, > > and the later 5.10 version is just for 5.10. I'm going to send it > > out > > separately for review. > > > > I do think this is the right solution, barring a tweak maybe on > > testing > > notify == TWA_SIGNAL first before digging into the task struct. But > > the > > principle is sound, and it'll work for other users of TWA_SIGNAL as > > well. None right now as far as I can tell, but the live patching is > > switching to TIF_NOTIFY_SIGNAL as well which will also cause issues > > with > > coredumps potentially. > > > Ok, I am going to give it a shot. This solution is probably superior > to > the previous attempt as it does not inject an io_uring dependency into > the > coredump module. > > The small extra change that I alluded to in my previous reply will > still be relevant even if we go with your patch... > > I'll come back soon with the results of testing your patch and my small extra > change that I keep teasing about. > > Greetings, > Jens, your patch doesn't compile with 5.12+. AFAIK, the reason is that JOBCTL_TASK_WORK is gone. Wouldn't just a call to tracehook_notify_signal from do_coredump be enough, and backward compatible with every possible stable branch? Greetings,
On 8/21/21 10:47 AM, Olivier Langlois wrote: > On Sat, 2021-08-21 at 06:08 -0400, Olivier Langlois wrote: >> On Tue, 2021-08-17 at 20:57 -0600, Jens Axboe wrote: >>> >>> Olivier, I sent a 5.10 version for Nathan, any chance you can test >>> this >>> one for the current kernels? Basically this one should work for >>> 5.11+, >>> and the later 5.10 version is just for 5.10. I'm going to send it >>> out >>> separately for review. >>> >>> I do think this is the right solution, barring a tweak maybe on >>> testing >>> notify == TWA_SIGNAL first before digging into the task struct. But >>> the >>> principle is sound, and it'll work for other users of TWA_SIGNAL as >>> well. None right now as far as I can tell, but the live patching is >>> switching to TIF_NOTIFY_SIGNAL as well which will also cause issues >>> with >>> coredumps potentially. >>> >> Ok, I am going to give it a shot. This solution is probably superior >> to >> the previous attempt as it does not inject an io_uring dependency into >> the >> coredump module. >> >> The small extra change that I alluded to in my previous reply will >> still be relevant even if we go with your patch... >> >> I'll come back soon with the results of testing your patch and my small extra >> change that I keep teasing about. >> >> Greetings, >> > Jens, > > your patch doesn't compile with 5.12+. AFAIK, the reason is that > JOBCTL_TASK_WORK is gone. > > Wouldn't just a call to tracehook_notify_signal from do_coredump be > enough, and backward compatible with every possible stable branch? That version is just for 5.10; the first one I posted is applicable to 5.11+
On Sat, 2021-08-21 at 10:51 -0600, Jens Axboe wrote: > On 8/21/21 10:47 AM, Olivier Langlois wrote: > > Jens, > > > > your patch doesn't compile with 5.12+. AFAIK, the reason is that > > JOBCTL_TASK_WORK is gone. > > > > Wouldn't just a call to tracehook_notify_signal from do_coredump be > > enough, and backward compatible with every possible stable branch? > > That version is just for 5.10; the first one I posted is applicable to > 5.11+ > Ok, in that case, I can tell you that it partially works. There is a small thing missing. I have tested mine, which I shared in https://lore.kernel.org/io-uring/87pmwt6biw.fsf@disp2133/T/#m3b51d44c12e39a06b82aac1d372df05312cff833 and, with another small patch added to it, it does work. I will offer my patch later today.
diff --git a/fs/coredump.c b/fs/coredump.c
index 2868e3e171ae..c3d8fc14b993 100644
--- a/fs/coredump.c
+++ b/fs/coredump.c
@@ -519,7 +519,7 @@ static bool dump_interrupted(void)
 	 * but then we need to teach dump_write() to restart and clear
 	 * TIF_SIGPENDING.
 	 */
-	return signal_pending(current);
+	return fatal_signal_pending(current) || freezing(current);
 }
 
 static void wait_for_dump_helpers(struct file *file)