Message ID | AM6PR03MB5170285B336790D3450E2644E4E40@AM6PR03MB5170.eurprd03.prod.outlook.com (mailing list archive) |
---|---|
State | New, archived |
Headers | show |
Series | [PATCHv5] exec: Fix a deadlock in ptrace | expand |
Bernd Edlinger <bernd.edlinger@hotmail.de> writes: > This fixes a deadlock in the tracer when tracing a multi-threaded > application that calls execve while more than one thread are running. > > I observed that when running strace on the gcc test suite, it always > blocks after a while, when expect calls execve, because other threads > have to be terminated. They send ptrace events, but the strace is no > longer able to respond, since it is blocked in vm_access. > > The deadlock is always happening when strace needs to access the > tracees process mmap, while another thread in the tracee starts to > execve a child process, but that cannot continue until the > PTRACE_EVENT_EXIT is handled and the WIFEXITED event is received: A couple of things. Why do we think it is safe to change the behavior exposed to userspace? Not the deadlock but all of the times the current code would not deadlock? Especially given that this is a small window it might be hard for people to track down and report so we need a strong argument that this won't break existing userspace before we just change things. Usually surveying all of the users of a system call that we can find and checking to see if they might be affected by the change in behavior is difficult enough that we usually opt for not being lazy and preserving the behavior. This patch is up to two changes in behavior now, that could potentially affect a whole array of programs. Adding linux-api so that this change in behavior can be documented if/when this change goes through. If you can split the documentation and test fixes out into separate patches that would help reviewing this code, or please make it explicit that the your are changing documentation about behavior that is changing with this patch. Eric > diff --git a/tools/testing/selftests/ptrace/vmaccess.c b/tools/testing/selftests/ptrace/vmaccess.c > new file mode 100644 > index 0000000..6d8a048 > --- /dev/null > +++ b/tools/testing/selftests/ptrace/vmaccess.c > @@ -0,0 +1,66 @@ > +// SPDX-License-Identifier: GPL-2.0+ > +/* > + * Copyright (c) 2020 Bernd Edlinger <bernd.edlinger@hotmail.de> > + * All rights reserved. > + * > + * Check whether /proc/$pid/mem can be accessed without causing deadlocks > + * when de_thread is blocked with ->cred_guard_mutex held. > + */ > + > +#include "../kselftest_harness.h" > +#include <stdio.h> > +#include <fcntl.h> > +#include <pthread.h> > +#include <signal.h> > +#include <unistd.h> > +#include <sys/ptrace.h> > + > +static void *thread(void *arg) > +{ > + ptrace(PTRACE_TRACEME, 0, 0L, 0L); > + return NULL; > +} > + > +TEST(vmaccess) > +{ > + int f, pid = fork(); > + char mm[64]; > + > + if (!pid) { > + pthread_t pt; > + > + pthread_create(&pt, NULL, thread, NULL); > + pthread_join(pt, NULL); > + execlp("true", "true", NULL); > + } > + > + sleep(1); > + sprintf(mm, "/proc/%d/mem", pid); > + f = open(mm, O_RDONLY); > + ASSERT_LE(0, f); > + close(f); > + f = kill(pid, SIGCONT); > + ASSERT_EQ(0, f); > +} > + > +TEST(attach) > +{ > + int f, pid = fork(); > + > + if (!pid) { > + pthread_t pt; > + > + pthread_create(&pt, NULL, thread, NULL); > + pthread_join(pt, NULL); > + execlp("true", "true", NULL); > + } > + > + sleep(1); > + f = ptrace(PTRACE_ATTACH, pid, 0L, 0L); To be meaningful this code needs to learn to loop when ptrace returns -EAGAIN. Because that is pretty much what any self respecting user space process will do. At which point I am not certain we can say that the behavior has sufficiently improved not to be a deadlock. > + ASSERT_EQ(EAGAIN, errno); > + ASSERT_EQ(f, -1); > + f = kill(pid, SIGCONT); > + ASSERT_EQ(0, f); > +} > + > +TEST_HARNESS_MAIN Eric
On 3/3/20 4:18 PM, Eric W. Biederman wrote: > Bernd Edlinger <bernd.edlinger@hotmail.de> writes: > >> This fixes a deadlock in the tracer when tracing a multi-threaded >> application that calls execve while more than one thread are running. >> >> I observed that when running strace on the gcc test suite, it always >> blocks after a while, when expect calls execve, because other threads >> have to be terminated. They send ptrace events, but the strace is no >> longer able to respond, since it is blocked in vm_access. >> >> The deadlock is always happening when strace needs to access the >> tracees process mmap, while another thread in the tracee starts to >> execve a child process, but that cannot continue until the >> PTRACE_EVENT_EXIT is handled and the WIFEXITED event is received: > > A couple of things. > > Why do we think it is safe to change the behavior exposed to userspace? > Not the deadlock but all of the times the current code would not > deadlock? > > Especially given that this is a small window it might be hard for people > to track down and report so we need a strong argument that this won't > break existing userspace before we just change things. > Hmm, I tend to agree. > Usually surveying all of the users of a system call that we can find > and checking to see if they might be affected by the change in behavior > is difficult enough that we usually opt for not being lazy and > preserving the behavior. > > This patch is up to two changes in behavior now, that could potentially > affect a whole array of programs. Adding linux-api so that this change > in behavior can be documented if/when this change goes through. > One is PTRACE_ACCESS possibly returning EAGAIN, yes. We could try to restrict that behavior change to when any thread is ptraced when execve starts, can't be too complicated. But the other is only SYS_seccomp returning EAGAIN, when a different thread of the current process is calling execve at the same time. I would consider it completely impossible to have any user-visual effect, since de_thread is just terminating all threads, including the thread where the -EAGAIN was returned, so we will never know what happened. > If you can split the documentation and test fixes out into separate > patches that would help reviewing this code, or please make it explicit > that the your are changing documentation about behavior that is changing > with this patch. > I am not sure if I have touched the right user documentation. I only saw a document referring to a non-existent "current->cred_replace_mutex" I haven't digged the git history, but that must be pre-historic IMHO. It appears to me that is some developer documentation, but it's nevertheless worth to keep up to date when the code changes. So where would I add the possibility for PTRACE_ATTACH to return -EAGAIN ? Bernd. > Eric > >> diff --git a/tools/testing/selftests/ptrace/vmaccess.c b/tools/testing/selftests/ptrace/vmaccess.c >> new file mode 100644 >> index 0000000..6d8a048 >> --- /dev/null >> +++ b/tools/testing/selftests/ptrace/vmaccess.c >> @@ -0,0 +1,66 @@ >> +// SPDX-License-Identifier: GPL-2.0+ >> +/* >> + * Copyright (c) 2020 Bernd Edlinger <bernd.edlinger@hotmail.de> >> + * All rights reserved. >> + * >> + * Check whether /proc/$pid/mem can be accessed without causing deadlocks >> + * when de_thread is blocked with ->cred_guard_mutex held. >> + */ >> + >> +#include "../kselftest_harness.h" >> +#include <stdio.h> >> +#include <fcntl.h> >> +#include <pthread.h> >> +#include <signal.h> >> +#include <unistd.h> >> +#include <sys/ptrace.h> >> + >> +static void *thread(void *arg) >> +{ >> + ptrace(PTRACE_TRACEME, 0, 0L, 0L); >> + return NULL; >> +} >> + >> +TEST(vmaccess) >> +{ >> + int f, pid = fork(); >> + char mm[64]; >> + >> + if (!pid) { >> + pthread_t pt; >> + >> + pthread_create(&pt, NULL, thread, NULL); >> + pthread_join(pt, NULL); >> + execlp("true", "true", NULL); >> + } >> + >> + sleep(1); >> + sprintf(mm, "/proc/%d/mem", pid); >> + f = open(mm, O_RDONLY); >> + ASSERT_LE(0, f); >> + close(f); >> + f = kill(pid, SIGCONT); >> + ASSERT_EQ(0, f); >> +} >> + >> +TEST(attach) >> +{ >> + int f, pid = fork(); >> + >> + if (!pid) { >> + pthread_t pt; >> + >> + pthread_create(&pt, NULL, thread, NULL); >> + pthread_join(pt, NULL); >> + execlp("true", "true", NULL); >> + } >> + >> + sleep(1); >> + f = ptrace(PTRACE_ATTACH, pid, 0L, 0L); > > To be meaningful this code needs to learn to loop when > ptrace returns -EAGAIN. > > Because that is pretty much what any self respecting user space > process will do. > > At which point I am not certain we can say that the behavior has > sufficiently improved not to be a deadlock. > In this special dead-duck test it won't work, but it would still be lots more transparent what is going on, since previously you had two zombie process, and no way to even output debug messages, which also all self respecting user space processes should do. So yes, I can at least give a good example and re-try it several times together with wait4 which a tracer is expected to do. Bernd. >> + ASSERT_EQ(EAGAIN, errno); >> + ASSERT_EQ(f, -1); >> + f = kill(pid, SIGCONT); >> + ASSERT_EQ(0, f); >> +} >> + >> +TEST_HARNESS_MAIN > > Eric >
On Tue, Mar 03, 2020 at 09:18:44AM -0600, Eric W. Biederman wrote: > Bernd Edlinger <bernd.edlinger@hotmail.de> writes: > > > This fixes a deadlock in the tracer when tracing a multi-threaded > > application that calls execve while more than one thread are running. > > > > I observed that when running strace on the gcc test suite, it always > > blocks after a while, when expect calls execve, because other threads > > have to be terminated. They send ptrace events, but the strace is no > > longer able to respond, since it is blocked in vm_access. > > > > The deadlock is always happening when strace needs to access the > > tracees process mmap, while another thread in the tracee starts to > > execve a child process, but that cannot continue until the > > PTRACE_EVENT_EXIT is handled and the WIFEXITED event is received: > > A couple of things. > > Why do we think it is safe to change the behavior exposed to userspace? > Not the deadlock but all of the times the current code would not > deadlock? > > Especially given that this is a small window it might be hard for people > to track down and report so we need a strong argument that this won't > break existing userspace before we just change things. > > Usually surveying all of the users of a system call that we can find > and checking to see if they might be affected by the change in behavior > is difficult enough that we usually opt for not being lazy and > preserving the behavior. > > This patch is up to two changes in behavior now, that could potentially > affect a whole array of programs. Adding linux-api so that this change > in behavior can be documented if/when this change goes through. > > If you can split the documentation and test fixes out into separate > patches that would help reviewing this code, or please make it explicit > that the your are changing documentation about behavior that is changing > with this patch. Agreed. I think it'd be good to do it in three patches: 1. unrelated documentation update 2. fix + documentation changes specific to the fix 3. test(s) Christian
On Tue, Mar 03, 2020 at 04:48:01PM +0000, Bernd Edlinger wrote: > On 3/3/20 4:18 PM, Eric W. Biederman wrote: > > Bernd Edlinger <bernd.edlinger@hotmail.de> writes: > > > >> This fixes a deadlock in the tracer when tracing a multi-threaded > >> application that calls execve while more than one thread are running. > >> > >> I observed that when running strace on the gcc test suite, it always > >> blocks after a while, when expect calls execve, because other threads > >> have to be terminated. They send ptrace events, but the strace is no > >> longer able to respond, since it is blocked in vm_access. > >> > >> The deadlock is always happening when strace needs to access the > >> tracees process mmap, while another thread in the tracee starts to > >> execve a child process, but that cannot continue until the > >> PTRACE_EVENT_EXIT is handled and the WIFEXITED event is received: > > > > A couple of things. > > > > Why do we think it is safe to change the behavior exposed to userspace? > > Not the deadlock but all of the times the current code would not > > deadlock? > > > > Especially given that this is a small window it might be hard for people > > to track down and report so we need a strong argument that this won't > > break existing userspace before we just change things. > > > > Hmm, I tend to agree. > > > Usually surveying all of the users of a system call that we can find > > and checking to see if they might be affected by the change in behavior > > is difficult enough that we usually opt for not being lazy and > > preserving the behavior. > > > > This patch is up to two changes in behavior now, that could potentially > > affect a whole array of programs. Adding linux-api so that this change > > in behavior can be documented if/when this change goes through. > > > > One is PTRACE_ACCESS possibly returning EAGAIN, yes. > > We could try to restrict that behavior change to when any > thread is ptraced when execve starts, can't be too complicated. > > > But the other is only SYS_seccomp returning EAGAIN, when a different > thread of the current process is calling execve at the same time. > > I would consider it completely impossible to have any user-visual effect, > since de_thread is just terminating all threads, including the thread > where the -EAGAIN was returned, so we will never know what happened. I think if we risk a user-space facing change we should try the simple thing first before making the fix more convoluted? But it's a tough call... > > > > If you can split the documentation and test fixes out into separate > > patches that would help reviewing this code, or please make it explicit > > that the your are changing documentation about behavior that is changing > > with this patch. > > > > I am not sure if I have touched the right user documentation. > > I only saw a document referring to a non-existent "current->cred_replace_mutex" > I haven't digged the git history, but that must be pre-historic IMHO. > It appears to me that is some developer documentation, but it's nevertheless > worth to keep up to date when the code changes. > > So where would I add the possibility for PTRACE_ATTACH to return -EAGAIN ? Since that would be a potentially user-visible change it would make the most sense to add it to man ptrace(2) if/when we land this change. For developers, placing a comment in kernel/ptrace.c:ptrace_attach() would make the most sense? We already have something about exec protection in there. Christian
On Tue, Mar 03, 2020 at 06:01:11PM +0100, Christian Brauner wrote: > On Tue, Mar 03, 2020 at 04:48:01PM +0000, Bernd Edlinger wrote: > > On 3/3/20 4:18 PM, Eric W. Biederman wrote: > > > Bernd Edlinger <bernd.edlinger@hotmail.de> writes: > > > > > >> This fixes a deadlock in the tracer when tracing a multi-threaded > > >> application that calls execve while more than one thread are running. > > >> > > >> I observed that when running strace on the gcc test suite, it always > > >> blocks after a while, when expect calls execve, because other threads > > >> have to be terminated. They send ptrace events, but the strace is no > > >> longer able to respond, since it is blocked in vm_access. > > >> > > >> The deadlock is always happening when strace needs to access the > > >> tracees process mmap, while another thread in the tracee starts to > > >> execve a child process, but that cannot continue until the > > >> PTRACE_EVENT_EXIT is handled and the WIFEXITED event is received: > > > > > > A couple of things. > > > > > > Why do we think it is safe to change the behavior exposed to userspace? > > > Not the deadlock but all of the times the current code would not > > > deadlock? > > > > > > Especially given that this is a small window it might be hard for people > > > to track down and report so we need a strong argument that this won't > > > break existing userspace before we just change things. > > > > > > > Hmm, I tend to agree. > > > > > Usually surveying all of the users of a system call that we can find > > > and checking to see if they might be affected by the change in behavior > > > is difficult enough that we usually opt for not being lazy and > > > preserving the behavior. > > > > > > This patch is up to two changes in behavior now, that could potentially > > > affect a whole array of programs. Adding linux-api so that this change > > > in behavior can be documented if/when this change goes through. > > > > > > > One is PTRACE_ACCESS possibly returning EAGAIN, yes. > > > > We could try to restrict that behavior change to when any > > thread is ptraced when execve starts, can't be too complicated. > > > > > > But the other is only SYS_seccomp returning EAGAIN, when a different > > thread of the current process is calling execve at the same time. > > > > I would consider it completely impossible to have any user-visual effect, > > since de_thread is just terminating all threads, including the thread > > where the -EAGAIN was returned, so we will never know what happened. > > I think if we risk a user-space facing change we should try the simple > thing first before making the fix more convoluted? But it's a tough > call... Actually, to get a _rough_ estimate of the possible impact I would recommend you run the criu test suite (and possible the strace test-suite) on a kernel with and without your fix. That's what I tend to do when I touch code I fear will have impact on APIs that very deeply touch core kernel. Criu's test-suite makes heavy use of ptrace and usually runs into a bunch of interesting (exec) races too, and does have tests for handling zombies processes etc. pp. Should be relatively simple: create a vm and then criu build-dependencies, git clone criu; cd criu; make; cd test; ./zdtm.py run -a --keep-going If your system doesn't support Selinux properly, you need to disable it when running the tests and you also need to make sure that you're using python3 or change the shebang in zdtm.py to python3. Just a recommendation. Christian
Bernd Edlinger <bernd.edlinger@hotmail.de> writes: > On 3/3/20 4:18 PM, Eric W. Biederman wrote: >> Bernd Edlinger <bernd.edlinger@hotmail.de> writes: >>> diff --git a/tools/testing/selftests/ptrace/vmaccess.c b/tools/testing/selftests/ptrace/vmaccess.c >>> new file mode 100644 >>> index 0000000..6d8a048 >>> --- /dev/null >>> +++ b/tools/testing/selftests/ptrace/vmaccess.c >>> @@ -0,0 +1,66 @@ >>> +// SPDX-License-Identifier: GPL-2.0+ >>> +/* >>> + * Copyright (c) 2020 Bernd Edlinger <bernd.edlinger@hotmail.de> >>> + * All rights reserved. >>> + * >>> + * Check whether /proc/$pid/mem can be accessed without causing deadlocks >>> + * when de_thread is blocked with ->cred_guard_mutex held. >>> + */ >>> + >>> +#include "../kselftest_harness.h" >>> +#include <stdio.h> >>> +#include <fcntl.h> >>> +#include <pthread.h> >>> +#include <signal.h> >>> +#include <unistd.h> >>> +#include <sys/ptrace.h> >>> + >>> +static void *thread(void *arg) >>> +{ >>> + ptrace(PTRACE_TRACEME, 0, 0L, 0L); >>> + return NULL; >>> +} >>> + >>> +TEST(vmaccess) >>> +{ >>> + int f, pid = fork(); >>> + char mm[64]; >>> + >>> + if (!pid) { >>> + pthread_t pt; >>> + >>> + pthread_create(&pt, NULL, thread, NULL); >>> + pthread_join(pt, NULL); >>> + execlp("true", "true", NULL); >>> + } >>> + >>> + sleep(1); >>> + sprintf(mm, "/proc/%d/mem", pid); >>> + f = open(mm, O_RDONLY); >>> + ASSERT_LE(0, f); >>> + close(f); >>> + f = kill(pid, SIGCONT); >>> + ASSERT_EQ(0, f); >>> +} >>> + >>> +TEST(attach) >>> +{ >>> + int f, pid = fork(); >>> + >>> + if (!pid) { >>> + pthread_t pt; >>> + >>> + pthread_create(&pt, NULL, thread, NULL); >>> + pthread_join(pt, NULL); >>> + execlp("true", "true", NULL); >>> + } >>> + >>> + sleep(1); >>> + f = ptrace(PTRACE_ATTACH, pid, 0L, 0L); >> >> To be meaningful this code needs to learn to loop when >> ptrace returns -EAGAIN. >> >> Because that is pretty much what any self respecting user space >> process will do. >> >> At which point I am not certain we can say that the behavior has >> sufficiently improved not to be a deadlock. >> > > In this special dead-duck test it won't work, but it would > still be lots more transparent what is going on, since previously > you had two zombie process, and no way to even output debug > messages, which also all self respecting user space processes > should do. Agreed it is more transparent. So if you are going to deadlock it is better. My previous proposal (which I admit is more work to implement) would actually allow succeeding in this case and so it would not be subject to a dead lock (even via -EGAIN) at this point. > So yes, I can at least give a good example and re-try it several > times together with wait4 which a tracer is expected to do. Thank you, Eric
On 3/3/20 9:08 PM, Eric W. Biederman wrote: > Bernd Edlinger <bernd.edlinger@hotmail.de> writes: > >> On 3/3/20 4:18 PM, Eric W. Biederman wrote: >>> Bernd Edlinger <bernd.edlinger@hotmail.de> writes: >>>> diff --git a/tools/testing/selftests/ptrace/vmaccess.c b/tools/testing/selftests/ptrace/vmaccess.c >>>> new file mode 100644 >>>> index 0000000..6d8a048 >>>> --- /dev/null >>>> +++ b/tools/testing/selftests/ptrace/vmaccess.c >>>> @@ -0,0 +1,66 @@ >>>> +// SPDX-License-Identifier: GPL-2.0+ >>>> +/* >>>> + * Copyright (c) 2020 Bernd Edlinger <bernd.edlinger@hotmail.de> >>>> + * All rights reserved. >>>> + * >>>> + * Check whether /proc/$pid/mem can be accessed without causing deadlocks >>>> + * when de_thread is blocked with ->cred_guard_mutex held. >>>> + */ >>>> + >>>> +#include "../kselftest_harness.h" >>>> +#include <stdio.h> >>>> +#include <fcntl.h> >>>> +#include <pthread.h> >>>> +#include <signal.h> >>>> +#include <unistd.h> >>>> +#include <sys/ptrace.h> >>>> + >>>> +static void *thread(void *arg) >>>> +{ >>>> + ptrace(PTRACE_TRACEME, 0, 0L, 0L); >>>> + return NULL; >>>> +} >>>> + >>>> +TEST(vmaccess) >>>> +{ >>>> + int f, pid = fork(); >>>> + char mm[64]; >>>> + >>>> + if (!pid) { >>>> + pthread_t pt; >>>> + >>>> + pthread_create(&pt, NULL, thread, NULL); >>>> + pthread_join(pt, NULL); >>>> + execlp("true", "true", NULL); >>>> + } >>>> + >>>> + sleep(1); >>>> + sprintf(mm, "/proc/%d/mem", pid); >>>> + f = open(mm, O_RDONLY); >>>> + ASSERT_LE(0, f); >>>> + close(f); >>>> + f = kill(pid, SIGCONT); >>>> + ASSERT_EQ(0, f); >>>> +} >>>> + >>>> +TEST(attach) >>>> +{ >>>> + int f, pid = fork(); >>>> + >>>> + if (!pid) { >>>> + pthread_t pt; >>>> + >>>> + pthread_create(&pt, NULL, thread, NULL); >>>> + pthread_join(pt, NULL); >>>> + execlp("true", "true", NULL); >>>> + } >>>> + >>>> + sleep(1); >>>> + f = ptrace(PTRACE_ATTACH, pid, 0L, 0L); >>> >>> To be meaningful this code needs to learn to loop when >>> ptrace returns -EAGAIN. >>> >>> Because that is pretty much what any self respecting user space >>> process will do. >>> >>> At which point I am not certain we can say that the behavior has >>> sufficiently improved not to be a deadlock. >>> >> >> In this special dead-duck test it won't work, but it would >> still be lots more transparent what is going on, since previously >> you had two zombie process, and no way to even output debug >> messages, which also all self respecting user space processes >> should do. > > Agreed it is more transparent. So if you are going to deadlock > it is better. > > My previous proposal (which I admit is more work to implement) would > actually allow succeeding in this case and so it would not be subject to > a dead lock (even via -EGAIN) at this point. > >> So yes, I can at least give a good example and re-try it several >> times together with wait4 which a tracer is expected to do. > > Thank you, > > Eric > Okay, I think it can be done with minimal API changes, but it needs two mutexes, one that guards the execve, and one that guards only the credentials. If no traced sibling thread exists, the mutexes are used this way: lock(exec_guard_mutex) cred_locked_in_execve = true; de_thread() lock(cred_guard_mutex) unlock(cred_guard_mutex) cred_locked_in_execve = false; unlock(exec_guard_mutex) so effectively no API change at all. If a traced sibling thread exists, the mutexes are used differently: lock(exec_guard_mutex) cred_locked_in_execve = true; unlock(exec_guard_mutex) de_thread() lock(cred_guard_mutex) unlock(cred_guard_mutex) lock(exec_guard_mutex) cred_locked_in_execve = false; unlock(exec_guard_mutex) Only the case changes that would deadlock anyway. Bernd.
Bernd Edlinger <bernd.edlinger@hotmail.de> writes: > On 3/3/20 9:08 PM, Eric W. Biederman wrote: >> Bernd Edlinger <bernd.edlinger@hotmail.de> writes: >> >>> On 3/3/20 4:18 PM, Eric W. Biederman wrote: >>>> Bernd Edlinger <bernd.edlinger@hotmail.de> writes: >>>>> diff --git a/tools/testing/selftests/ptrace/vmaccess.c b/tools/testing/selftests/ptrace/vmaccess.c >>>>> new file mode 100644 >>>>> index 0000000..6d8a048 >>>>> --- /dev/null >>>>> +++ b/tools/testing/selftests/ptrace/vmaccess.c >>>>> @@ -0,0 +1,66 @@ >>>>> +// SPDX-License-Identifier: GPL-2.0+ >>>>> +/* >>>>> + * Copyright (c) 2020 Bernd Edlinger <bernd.edlinger@hotmail.de> >>>>> + * All rights reserved. >>>>> + * >>>>> + * Check whether /proc/$pid/mem can be accessed without causing deadlocks >>>>> + * when de_thread is blocked with ->cred_guard_mutex held. >>>>> + */ >>>>> + >>>>> +#include "../kselftest_harness.h" >>>>> +#include <stdio.h> >>>>> +#include <fcntl.h> >>>>> +#include <pthread.h> >>>>> +#include <signal.h> >>>>> +#include <unistd.h> >>>>> +#include <sys/ptrace.h> >>>>> + >>>>> +static void *thread(void *arg) >>>>> +{ >>>>> + ptrace(PTRACE_TRACEME, 0, 0L, 0L); >>>>> + return NULL; >>>>> +} >>>>> + >>>>> +TEST(vmaccess) >>>>> +{ >>>>> + int f, pid = fork(); >>>>> + char mm[64]; >>>>> + >>>>> + if (!pid) { >>>>> + pthread_t pt; >>>>> + >>>>> + pthread_create(&pt, NULL, thread, NULL); >>>>> + pthread_join(pt, NULL); >>>>> + execlp("true", "true", NULL); >>>>> + } >>>>> + >>>>> + sleep(1); >>>>> + sprintf(mm, "/proc/%d/mem", pid); >>>>> + f = open(mm, O_RDONLY); >>>>> + ASSERT_LE(0, f); >>>>> + close(f); >>>>> + f = kill(pid, SIGCONT); >>>>> + ASSERT_EQ(0, f); >>>>> +} >>>>> + >>>>> +TEST(attach) >>>>> +{ >>>>> + int f, pid = fork(); >>>>> + >>>>> + if (!pid) { >>>>> + pthread_t pt; >>>>> + >>>>> + pthread_create(&pt, NULL, thread, NULL); >>>>> + pthread_join(pt, NULL); >>>>> + execlp("true", "true", NULL); >>>>> + } >>>>> + >>>>> + sleep(1); >>>>> + f = ptrace(PTRACE_ATTACH, pid, 0L, 0L); >>>> >>>> To be meaningful this code needs to learn to loop when >>>> ptrace returns -EAGAIN. >>>> >>>> Because that is pretty much what any self respecting user space >>>> process will do. >>>> >>>> At which point I am not certain we can say that the behavior has >>>> sufficiently improved not to be a deadlock. >>>> >>> >>> In this special dead-duck test it won't work, but it would >>> still be lots more transparent what is going on, since previously >>> you had two zombie process, and no way to even output debug >>> messages, which also all self respecting user space processes >>> should do. >> >> Agreed it is more transparent. So if you are going to deadlock >> it is better. >> >> My previous proposal (which I admit is more work to implement) would >> actually allow succeeding in this case and so it would not be subject to >> a dead lock (even via -EGAIN) at this point. >> >>> So yes, I can at least give a good example and re-try it several >>> times together with wait4 which a tracer is expected to do. >> >> Thank you, >> >> Eric >> > > Okay, I think it can be done with minimal API changes, > but it needs two mutexes, one that guards the execve, > and one that guards only the credentials. > > If no traced sibling thread exists, the mutexes are used this way: > lock(exec_guard_mutex) > cred_locked_in_execve = true; > de_thread() > lock(cred_guard_mutex) > unlock(cred_guard_mutex) > cred_locked_in_execve = false; > unlock(exec_guard_mutex) > > so effectively no API change at all. > > If a traced sibling thread exists, the mutexes are used differently: > lock(exec_guard_mutex) > cred_locked_in_execve = true; > unlock(exec_guard_mutex) > de_thread() > lock(cred_guard_mutex) > unlock(cred_guard_mutex) > lock(exec_guard_mutex) > cred_locked_in_execve = false; > unlock(exec_guard_mutex) > > Only the case changes that would deadlock anyway. Let me propose a slight alternative that I think sets us up for long term success. Leave cred_guard_mutex as is, but declare it undesirable. The cred_guard_mutex as designed really is something we should get rid of. As it it can sleep over several different userspace accesses. The copying of the exec arguments is technically as prone to deadlock as the ptrace case. Add a new mutex with a better name perhaps "exec_change_mutex" that is used to guard the changes that exec makes to a process. Then we gradually shift all the cred_guard_mutex users over to the new mutex. AKA one patch per user of cred_guard_mutex. At each patch that shifts things over we will have the opportunity to review the code to see that there no funny dependencies that were missed. I will sign up for working on the no_new_privs and ptrace_attach cases as I think I can make those happen. Especially no_new_privs. Getting the easier cases will resolve your issues and put things on a better footing. Eric
On 3/4/20 5:33 PM, Eric W. Biederman wrote: > Bernd Edlinger <bernd.edlinger@hotmail.de> writes: > >> On 3/3/20 9:08 PM, Eric W. Biederman wrote: >>> Bernd Edlinger <bernd.edlinger@hotmail.de> writes: >>> >>>> On 3/3/20 4:18 PM, Eric W. Biederman wrote: >>>>> Bernd Edlinger <bernd.edlinger@hotmail.de> writes: >>>>>> diff --git a/tools/testing/selftests/ptrace/vmaccess.c b/tools/testing/selftests/ptrace/vmaccess.c >>>>>> new file mode 100644 >>>>>> index 0000000..6d8a048 >>>>>> --- /dev/null >>>>>> +++ b/tools/testing/selftests/ptrace/vmaccess.c >>>>>> @@ -0,0 +1,66 @@ >>>>>> +// SPDX-License-Identifier: GPL-2.0+ >>>>>> +/* >>>>>> + * Copyright (c) 2020 Bernd Edlinger <bernd.edlinger@hotmail.de> >>>>>> + * All rights reserved. >>>>>> + * >>>>>> + * Check whether /proc/$pid/mem can be accessed without causing deadlocks >>>>>> + * when de_thread is blocked with ->cred_guard_mutex held. >>>>>> + */ >>>>>> + >>>>>> +#include "../kselftest_harness.h" >>>>>> +#include <stdio.h> >>>>>> +#include <fcntl.h> >>>>>> +#include <pthread.h> >>>>>> +#include <signal.h> >>>>>> +#include <unistd.h> >>>>>> +#include <sys/ptrace.h> >>>>>> + >>>>>> +static void *thread(void *arg) >>>>>> +{ >>>>>> + ptrace(PTRACE_TRACEME, 0, 0L, 0L); >>>>>> + return NULL; >>>>>> +} >>>>>> + >>>>>> +TEST(vmaccess) >>>>>> +{ >>>>>> + int f, pid = fork(); >>>>>> + char mm[64]; >>>>>> + >>>>>> + if (!pid) { >>>>>> + pthread_t pt; >>>>>> + >>>>>> + pthread_create(&pt, NULL, thread, NULL); >>>>>> + pthread_join(pt, NULL); >>>>>> + execlp("true", "true", NULL); >>>>>> + } >>>>>> + >>>>>> + sleep(1); >>>>>> + sprintf(mm, "/proc/%d/mem", pid); >>>>>> + f = open(mm, O_RDONLY); >>>>>> + ASSERT_LE(0, f); >>>>>> + close(f); >>>>>> + f = kill(pid, SIGCONT); >>>>>> + ASSERT_EQ(0, f); >>>>>> +} >>>>>> + >>>>>> +TEST(attach) >>>>>> +{ >>>>>> + int f, pid = fork(); >>>>>> + >>>>>> + if (!pid) { >>>>>> + pthread_t pt; >>>>>> + >>>>>> + pthread_create(&pt, NULL, thread, NULL); >>>>>> + pthread_join(pt, NULL); >>>>>> + execlp("true", "true", NULL); >>>>>> + } >>>>>> + >>>>>> + sleep(1); >>>>>> + f = ptrace(PTRACE_ATTACH, pid, 0L, 0L); >>>>> >>>>> To be meaningful this code needs to learn to loop when >>>>> ptrace returns -EAGAIN. >>>>> >>>>> Because that is pretty much what any self respecting user space >>>>> process will do. >>>>> >>>>> At which point I am not certain we can say that the behavior has >>>>> sufficiently improved not to be a deadlock. >>>>> >>>> >>>> In this special dead-duck test it won't work, but it would >>>> still be lots more transparent what is going on, since previously >>>> you had two zombie process, and no way to even output debug >>>> messages, which also all self respecting user space processes >>>> should do. >>> >>> Agreed it is more transparent. So if you are going to deadlock >>> it is better. >>> >>> My previous proposal (which I admit is more work to implement) would >>> actually allow succeeding in this case and so it would not be subject to >>> a dead lock (even via -EGAIN) at this point. >>> >>>> So yes, I can at least give a good example and re-try it several >>>> times together with wait4 which a tracer is expected to do. >>> >>> Thank you, >>> >>> Eric >>> >> >> Okay, I think it can be done with minimal API changes, >> but it needs two mutexes, one that guards the execve, >> and one that guards only the credentials. >> >> If no traced sibling thread exists, the mutexes are used this way: >> lock(exec_guard_mutex) >> cred_locked_in_execve = true; >> de_thread() >> lock(cred_guard_mutex) >> unlock(cred_guard_mutex) >> cred_locked_in_execve = false; >> unlock(exec_guard_mutex) >> >> so effectively no API change at all. >> >> If a traced sibling thread exists, the mutexes are used differently: >> lock(exec_guard_mutex) >> cred_locked_in_execve = true; >> unlock(exec_guard_mutex) >> de_thread() >> lock(cred_guard_mutex) >> unlock(cred_guard_mutex) >> lock(exec_guard_mutex) >> cred_locked_in_execve = false; >> unlock(exec_guard_mutex) >> >> Only the case changes that would deadlock anyway. > > > Let me propose a slight alternative that I think sets us up for long > term success. > > Leave cred_guard_mutex as is, but declare it undesirable. The > cred_guard_mutex as designed really is something we should get rid of. > As it it can sleep over several different userspace accesses. The > copying of the exec arguments is technically as prone to deadlock as the > ptrace case. > > Add a new mutex with a better name perhaps "exec_change_mutex" that is > used to guard the changes that exec makes to a process. > > Then we gradually shift all the cred_guard_mutex users over to the new > mutex. AKA one patch per user of cred_guard_mutex. At each patch that > shifts things over we will have the opportunity to review the code to > see that there no funny dependencies that were missed. > > I will sign up for working on the no_new_privs and ptrace_attach cases > as I think I can make those happen. Especially no_new_privs. > > Getting the easier cases will resolve your issues and put things on a > better footing. > > Eric > Okay, however I think we will need two mutexes in the long term. So currently I have reduced the cred_guard_mutex to protect just the loading of the executable code in the process vm, since that is what works for vm_access, (one of the test cases). And another mutex that protects the whole execve function, that is need for ptrace, (and seccomp). But I have only a test case for ptrace. If I understand that right, I should not recycle cred_guard_mutex but leave it as is, and create two additional mutexes which will take over step by step. Sounds reasonable, indeed. I will send an update (v6) what I have right now, but just for information, so you can see how my minimal API-Change approach works. Bernd.
diff --git a/Documentation/security/credentials.rst b/Documentation/security/credentials.rst index 282e79f..0988798 100644 --- a/Documentation/security/credentials.rst +++ b/Documentation/security/credentials.rst @@ -437,9 +437,14 @@ new set of credentials by calling:: struct cred *prepare_creds(void); -this locks current->cred_replace_mutex and then allocates and constructs a -duplicate of the current process's credentials, returning with the mutex still -held if successful. It returns NULL if not successful (out of memory). +this allocates and constructs a duplicate of the current process's credentials. +It returns NULL if not successful (out of memory). + +If called from __do_execve_file, the mutex current->signal->cred_guard_mutex +is acquired before this function gets called, and released after setting +current->signal->cred_locked_in_execve. The same mutex is acquired later, +while the credentials and the process mmap are actually changed, and +current->signal->cred_locked_in_execve is reset again. The mutex prevents ``ptrace()`` from altering the ptrace state of a process while security checks on credentials construction and changing is taking place @@ -466,9 +471,8 @@ by calling:: This will alter various aspects of the credentials and the process, giving the LSM a chance to do likewise, then it will use ``rcu_assign_pointer()`` to -actually commit the new credentials to ``current->cred``, it will release -``current->cred_replace_mutex`` to allow ``ptrace()`` to take place, and it -will notify the scheduler and others of the changes. +actually commit the new credentials to ``current->cred``, and it will notify +the scheduler and others of the changes. This function is guaranteed to return 0, so that it can be tail-called at the end of such functions as ``sys_setresuid()``. @@ -486,8 +490,7 @@ invoked:: void abort_creds(struct cred *new); -This releases the lock on ``current->cred_replace_mutex`` that -``prepare_creds()`` got and then releases the new credentials. +This releases the new credentials. A typical credentials alteration function would look something like this:: diff --git a/fs/exec.c b/fs/exec.c index 74d88da..5fc744e 100644 --- a/fs/exec.c +++ b/fs/exec.c @@ -1266,6 +1266,12 @@ int flush_old_exec(struct linux_binprm * bprm) if (retval) goto out; + retval = mutex_lock_killable(¤t->signal->cred_guard_mutex); + if (retval) + goto out; + + bprm->called_flush_old_exec = 1; + /* * Must be called _before_ exec_mmap() as bprm->mm is * not visibile until then. This also enables the update @@ -1398,29 +1404,51 @@ void finalize_exec(struct linux_binprm *bprm) EXPORT_SYMBOL(finalize_exec); /* - * Prepare credentials and lock ->cred_guard_mutex. + * Prepare credentials and set ->cred_locked_in_execve. * install_exec_creds() commits the new creds and drops the lock. * Or, if exec fails before, free_bprm() should release ->cred and * and unlock. */ static int prepare_bprm_creds(struct linux_binprm *bprm) { + int ret; + if (mutex_lock_interruptible(¤t->signal->cred_guard_mutex)) return -ERESTARTNOINTR; + ret = -EAGAIN; + if (unlikely(current->signal->cred_locked_in_execve)) + goto out; + + ret = -ENOMEM; bprm->cred = prepare_exec_creds(); - if (likely(bprm->cred)) - return 0; + if (likely(bprm->cred)) { + current->signal->cred_locked_in_execve = true; + ret = 0; + } +out: mutex_unlock(¤t->signal->cred_guard_mutex); - return -ENOMEM; + return ret; } static void free_bprm(struct linux_binprm *bprm) { free_arg_pages(bprm); if (bprm->cred) { - mutex_unlock(¤t->signal->cred_guard_mutex); + /* + * If flush_old_exec did not acquire the cred_guard_mutex, + * try again here, but if that fails, just leave + * cred_locked_in_execve alone, since this means there + * must be a fatal signal pending. + * We don't want to prevent this task to be killed, just + * because it is stuck in the middle of execve. + */ + if (bprm->called_flush_old_exec || + !mutex_lock_killable(¤t->signal->cred_guard_mutex)) { + current->signal->cred_locked_in_execve = false; + mutex_unlock(¤t->signal->cred_guard_mutex); + } abort_creds(bprm->cred); } if (bprm->file) { @@ -1469,13 +1497,14 @@ void install_exec_creds(struct linux_binprm *bprm) * credentials; any time after this it may be unlocked. */ security_bprm_committed_creds(bprm); + current->signal->cred_locked_in_execve = false; mutex_unlock(¤t->signal->cred_guard_mutex); } EXPORT_SYMBOL(install_exec_creds); /* * determine how safe it is to execute the proposed program - * - the caller must hold ->cred_guard_mutex to protect against + * - the caller must have set ->cred_locked_in_execve to protect against * PTRACE_ATTACH or seccomp thread-sync */ static void check_unsafe_exec(struct linux_binprm *bprm) diff --git a/include/linux/binfmts.h b/include/linux/binfmts.h index b40fc63..2930253 100644 --- a/include/linux/binfmts.h +++ b/include/linux/binfmts.h @@ -44,7 +44,11 @@ struct linux_binprm { * exec has happened. Used to sanitize execution environment * and to set AT_SECURE auxv for glibc. */ - secureexec:1; + secureexec:1, + /* + * Set by flush_old_exec, when the cred_guard_mutex is taken. + */ + called_flush_old_exec:1; #ifdef __alpha__ unsigned int taso:1; #endif diff --git a/include/linux/sched/signal.h b/include/linux/sched/signal.h index 8805025..8f8e358 100644 --- a/include/linux/sched/signal.h +++ b/include/linux/sched/signal.h @@ -225,6 +225,8 @@ struct signal_struct { struct mutex cred_guard_mutex; /* guard against foreign influences on * credential calculations * (notably. ptrace) */ + bool cred_locked_in_execve; /* set while in execve, only valid when + * cred_guard_mutex is held */ } __randomize_layout; /* diff --git a/kernel/cred.c b/kernel/cred.c index 809a985..e4c78de 100644 --- a/kernel/cred.c +++ b/kernel/cred.c @@ -676,7 +676,7 @@ void __init cred_init(void) * * Returns the new credentials or NULL if out of memory. * - * Does not take, and does not return holding current->cred_replace_mutex. + * Does not take, and does not return holding ->cred_guard_mutex. */ struct cred *prepare_kernel_cred(struct task_struct *daemon) { diff --git a/kernel/ptrace.c b/kernel/ptrace.c index 43d6179..0f82bab 100644 --- a/kernel/ptrace.c +++ b/kernel/ptrace.c @@ -395,6 +395,10 @@ static int ptrace_attach(struct task_struct *task, long request, if (mutex_lock_interruptible(&task->signal->cred_guard_mutex)) goto out; + retval = -EAGAIN; + if (task->signal->cred_locked_in_execve) + goto unlock_creds; + task_lock(task); retval = __ptrace_may_access(task, PTRACE_MODE_ATTACH_REALCREDS); task_unlock(task); diff --git a/kernel/seccomp.c b/kernel/seccomp.c index b6ea3dc..3efa3e5 100644 --- a/kernel/seccomp.c +++ b/kernel/seccomp.c @@ -342,6 +342,9 @@ static inline pid_t seccomp_can_sync_threads(void) BUG_ON(!mutex_is_locked(¤t->signal->cred_guard_mutex)); assert_spin_locked(¤t->sighand->siglock); + if (current->signal->cred_locked_in_execve) + return -EAGAIN; + /* Validate all threads being eligible for synchronization. */ caller = current; for_each_thread(caller, thread) { diff --git a/mm/process_vm_access.c b/mm/process_vm_access.c index 357aa7b..b3e6eb5 100644 --- a/mm/process_vm_access.c +++ b/mm/process_vm_access.c @@ -204,7 +204,7 @@ static ssize_t process_vm_rw_core(pid_t pid, struct iov_iter *iter, if (!mm || IS_ERR(mm)) { rc = IS_ERR(mm) ? PTR_ERR(mm) : -ESRCH; /* - * Explicitly map EACCES to EPERM as EPERM is a more a + * Explicitly map EACCES to EPERM as EPERM is a more * appropriate error code for process_vw_readv/writev */ if (rc == -EACCES) diff --git a/tools/testing/selftests/ptrace/Makefile b/tools/testing/selftests/ptrace/Makefile index c0b7f89..2f1f532 100644 --- a/tools/testing/selftests/ptrace/Makefile +++ b/tools/testing/selftests/ptrace/Makefile @@ -1,6 +1,6 @@ # SPDX-License-Identifier: GPL-2.0-only -CFLAGS += -iquote../../../../include/uapi -Wall +CFLAGS += -std=c99 -pthread -iquote../../../../include/uapi -Wall -TEST_GEN_PROGS := get_syscall_info peeksiginfo +TEST_GEN_PROGS := get_syscall_info peeksiginfo vmaccess include ../lib.mk diff --git a/tools/testing/selftests/ptrace/vmaccess.c b/tools/testing/selftests/ptrace/vmaccess.c new file mode 100644 index 0000000..6d8a048 --- /dev/null +++ b/tools/testing/selftests/ptrace/vmaccess.c @@ -0,0 +1,66 @@ +// SPDX-License-Identifier: GPL-2.0+ +/* + * Copyright (c) 2020 Bernd Edlinger <bernd.edlinger@hotmail.de> + * All rights reserved. + * + * Check whether /proc/$pid/mem can be accessed without causing deadlocks + * when de_thread is blocked with ->cred_guard_mutex held. + */ + +#include "../kselftest_harness.h" +#include <stdio.h> +#include <fcntl.h> +#include <pthread.h> +#include <signal.h> +#include <unistd.h> +#include <sys/ptrace.h> + +static void *thread(void *arg) +{ + ptrace(PTRACE_TRACEME, 0, 0L, 0L); + return NULL; +} + +TEST(vmaccess) +{ + int f, pid = fork(); + char mm[64]; + + if (!pid) { + pthread_t pt; + + pthread_create(&pt, NULL, thread, NULL); + pthread_join(pt, NULL); + execlp("true", "true", NULL); + } + + sleep(1); + sprintf(mm, "/proc/%d/mem", pid); + f = open(mm, O_RDONLY); + ASSERT_LE(0, f); + close(f); + f = kill(pid, SIGCONT); + ASSERT_EQ(0, f); +} + +TEST(attach) +{ + int f, pid = fork(); + + if (!pid) { + pthread_t pt; + + pthread_create(&pt, NULL, thread, NULL); + pthread_join(pt, NULL); + execlp("true", "true", NULL); + } + + sleep(1); + f = ptrace(PTRACE_ATTACH, pid, 0L, 0L); + ASSERT_EQ(EAGAIN, errno); + ASSERT_EQ(f, -1); + f = kill(pid, SIGCONT); + ASSERT_EQ(0, f); +} + +TEST_HARNESS_MAIN
This fixes a deadlock in the tracer when tracing a multi-threaded application that calls execve while more than one thread are running. I observed that when running strace on the gcc test suite, it always blocks after a while, when expect calls execve, because other threads have to be terminated. They send ptrace events, but the strace is no longer able to respond, since it is blocked in vm_access. The deadlock is always happening when strace needs to access the tracees process mmap, while another thread in the tracee starts to execve a child process, but that cannot continue until the PTRACE_EVENT_EXIT is handled and the WIFEXITED event is received: strace D 0 30614 30584 0x00000000 Call Trace: __schedule+0x3ce/0x6e0 schedule+0x5c/0xd0 schedule_preempt_disabled+0x15/0x20 __mutex_lock.isra.13+0x1ec/0x520 __mutex_lock_killable_slowpath+0x13/0x20 mutex_lock_killable+0x28/0x30 mm_access+0x27/0xa0 process_vm_rw_core.isra.3+0xff/0x550 process_vm_rw+0xdd/0xf0 __x64_sys_process_vm_readv+0x31/0x40 do_syscall_64+0x64/0x220 entry_SYSCALL_64_after_hwframe+0x44/0xa9 expect D 0 31933 30876 0x80004003 Call Trace: __schedule+0x3ce/0x6e0 schedule+0x5c/0xd0 flush_old_exec+0xc4/0x770 load_elf_binary+0x35a/0x16c0 search_binary_handler+0x97/0x1d0 __do_execve_file.isra.40+0x5d4/0x8a0 __x64_sys_execve+0x49/0x60 do_syscall_64+0x64/0x220 entry_SYSCALL_64_after_hwframe+0x44/0xa9 The proposed solution is to take the cred_guard_mutex only in a critical section at the beginning, and at the end of the execve function, and let PTRACE_ATTACH fail with EAGAIN while execve is not complete, but other functions like vm_access are allowed to complete normally. This changes the lifetime of the cred_guard_mutex lock to be: - during prepare_bprm_creds() - from flush_old_exec() through install_exec_creds() Before, cred_guard_mutex was held from prepare_bprm_creds() through install_exec_creds(). I also took the opportunity to improve the documentation of prepare_creds, which is obviously out of sync. Signed-off-by: Bernd Edlinger <bernd.edlinger@hotmail.de> --- Documentation/security/credentials.rst | 19 +++++---- fs/exec.c | 41 ++++++++++++++++--- include/linux/binfmts.h | 6 ++- include/linux/sched/signal.h | 2 + kernel/cred.c | 2 +- kernel/ptrace.c | 4 ++ kernel/seccomp.c | 3 ++ mm/process_vm_access.c | 2 +- tools/testing/selftests/ptrace/Makefile | 4 +- tools/testing/selftests/ptrace/vmaccess.c | 66 +++++++++++++++++++++++++++++++ 10 files changed, 130 insertions(+), 19 deletions(-) create mode 100644 tools/testing/selftests/ptrace/vmaccess.c v2: adds a test case which passes when this patch is applied. v3: fixes the issue without introducing a new mutex. v4: fixes one comment and a formatting issue found by checkpatch.pl in the test case. v5: addresses review comments.