Message ID | 1d935e9d87fd8672ef3e8a9a0db340d355ea08b4.1720368770.git.asml.silence@gmail.com (mailing list archive) |
---|---|
State | New |
Headers | show |
Series | fix task_work interation with freezing | expand |
On 07/07, Pavel Begunkov wrote: > > io_uring can asynchronously add a task_work while the task is getting > freezed. TIF_NOTIFY_SIGNAL will prevent the task from sleeping in > do_freezer_trap(), and since the get_signal()'s relock loop doesn't > retry task_work, the task will spin there not being able to sleep > until the freezing is cancelled / the task is killed / etc. > > Cc: stable@vger.kernel.org > Link: https://github.com/systemd/systemd/issues/33626 > Fixes: 3146cba99aa28 ("io-wq: make worker creation resilient against signals") I don't think we should blame io_uring even if so far it is the only user of TWA_SIGNAL. Perhaps we should change do_freezer_trap() somehow, not sure... It assumes that TIF_SIGPENDING is the only reason to not sleep in TASK_INTERRUPTIBLE, today this is not true. > --- a/kernel/signal.c > +++ b/kernel/signal.c > @@ -2694,6 +2694,10 @@ bool get_signal(struct ksignal *ksig) > try_to_freeze(); > > relock: > + clear_notify_signal(); > + if (unlikely(task_work_pending(current))) > + task_work_run(); > + > spin_lock_irq(&sighand->siglock); Well, but can't we kill the same code at the start of get_signal() then? Of course, in this case get_signal() should check signal_pending(), not task_sigpending(). Or perhaps something like the patch below makes more sense? I dunno... Oleg. diff --git a/kernel/signal.c b/kernel/signal.c index 1f9dd41c04be..e2ae85293fbb 100644 --- a/kernel/signal.c +++ b/kernel/signal.c @@ -2676,6 +2676,7 @@ bool get_signal(struct ksignal *ksig) struct signal_struct *signal = current->signal; int signr; +start: clear_notify_signal(); if (unlikely(task_work_pending(current))) task_work_run(); @@ -2760,10 +2761,11 @@ bool get_signal(struct ksignal *ksig) if (current->jobctl & JOBCTL_TRAP_MASK) { do_jobctl_trap(); spin_unlock_irq(&sighand->siglock); + goto relock; } else if (current->jobctl & JOBCTL_TRAP_FREEZE) do_freezer_trap(); - - goto relock; + goto start; + } } /*
On 7/8/24 11:42, Oleg Nesterov wrote: > On 07/07, Pavel Begunkov wrote: >> >> io_uring can asynchronously add a task_work while the task is getting >> freezed. TIF_NOTIFY_SIGNAL will prevent the task from sleeping in >> do_freezer_trap(), and since the get_signal()'s relock loop doesn't >> retry task_work, the task will spin there not being able to sleep >> until the freezing is cancelled / the task is killed / etc. >> >> Cc: stable@vger.kernel.org >> Link: https://github.com/systemd/systemd/issues/33626 >> Fixes: 3146cba99aa28 ("io-wq: make worker creation resilient against signals") > > I don't think we should blame io_uring even if so far it is the only user > of TWA_SIGNAL. And it's not entirely correct even for backporting purposes, I'll pin it to when freezing was introduced then. > Perhaps we should change do_freezer_trap() somehow, not sure... It assumes > that TIF_SIGPENDING is the only reason to not sleep in TASK_INTERRUPTIBLE, > today this is not true. Let's CC Peter Zijlstra and Tejun in case they might have some input on that. Link to this patch for convenience: https://lore.kernel.org/all/1d935e9d87fd8672ef3e8a9a0db340d355ea08b4.1720368770.git.asml.silence@gmail.com/ >> --- a/kernel/signal.c >> +++ b/kernel/signal.c >> @@ -2694,6 +2694,10 @@ bool get_signal(struct ksignal *ksig) >> try_to_freeze(); >> >> relock: >> + clear_notify_signal(); >> + if (unlikely(task_work_pending(current))) >> + task_work_run(); >> + >> spin_lock_irq(&sighand->siglock); > > Well, but can't we kill the same code at the start of get_signal() then? > Of course, in this case get_signal() should check signal_pending(), not > task_sigpending(). Should be fine, but I didn't want to change the try_to_freeze() -> __refrigerator() path, which also reschedules. > Or perhaps something like the patch below makes more sense? I dunno... It needs a far backporting, I'd really prefer to keep it lean and without more side effects if possible, unless there is a strong opinion on that. > Oleg. > > diff --git a/kernel/signal.c b/kernel/signal.c > index 1f9dd41c04be..e2ae85293fbb 100644 > --- a/kernel/signal.c > +++ b/kernel/signal.c > @@ -2676,6 +2676,7 @@ bool get_signal(struct ksignal *ksig) > struct signal_struct *signal = current->signal; > int signr; > > +start: > clear_notify_signal(); > if (unlikely(task_work_pending(current))) > task_work_run(); > @@ -2760,10 +2761,11 @@ bool get_signal(struct ksignal *ksig) > if (current->jobctl & JOBCTL_TRAP_MASK) { > do_jobctl_trap(); > spin_unlock_irq(&sighand->siglock); > + goto relock; > } else if (current->jobctl & JOBCTL_TRAP_FREEZE) > do_freezer_trap(); > - > - goto relock; > + goto start; > + } > } > > /* >
On 07/08, Pavel Begunkov wrote: > > On 7/8/24 11:42, Oleg Nesterov wrote: > >I don't think we should blame io_uring even if so far it is the only user > >of TWA_SIGNAL. > > And it's not entirely correct even for backporting purposes, > I'll pin it to when freezing was introduced then. This is another problem introduced by 12db8b690010 ("entry: Add support for TIF_NOTIFY_SIGNAL") We need much more changes. Say, zap_threads() does the same and assumes that only SIGKILL or freezeing can make dump_interrupted() true. There are more similar problems. I'll try to think, so far I do not see a simple solution... As for this particular problem, I agree it needs a simple/backportable fix. > >> relock: > >>+ clear_notify_signal(); > >>+ if (unlikely(task_work_pending(current))) > >>+ task_work_run(); > >>+ > >> spin_lock_irq(&sighand->siglock); > > > >Well, but can't we kill the same code at the start of get_signal() then? > >Of course, in this case get_signal() should check signal_pending(), not > >task_sigpending(). > > Should be fine, Well, not really at least performance-wise... get_signal() should return asap if TIF_NOTIFY_SIGNAL was the only reason to call get_signal(). > but I didn't want to change the > try_to_freeze() -> __refrigerator() path, which also reschedules. Could you spell please? > >Or perhaps something like the patch below makes more sense? I dunno... > > It needs a far backporting, I'd really prefer to keep it > lean and without more side effects if possible, unless > there is a strong opinion on that. Well, I don't think my patch is really worse in this sense. Just it is buggy ;) it needs another recalc_sigpending() before goto start, so lets forget it. So I am starting to agree with your change as a workaround until we find a clean solution (if ever ;). But can I ask you to add this additional clear_notify_signal() + task_work_run() to the end of do_freezer_trap() ? get_signal() is already a mess... ----------------------------------------------------------------------- Either way I have no idea whether a cgroup_task_frozen() task should react to task_work_add(TWA_SIGNAL) or not. Documentation/admin-guide/cgroup-v2.rst says Writing "1" to the file causes freezing of the cgroup and all descendant cgroups. This means that all belonging processes will be stopped and will not run until the cgroup will be explicitly unfrozen. AFAICS this is not accurate, they can run but can't return to user-mode. So I guess task_work_run() is fine. Oleg.
On 7/9/24 11:36, Oleg Nesterov wrote: > On 07/08, Pavel Begunkov wrote: >> >> On 7/8/24 11:42, Oleg Nesterov wrote: >>> I don't think we should blame io_uring even if so far it is the only user >>> of TWA_SIGNAL. >> >> And it's not entirely correct even for backporting purposes, >> I'll pin it to when freezing was introduced then. > > This is another problem introduced by 12db8b690010 ("entry: Add support for > TIF_NOTIFY_SIGNAL") Ah, yes, I forgot NOTIFY_SIGNAL was split out of SIGPENDING > We need much more changes. Say, zap_threads() does the same and assumes > that only SIGKILL or freezeing can make dump_interrupted() true. > > There are more similar problems. I'll try to think, so far I do not see > a simple solution... Thanks. And there was some patching done before against dumping being interrupted by task_work, indeed a reoccurring issue. > As for this particular problem, I agree it needs a simple/backportable fix. > >>>> relock: >>>> + clear_notify_signal(); >>>> + if (unlikely(task_work_pending(current))) >>>> + task_work_run(); >>>> + >>>> spin_lock_irq(&sighand->siglock); >>> >>> Well, but can't we kill the same code at the start of get_signal() then? >>> Of course, in this case get_signal() should check signal_pending(), not >>> task_sigpending(). >> >> Should be fine, > > Well, not really at least performance-wise... get_signal() should return > asap if TIF_NOTIFY_SIGNAL was the only reason to call get_signal(). > >> but I didn't want to change the >> try_to_freeze() -> __refrigerator() path, which also reschedules. > > Could you spell please? Let's say it calls get_signal() for freezing with a task_work pending. Currently, it executes task_work and calls try_to_freeze(), which puts the task to sleep. If we remove that task_work_run() before try_to_freeze(), it would not be able to sleep. Sounds like it should be fine, it races anyway, but I'm trying to avoid side effect for fixes. >>> Or perhaps something like the patch below makes more sense? I dunno... >> >> It needs a far backporting, I'd really prefer to keep it >> lean and without more side effects if possible, unless >> there is a strong opinion on that. > > Well, I don't think my patch is really worse in this sense. Just it > is buggy ;) it needs another recalc_sigpending() before goto start, > so lets forget it. > > So I am starting to agree with your change as a workaround until we > find a clean solution (if ever ;). > > But can I ask you to add this additional clear_notify_signal() + > task_work_run() to the end of do_freezer_trap() ? get_signal() is > already a mess... Will change > ----------------------------------------------------------------------- > Either way I have no idea whether a cgroup_task_frozen() task should > react to task_work_add(TWA_SIGNAL) or not. > > Documentation/admin-guide/cgroup-v2.rst says > > Writing "1" to the file causes freezing of the cgroup and all > descendant cgroups. This means that all belonging processes will > be stopped and will not run until the cgroup will be explicitly > unfrozen. > > AFAICS this is not accurate, they can run but can't return to user-mode. > So I guess task_work_run() is fine. IIUC it's a user facing doc, so maybe it's accurate enough from that perspective. But I do agree that the semantics around task_work is not exactly clear.
Hello, On Tue, Jul 09, 2024 at 03:05:21PM +0100, Pavel Begunkov wrote: > > ----------------------------------------------------------------------- > > Either way I have no idea whether a cgroup_task_frozen() task should > > react to task_work_add(TWA_SIGNAL) or not. > > > > Documentation/admin-guide/cgroup-v2.rst says > > > > Writing "1" to the file causes freezing of the cgroup and all > > descendant cgroups. This means that all belonging processes will > > be stopped and will not run until the cgroup will be explicitly > > unfrozen. > > > > AFAICS this is not accurate, they can run but can't return to user-mode. > > So I guess task_work_run() is fine. > > IIUC it's a user facing doc, so maybe it's accurate enough from that > perspective. But I do agree that the semantics around task_work is > not exactly clear. A good correctness test for cgroup freezer is whether it'd be safe to snapshot and restore the tasks in the cgroup while frozen. If that works, it's most likely fine. Thanks.
Hi Tejun, Thanks for looking at this, can you review this V2 patch from Pavel? To me it makes sense even without 1/2 which I didn't even bother to read. At least as a simple workaround for now. On 07/09, Tejun Heo wrote: > > Hello, > > On Tue, Jul 09, 2024 at 03:05:21PM +0100, Pavel Begunkov wrote: > > > ----------------------------------------------------------------------- > > > Either way I have no idea whether a cgroup_task_frozen() task should > > > react to task_work_add(TWA_SIGNAL) or not. > > > > > > Documentation/admin-guide/cgroup-v2.rst says > > > > > > Writing "1" to the file causes freezing of the cgroup and all > > > descendant cgroups. This means that all belonging processes will > > > be stopped and will not run until the cgroup will be explicitly > > > unfrozen. > > > > > > AFAICS this is not accurate, they can run but can't return to user-mode. > > > So I guess task_work_run() is fine. > > > > IIUC it's a user facing doc, so maybe it's accurate enough from that > > perspective. But I do agree that the semantics around task_work is > > not exactly clear. > > A good correctness test for cgroup freezer is whether it'd be safe to > snapshot and restore the tasks in the cgroup while frozen. Well, I don't really understand what can snapshot/restore actually mean... I forgot everything about cgroup freezer and I am already sleeping, but even if we forget about task_work_add/TIF_NOTIFY_SIGNAL/etc, afaics ptrace can change the state of cgroup_task_frozen() task between snapshot and restore ? Oleg.
On 7/9/24 20:07, Oleg Nesterov wrote: > Hi Tejun, > > Thanks for looking at this, can you review this V2 patch from Pavel? > To me it makes sense even without 1/2 which I didn't even bother to > read. At least as a simple workaround for now. They are kind of separate but without 1/2 this patch creates another infinite loop, even though it's harder to hit and is io_uring specific. > On 07/09, Tejun Heo wrote: >> >> Hello, >> >> On Tue, Jul 09, 2024 at 03:05:21PM +0100, Pavel Begunkov wrote: >>>> ----------------------------------------------------------------------- >>>> Either way I have no idea whether a cgroup_task_frozen() task should >>>> react to task_work_add(TWA_SIGNAL) or not. >>>> >>>> Documentation/admin-guide/cgroup-v2.rst says >>>> >>>> Writing "1" to the file causes freezing of the cgroup and all >>>> descendant cgroups. This means that all belonging processes will >>>> be stopped and will not run until the cgroup will be explicitly >>>> unfrozen. >>>> >>>> AFAICS this is not accurate, they can run but can't return to user-mode. >>>> So I guess task_work_run() is fine. >>> >>> IIUC it's a user facing doc, so maybe it's accurate enough from that >>> perspective. But I do agree that the semantics around task_work is >>> not exactly clear. >> >> A good correctness test for cgroup freezer is whether it'd be safe to >> snapshot and restore the tasks in the cgroup while frozen. > > Well, I don't really understand what can snapshot/restore actually mean... CRIU, I assume. I'll try it ... > I forgot everything about cgroup freezer and I am already sleeping, but even > if we forget about task_work_add/TIF_NOTIFY_SIGNAL/etc, afaics ptrace can > change the state of cgroup_task_frozen() task between snapshot and restore ? ... but I'm inclined to think the patch makes sense regardless, we're replacing an infinite loop with wait-wake-execute-wait.
On 07/09, Pavel Begunkov wrote: > > On 7/9/24 20:07, Oleg Nesterov wrote: > >Hi Tejun, > > > >Thanks for looking at this, can you review this V2 patch from Pavel? Just in case, I obviously meant our next (V2) patch [PATCH v2 2/2] kernel: rerun task_work while freezing in get_signal() https://lore.kernel.org/all/149ff5a762997c723880751e8a4019907a0b6457.1720534425.git.asml.silence@gmail.com/ > >Well, I don't really understand what can snapshot/restore actually mean... > > CRIU, I assume. I'll try it ... Than I think we can forget about task_works and this patch. CRIU dumps the tasks in TASK_TRACED state. > ... but I'm inclined to think the patch makes sense regardless, > we're replacing an infinite loop with wait-wake-execute-wait. Agreed. Oleg.
On 7/9/24 20:38, Oleg Nesterov wrote: > On 07/09, Pavel Begunkov wrote: >> >> On 7/9/24 20:07, Oleg Nesterov wrote: >>> Hi Tejun, >>> >>> Thanks for looking at this, can you review this V2 patch from Pavel? > > Just in case, I obviously meant our next (V2) patch > > [PATCH v2 2/2] kernel: rerun task_work while freezing in get_signal() > https://lore.kernel.org/all/149ff5a762997c723880751e8a4019907a0b6457.1720534425.git.asml.silence@gmail.com/ > >>> Well, I don't really understand what can snapshot/restore actually mean... >> >> CRIU, I assume. I'll try it ... > > Than I think we can forget about task_works and this patch. CRIU dumps > the tasks in TASK_TRACED state. And would be hard to test, io_uring (the main source of task_work) is not supported (00.466022) Error (criu/proc_parse.c:477): Unknown shit 600 (anon_inode:[io_uring]) ... (00.467642) Unfreezing tasks into 1 (00.467656) Unseizing 15488 into 1 (00.468149) Error (criu/cr-dump.c:2111): Dumping FAILED. >> ... but I'm inclined to think the patch makes sense regardless, >> we're replacing an infinite loop with wait-wake-execute-wait. > > Agreed.
Hello, On Tue, Jul 09, 2024 at 08:55:43PM +0100, Pavel Begunkov wrote: ... > > > CRIU, I assume. I'll try it ... > > > > Than I think we can forget about task_works and this patch. CRIU dumps > > the tasks in TASK_TRACED state. > > And would be hard to test, io_uring (the main source of task_work) > is not supported > > (00.466022) Error (criu/proc_parse.c:477): Unknown shit 600 (anon_inode:[io_uring]) > ... > (00.467642) Unfreezing tasks into 1 > (00.467656) Unseizing 15488 into 1 > (00.468149) Error (criu/cr-dump.c:2111): Dumping FAILED. Yeah, the question is: If CRIU is to use cgroup freezer to freeze the tasks and then go around tracing each to make dump, would the freezer be enough in avoiding interim state changes? Using CRIU implementation is a bit arbitrary but I think checkpoint-restart is a useful bar to measure what should stay stable while a cgroup is frozen. Thanks.
On 7/10/24 01:54, Tejun Heo wrote: > Hello, > > On Tue, Jul 09, 2024 at 08:55:43PM +0100, Pavel Begunkov wrote: > ... >>>> CRIU, I assume. I'll try it ... >>> >>> Than I think we can forget about task_works and this patch. CRIU dumps >>> the tasks in TASK_TRACED state. >> >> And would be hard to test, io_uring (the main source of task_work) >> is not supported >> >> (00.466022) Error (criu/proc_parse.c:477): Unknown shit 600 (anon_inode:[io_uring]) >> ... >> (00.467642) Unfreezing tasks into 1 >> (00.467656) Unseizing 15488 into 1 >> (00.468149) Error (criu/cr-dump.c:2111): Dumping FAILED. > > Yeah, the question is: If CRIU is to use cgroup freezer to freeze the tasks > and then go around tracing each to make dump, would the freezer be enough in > avoiding interim state changes? Using CRIU implementation is a bit arbitrary > but I think checkpoint-restart is a useful bar to measure what should stay > stable while a cgroup is frozen. Sounds like in the long run we might want to ignore task_work while it's frozen, but hard to say for all task_work users.
On 07/10, Pavel Begunkov wrote: > > On 7/10/24 01:54, Tejun Heo wrote: > >Yeah, the question is: If CRIU is to use cgroup freezer to freeze the tasks > >and then go around tracing each to make dump, would the freezer be enough in > >avoiding interim state changes? Using CRIU implementation is a bit arbitrary > >but I think checkpoint-restart is a useful bar to measure what should stay > >stable while a cgroup is frozen. > > Sounds like in the long run we might want to ignore task_work while > it's frozen, Just in case, this is what I have in mind right now, but I am still not sure and can't make a "clean" patch. If nothing else. CRIU needs to attach and make this task TASK_TRACED, right? And once the target task is traced, it won't react to task_work_add(TWA_SIGNAL). Oleg.
Hello, On Wed, Jul 10, 2024 at 09:10:16PM +0200, Oleg Nesterov wrote: ... > If nothing else. CRIU needs to attach and make this task TASK_TRACED, right? Yeah, AFAIK, that's the only way to implement check-pointing for now. > And once the target task is traced, it won't react to task_work_add(TWA_SIGNAL). I don't know how task_work is being used but the requirement would be that if a cgroup is frozen, task_works shouldn't be making state changes which can't safely be replayed (e.g. by restarting the frozen syscalls). If anything task_works do can just be reproduced by restarting the currently in-flight and frozen syscalls, it should be okay to leave them running. Otherwise, it'd be better to freeze them together. As this thing is kinda difficult to reason about, it'd probably be easier to just freeze them together if we can. Thanks.
On 07/10, Tejun Heo wrote: > > Hello, > > On Wed, Jul 10, 2024 at 09:10:16PM +0200, Oleg Nesterov wrote: > ... > > If nothing else. CRIU needs to attach and make this task TASK_TRACED, right? > > Yeah, AFAIK, that's the only way to implement check-pointing for now. OK, > > And once the target task is traced, it won't react to task_work_add(TWA_SIGNAL). > > I don't know how task_work is being used but the requirement would be that > if a cgroup is frozen, task_works shouldn't be making state changes which > can't safely be replayed (e.g. by restarting the frozen syscalls). Well, in theory task_work can do "anything". Of course, it can't, say, restart a frozen syscall, task_work_run() just executes the callbacks in kernel mode and returns. > it'd be better to freeze them together. And I tend to agree. simply beacase do_freezer_trap() (and more users of clear_thread_flag(TIF_SIGPENDING) + schedule(TASK_INTERRUPTIBLE) pattern) do not take TIF_NOTIFY_SIGNAL into account. But how do you think this patch can make the things worse wrt CRIU ? And let's even forget this patch which fixes the real problem. How do you think the fact that the task sleeping in do_freezer_trap() can react to TIF_NOTIFY_SIGNAL, call task_work_run(), and then sleep in do_freezer_trap() again can make any difference in this sense? > As this thing is kinda difficult to reason about, Agreed, > it'd probably be easier to just freeze them together if we can. Agreed, but this needs some "generic" changes while Pavel needs a simple and backportable workaround to suppress a real problem. In short, I don't like this patch either, I just don't see a better solution for now ;) Thanks, Oleg.
Hello, On Wed, Jul 10, 2024 at 11:34:19PM +0200, Oleg Nesterov wrote: ... > > it'd be better to freeze them together. > > And I tend to agree. simply beacase do_freezer_trap() (and more users of > clear_thread_flag(TIF_SIGPENDING) + schedule(TASK_INTERRUPTIBLE) pattern) > do not take TIF_NOTIFY_SIGNAL into account. > > But how do you think this patch can make the things worse wrt CRIU ? Oh, I'm not arguing against the patch at all. Just daydreaming about what future cleanups should look like. ... > In short, I don't like this patch either, I just don't see a better > solution for now ;) I think we're on the same page. Thanks.
On 07/10, Tejun Heo wrote: > > > But how do you think this patch can make the things worse wrt CRIU ? > > Oh, I'm not arguing against the patch at all. Just daydreaming about what > future cleanups should look like. Ah, sorry, I misunderstood you! > > In short, I don't like this patch either, I just don't see a better > > solution for now ;) > > I think we're on the same page. Yes, and I agree with your concerns and your thoughts about future cleanups. Oleg.
diff --git a/kernel/signal.c b/kernel/signal.c index 1f9dd41c04be..790d60fcfff0 100644 --- a/kernel/signal.c +++ b/kernel/signal.c @@ -2694,6 +2694,10 @@ bool get_signal(struct ksignal *ksig) try_to_freeze(); relock: + clear_notify_signal(); + if (unlikely(task_work_pending(current))) + task_work_run(); + spin_lock_irq(&sighand->siglock); /*
io_uring can asynchronously add a task_work while the task is getting freezed. TIF_NOTIFY_SIGNAL will prevent the task from sleeping in do_freezer_trap(), and since the get_signal()'s relock loop doesn't retry task_work, the task will spin there not being able to sleep until the freezing is cancelled / the task is killed / etc. Cc: stable@vger.kernel.org Link: https://github.com/systemd/systemd/issues/33626 Fixes: 3146cba99aa28 ("io-wq: make worker creation resilient against signals") Reported-by: Julian Orth <ju.orth@gmail.com> Signed-off-by: Pavel Begunkov <asml.silence@gmail.com> --- kernel/signal.c | 4 ++++ 1 file changed, 4 insertions(+)