Message ID: 20180604195858.GD12217@hirez.programming.kicks-ass.net (mailing list archive)
State: Not Applicable, archived
Delegated to: Mike Snitzer
On Mon, Jun 4, 2018 at 12:59 PM Peter Zijlstra <peterz@infradead.org> wrote:
>
> The below will of course conflict with the merge request under
> discussion. Also completely untested.

Side note: it's not just swake_up() itself. It's very much the
"prepare_to_swait()" logic and the event waiting friends.

For a regular non-exclusive wait-queue, this is all reasonably simple.
Want to wait on an event? Go right ahead. You waiting on it doesn't
affect anybody else waiting on it.

For an swait() queue, though? What are the semantics of two different
users doing something like

	swait_event_interruptible(wq, <event>);

when coupled with "swake_up_one()"?

The whole notion of "wait until this is true" is not an exclusive wait
per se. And being "interruptible" means that we definitely can be in
the situation that we added ourselves to the wait-queue, but then a
signal happened, so we're going to exit. But what if we were the *one*
thread who got woken up? We can't just return with an error code,
because then we've lost a wakeup.

We do actually have this issue for regular wait-queues too, and I think
we should try to clarify it there as well:
wait_event_interruptible_exclusive() is simply a pretty dangerous thing
to do. But at least (again) it has that "exclusive()" in the name, so
_hopefully_ users think about it. The danger is pretty explicit.

The rule for exclusive waits has to be that even if you're
interruptible, you have to first check the event you're waiting for -
because you *have* to take it if it's there. Alternatively, if you're
returning with -EINTR or something, and you were woken up, you need to
do wake_up_one() on the queue you were waiting on. But by then it's
already hard to tell whether you were woken up by the event or by being
interruptible, so that's already more complicated (it's always safe -
though possibly bad for performance - to wake up too much).

So I really wish that all the swait interfaces had more of a red flag.
Because right now they all look simple, and they really aren't. They
are all equivalent to the really *complex* cases of the regular
wait-queues.

Again, the "normal" wait queue interfaces that look simple really _are_
simple. It's simply safe to call

	wait_event{_interruptible}(wq, <event>)

and there are absolutely no subtle gotchas. You're not affecting any
other waiters, and it's doing exactly what you think it's doing from
reading the code.

In contrast, swait_event(wq, <event>) is much subtler. You're not just
waiting on the event, you're also potentially blocking some other
waiter, because only one of you will be woken up.

So maybe along with "swake_up()" -> "swake_up_one()", we should also
rename all the "swait_event*()" functions to "swait_event*_exclusive()".
Same for the "prepare_to_swait()" etc cases: add the "_exclusive()" at
the end to mark their special semantics.

Then the "swait" functions would actually have the same semantics and
gotchas as the (very specialized) subset of regular wait-queue event
functions.

I think that would cover my worries.

Hmm?

                Linus

--
dm-devel mailing list
dm-devel@redhat.com
https://www.redhat.com/mailman/listinfo/dm-devel
On Mon, Jun 04 2018 at 3:58pm -0400,
Peter Zijlstra <peterz@infradead.org> wrote:

> On Mon, Jun 04, 2018 at 12:39:11PM -0700, Linus Torvalds wrote:
> > On Mon, Jun 4, 2018 at 12:37 PM Peter Zijlstra <peterz@infradead.org> wrote:
> > >
> > > Would it help if we did s/swake_up/swake_up_one/g ?
> > >
> > > Then there would not be an swake_up() to cause confusion.
> >
> > Yes, I think that would already be a big improvement, forcing people
> > to be aware of the exclusive nature.
>
> The below will of course conflict with the merge request under
> discussion. Also completely untested.

No worries there since I'll be resubmitting dm-writecache for 4.19.

(Mikulas would like to still use swait for the dm-writecache's endio
thread, since endio_thread_wait only has a single waiter. I told him to
convert the other 2, benchmark it with swait still in the endio path,
then convert the one used in endio, benchmark with all converted, and
we'd revisit if there is a compelling performance difference. But even
then I'm not sure I want DM on the list of swait consumers... to be
continued)
On Mon, Jun 4, 2018 at 2:13 PM Mike Snitzer <snitzer@redhat.com> wrote:
>
> (Mikulas would like to still use swait for the dm-writecache's endio
> thread, since endio_thread_wait only has a single waiter.

If you already know it has a single waiter, please don't use a queue at all.

Just have the "struct task_struct *" in the waiter field, and use
"wake_up_process()". Use NULL for "no process".

That's *much* simpler than swait(), and is a well-accepted traditional
wake interface. It's also really really obvious.

The "there is only a single waiter" is *NOT* an excuse for using swait.
Quite the reverse. Using swait is stupid, slow, and complex. And it
generates code that is harder to understand.

And yes, the fact that KVM also made that completely idiotic choice in
their apic code is not an excuse either. I have no idea why they did it
either. It's stupid. In the kvm case, I think what happened was that
they had a historical wait-queue model, and they just didn't realize
that they always had just one waiter, so then they converted a
waitqueue to a swait queue.

But if you already know there is only ever one waiter, please don't do
that. Just avoid queues entirely.

                Linus
On Mon, 4 Jun 2018, Linus Torvalds wrote:

> On Mon, Jun 4, 2018 at 2:13 PM Mike Snitzer <snitzer@redhat.com> wrote:
> >
> > (Mikulas would like to still use swait for the dm-writecache's endio
> > thread, since endio_thread_wait only has a single waiter.
>
> If you already know it has a single waiter, please don't use a queue at all.
>
> Just have the "struct task_struct *" in the waiter field, and use
> "wake_up_process()". Use NULL for "no process".

I'd be interested - does the kernel deal properly with spurious
wake-ups? I.e. suppose that the kernel thread that I created is doing
something else in a completely different subsystem - can I call
wake_up_process on it? Could it confuse some unrelated code? The
commonly used synchronization primitives recheck the condition after
wake-up, but it's hard to verify that the whole kernel does it.

> That's *much* simpler than swait(), and is a well-accepted traditional
> wake interface. It's also really really obvious.
>
> The "there is only a single waiter" is *NOT* an excuse for using
> swait. Quite the reverse. Using swait is stupid, slow, and complex.
> And it generates code that is harder to understand.

It looked to me like the standard wait-queues suffer from feature creep
(three flags, a high number of functions and macros, even an indirect
call to wake something up) - that's why I used swait.

> And yes, the fact that KVM also made that completely idiotic choice in
> their apic code is not an excuse either. I have no idea why they did
> it either. It's stupid. In the kvm case, I think what happened was
> that they had a historical wait-queue model, and they just didn't
> realize that they always had just one waiter, so then they converted
> a waitqueue to a swait queue.
>
> But if you already know there is only ever one waiter, please don't do
> that. Just avoid queues entirely.
>
> Linus

Mikulas
On Mon, Jun 4, 2018 at 2:53 PM Mikulas Patocka <mpatocka@redhat.com> wrote:
>
> I'd be interested - does the kernel deal properly with spurious
> wake-ups? I.e. suppose that the kernel thread that I created is doing
> something else in a completely different subsystem - can I call
> wake_up_process on it? Could it confuse some unrelated code?

We've always had that issue, and yes, we should handle it fine. Code
that doesn't handle it fine is broken, but I don't think we've ever had
that situation.

For example, a common case of "spurious" wakeups is when somebody adds
itself to a wait list, but then ends up doing other things (including
taking page faults because of user access etc). The wait-list is still
active, and events on the wait list will still wake people up, even if
they are sleeping on some *other* list too.

In fact, an example of spurious wakeups comes from just using regular
futexes. We send those locklessly, and you actually can get a futex
wakeup *after* you thought you removed yourself from the futex queue.

But that's actually only an example of the much more generic issue -
we've always supported having multiple sources of wakeups, so
"spurious" wakeups have always been a thing. People are probably not so
aware of it, because they've never been an actual _problem_.

Why? Our sleep/wake model has never been that "I woke up, so what I
waited on must be done". Our sleep/wake model has always been one where
being woken up just means that you go back and repeat the checks.

The whole "wait_event()" loop is the most core example of that model,
but that's actually not the *traditional* model. Our really traditional
model of waiting for something actually predates wait_event(), and is
an explicit loop like

	add_to_wait_queue(..);
	for (;;) {
		set_task_state(TASK_INTERRUPTIBLE);
		.. see if we need to sleep, exit if ok ..
		schedule();
	}
	remove_from_wait_queue(..);

so even pretty much from day #1, the whole notion of "spurious wake
events" is a non-issue.

(We did have a legacy "sleep_on()" interface back in the dark ages, but
even that was supposed to be used in a loop).

> The commonly used synchronization primitives recheck the condition after
> wake-up, but it's hard to verify that the whole kernel does it.

See above. We have those spurious wakeups already.

> It looked to me like the standard wait-queues suffer from feature creep
> (three flags, a high number of functions and macros, even an indirect
> call to wake something up) - that's why I used swait.

I agree that the standard wait-queues have gotten much more complex
over the years. But apart from the wait entries being a bit big, they
actually should not perform badly.

The real problem with wait-queues is that because of their semantics,
you *can* end up walking the whole queue, waking up hundreds (or
thousands) of processes. That can be a latency issue for RT. But the
answer to that tends to be "don't do that then". If you have
wait-queues that can have thousands of entries, there's likely
something seriously wrong somewhere. We've had it, but it's very very
rare.

                Linus
On Mon, Jun 04, 2018 at 03:16:31PM -0700, Linus Torvalds wrote:
> We've always had that issue, and yes, we should handle it fine. Code
> that doesn't handle it fine is broken, but I don't think we've ever
> had that situation.

We've had a whole bunch of broken. We fixed a pile of them a few years
back, but I'm sure that if you look hard you can still find a few.

The one commit I can readily find is:

  91f65facba5a ("iommu/amd: Fix amd_iommu_free_device()")

But I'm fairly sure there was more..
On Tue, Jun 5, 2018 at 1:59 AM Peter Zijlstra <peterz@infradead.org> wrote:
>
> We've had a whole bunch of broken. We fixed a pile of them a few
> years back but I'm sure that if you look hard you can still find a few.
>
> The one commit I can readily find is:
>
>   91f65facba5a ("iommu/amd: Fix amd_iommu_free_device()")

Ugh, ok, I stand corrected.

These things definitely are bugs, and they aren't even because of some
"old model", because we've pretty much always had the "loop to wait"
rule.

Even going back to linux-0.01, when we had that "sleep_on()" primitive
that is really bad (because of fundamental races), we had code like
this:

	static inline void wait_on_buffer(struct buffer_head * bh)
	{
		cli();
		while (bh->b_lock)
			sleep_on(&bh->b_wait);
		sti();
	}

and the above wasn't some fluke that I searched for; they're all that
way (and by "all" I mean "the first three I looked at", which probably
means that if I had looked at a fourth, I'd have found somebody
violating the "loop on condition" rule ;)

                Linus
diff --git a/arch/mips/kvm/mips.c b/arch/mips/kvm/mips.c
index 0f725e9cee8f..612bc713c4a1 100644
--- a/arch/mips/kvm/mips.c
+++ b/arch/mips/kvm/mips.c
@@ -515,7 +515,7 @@ int kvm_vcpu_ioctl_interrupt(struct kvm_vcpu *vcpu,
 	dvcpu->arch.wait = 0;
 
 	if (swq_has_sleeper(&dvcpu->wq))
-		swake_up(&dvcpu->wq);
+		swake_up_one(&dvcpu->wq);
 
 	return 0;
 }
@@ -1204,7 +1204,7 @@ static void kvm_mips_comparecount_func(unsigned long data)
 	vcpu->arch.wait = 0;
 
 	if (swq_has_sleeper(&vcpu->wq))
-		swake_up(&vcpu->wq);
+		swake_up_one(&vcpu->wq);
 }
 
 /* low level hrtimer wake routine */
diff --git a/arch/powerpc/kvm/book3s_hv.c b/arch/powerpc/kvm/book3s_hv.c
index 4d07fca5121c..f1bcf1875171 100644
--- a/arch/powerpc/kvm/book3s_hv.c
+++ b/arch/powerpc/kvm/book3s_hv.c
@@ -190,7 +190,7 @@ static void kvmppc_fast_vcpu_kick_hv(struct kvm_vcpu *vcpu)
 
 	wqp = kvm_arch_vcpu_wq(vcpu);
 	if (swq_has_sleeper(wqp)) {
-		swake_up(wqp);
+		swake_up_one(wqp);
 		++vcpu->stat.halt_wakeup;
 	}
 
@@ -3224,7 +3224,7 @@ static int kvmppc_run_vcpu(struct kvm_run *kvm_run, struct kvm_vcpu *vcpu)
 			kvmppc_start_thread(vcpu, vc);
 			trace_kvm_guest_enter(vcpu);
 		} else if (vc->vcore_state == VCORE_SLEEPING) {
-			swake_up(&vc->wq);
+			swake_up_one(&vc->wq);
 		}
 
 	}
diff --git a/arch/s390/kvm/interrupt.c b/arch/s390/kvm/interrupt.c
index 37d06e022238..f58113cbee4d 100644
--- a/arch/s390/kvm/interrupt.c
+++ b/arch/s390/kvm/interrupt.c
@@ -1145,7 +1145,7 @@ void kvm_s390_vcpu_wakeup(struct kvm_vcpu *vcpu)
 	 * yield-candidate.
 	 */
	vcpu->preempted = true;
-	swake_up(&vcpu->wq);
+	swake_up_one(&vcpu->wq);
 	vcpu->stat.halt_wakeup++;
 }
 /*
diff --git a/arch/x86/kernel/kvm.c b/arch/x86/kernel/kvm.c
index 5b2300b818af..db6ebe48d991 100644
--- a/arch/x86/kernel/kvm.c
+++ b/arch/x86/kernel/kvm.c
@@ -188,7 +188,7 @@ static void apf_task_wake_one(struct kvm_task_sleep_node *n)
 	if (n->halted)
 		smp_send_reschedule(n->cpu);
 	else if (swq_has_sleeper(&n->wq))
-		swake_up(&n->wq);
+		swake_up_one(&n->wq);
 }
 
 static void apf_task_wake_all(void)
diff --git a/arch/x86/kvm/lapic.c b/arch/x86/kvm/lapic.c
index b74c9c1405b9..a2f8c4c76d33 100644
--- a/arch/x86/kvm/lapic.c
+++ b/arch/x86/kvm/lapic.c
@@ -1379,7 +1379,7 @@ static void apic_timer_expired(struct kvm_lapic *apic)
 	 * using swait_active() is safe.
 	 */
 	if (swait_active(q))
-		swake_up(q);
+		swake_up_one(q);
 
 	if (apic_lvtt_tscdeadline(apic))
 		ktimer->expired_tscdeadline = ktimer->tscdeadline;
diff --git a/include/linux/swait.h b/include/linux/swait.h
index 84f9745365ff..6224bbb08cf5 100644
--- a/include/linux/swait.h
+++ b/include/linux/swait.h
@@ -145,7 +145,7 @@ static inline bool swq_has_sleeper(struct swait_queue_head *wq)
 	return swait_active(wq);
 }
 
-extern void swake_up(struct swait_queue_head *q);
+extern void swake_up_one(struct swait_queue_head *q);
 extern void swake_up_all(struct swait_queue_head *q);
 extern void swake_up_locked(struct swait_queue_head *q);
diff --git a/kernel/rcu/srcutiny.c b/kernel/rcu/srcutiny.c
index 622792abe41a..790876228824 100644
--- a/kernel/rcu/srcutiny.c
+++ b/kernel/rcu/srcutiny.c
@@ -110,7 +110,7 @@ void __srcu_read_unlock(struct srcu_struct *sp, int idx)
 	WRITE_ONCE(sp->srcu_lock_nesting[idx], newval);
 	if (!newval && READ_ONCE(sp->srcu_gp_waiting))
-		swake_up(&sp->srcu_wq);
+		swake_up_one(&sp->srcu_wq);
 }
 EXPORT_SYMBOL_GPL(__srcu_read_unlock);
diff --git a/kernel/rcu/tree.c b/kernel/rcu/tree.c
index 4fccdfa25716..c5fbd74c8af0 100644
--- a/kernel/rcu/tree.c
+++ b/kernel/rcu/tree.c
@@ -1727,7 +1727,7 @@ static void rcu_gp_kthread_wake(struct rcu_state *rsp)
 	    !READ_ONCE(rsp->gp_flags) || !rsp->gp_kthread)
 		return;
-	swake_up(&rsp->gp_wq);
+	swake_up_one(&rsp->gp_wq);
 }
 
 /*
diff --git a/kernel/rcu/tree_exp.h b/kernel/rcu/tree_exp.h
index d40708e8c5d6..027a432a167d 100644
--- a/kernel/rcu/tree_exp.h
+++ b/kernel/rcu/tree_exp.h
@@ -212,7 +212,7 @@ static void __rcu_report_exp_rnp(struct rcu_state *rsp, struct rcu_node *rnp,
 			raw_spin_unlock_irqrestore_rcu_node(rnp, flags);
 			if (wake) {
 				smp_mb(); /* EGP done before wake_up(). */
-				swake_up(&rsp->expedited_wq);
+				swake_up_one(&rsp->expedited_wq);
 			}
 			break;
 		}
diff --git a/kernel/rcu/tree_plugin.h b/kernel/rcu/tree_plugin.h
index 7fd12039e512..1245610f689d 100644
--- a/kernel/rcu/tree_plugin.h
+++ b/kernel/rcu/tree_plugin.h
@@ -1854,8 +1854,8 @@ static void __wake_nocb_leader(struct rcu_data *rdp, bool force,
 		WRITE_ONCE(rdp_leader->nocb_leader_sleep, false);
 		del_timer(&rdp->nocb_timer);
 		raw_spin_unlock_irqrestore(&rdp->nocb_lock, flags);
-		smp_mb(); /* ->nocb_leader_sleep before swake_up(). */
-		swake_up(&rdp_leader->nocb_wq);
+		smp_mb(); /* ->nocb_leader_sleep before swake_up_one(). */
+		swake_up_one(&rdp_leader->nocb_wq);
 	} else {
 		raw_spin_unlock_irqrestore(&rdp->nocb_lock, flags);
 	}
@@ -2176,7 +2176,7 @@ static void nocb_leader_wait(struct rcu_data *my_rdp)
 		raw_spin_unlock_irqrestore(&rdp->nocb_lock, flags);
 		if (rdp != my_rdp && tail == &rdp->nocb_follower_head) {
 			/* List was empty, so wake up the follower.  */
-			swake_up(&rdp->nocb_wq);
+			swake_up_one(&rdp->nocb_wq);
 		}
 	}
diff --git a/kernel/sched/swait.c b/kernel/sched/swait.c
index b6fb2c3b3ff7..e471a58f565c 100644
--- a/kernel/sched/swait.c
+++ b/kernel/sched/swait.c
@@ -32,7 +32,7 @@ void swake_up_locked(struct swait_queue_head *q)
 }
 EXPORT_SYMBOL(swake_up_locked);
 
-void swake_up(struct swait_queue_head *q)
+void swake_up_one(struct swait_queue_head *q)
 {
 	unsigned long flags;
 
@@ -40,7 +40,7 @@ void swake_up(struct swait_queue_head *q)
 	swake_up_locked(q);
 	raw_spin_unlock_irqrestore(&q->lock, flags);
 }
-EXPORT_SYMBOL(swake_up);
+EXPORT_SYMBOL(swake_up_one);
 
 /*
  * Does not allow usage from IRQ disabled, since we must be able to
diff --git a/virt/kvm/arm/arm.c b/virt/kvm/arm/arm.c
index a4c1b76240df..172e82b75e3f 100644
--- a/virt/kvm/arm/arm.c
+++ b/virt/kvm/arm/arm.c
@@ -586,7 +586,7 @@ void kvm_arm_resume_guest(struct kvm *kvm)
 
 	kvm_for_each_vcpu(i, vcpu, kvm) {
 		vcpu->arch.pause = false;
-		swake_up(kvm_arch_vcpu_wq(vcpu));
+		swake_up_one(kvm_arch_vcpu_wq(vcpu));
 	}
 }
diff --git a/virt/kvm/arm/psci.c b/virt/kvm/arm/psci.c
index c4762bef13c6..18effcb166ad 100644
--- a/virt/kvm/arm/psci.c
+++ b/virt/kvm/arm/psci.c
@@ -155,7 +155,7 @@ static unsigned long kvm_psci_vcpu_on(struct kvm_vcpu *source_vcpu)
 	smp_mb();		/* Make sure the above is visible */
 
 	wq = kvm_arch_vcpu_wq(vcpu);
-	swake_up(wq);
+	swake_up_one(wq);
 
 	return PSCI_RET_SUCCESS;
 }
diff --git a/virt/kvm/async_pf.c b/virt/kvm/async_pf.c
index 57bcb27dcf30..23c2519c5b32 100644
--- a/virt/kvm/async_pf.c
+++ b/virt/kvm/async_pf.c
@@ -107,7 +107,7 @@ static void async_pf_execute(struct work_struct *work)
 	trace_kvm_async_pf_completed(addr, gva);
 
 	if (swq_has_sleeper(&vcpu->wq))
-		swake_up(&vcpu->wq);
+		swake_up_one(&vcpu->wq);
 
 	mmput(mm);
 	kvm_put_kvm(vcpu->kvm);
diff --git a/virt/kvm/kvm_main.c b/virt/kvm/kvm_main.c
index c7b2e927f699..8cbf1276ed2e 100644
--- a/virt/kvm/kvm_main.c
+++ b/virt/kvm/kvm_main.c
@@ -2197,7 +2197,7 @@ bool kvm_vcpu_wake_up(struct kvm_vcpu *vcpu)
 
 	wqp = kvm_arch_vcpu_wq(vcpu);
 	if (swq_has_sleeper(wqp)) {
-		swake_up(wqp);
+		swake_up_one(wqp);
 		++vcpu->stat.halt_wakeup;
 		return true;
 	}