| Message ID | 20240415210728.36949-1-victor@mojatatu.com (mailing list archive) |
|---|---|
| State | Accepted |
| Commit | 0f022d32c3eca477fbf79a205243a6123ed0fe11 |
| Delegated to | Netdev Maintainers |
| Series | [net] net/sched: Fix mirred deadlock on device recursion |
Hello:

This patch was applied to netdev/net.git (main)
by Jakub Kicinski <kuba@kernel.org>:

On Mon, 15 Apr 2024 18:07:28 -0300 you wrote:
> From: Eric Dumazet <edumazet@google.com>
>
> When the mirred action is used on a classful egress qdisc and a packet is
> mirrored or redirected to self we hit a qdisc lock deadlock.
> See trace below.
>
> [..... other info removed for brevity....]
> [ 82.890906]
> [ 82.890906] ============================================
> [ 82.890906] WARNING: possible recursive locking detected
> [ 82.890906] 6.8.0-05205-g77fadd89fe2d-dirty #213 Tainted: G        W
> [ 82.890906] --------------------------------------------
> [ 82.890906] ping/418 is trying to acquire lock:
> [ 82.890906] ffff888006994110 (&sch->q.lock){+.-.}-{3:3}, at:
> __dev_queue_xmit+0x1778/0x3550
> [ 82.890906]
> [ 82.890906] but task is already holding lock:
> [ 82.890906] ffff888006994110 (&sch->q.lock){+.-.}-{3:3}, at:
> __dev_queue_xmit+0x1778/0x3550
> [ 82.890906]
> [ 82.890906] other info that might help us debug this:
> [ 82.890906]  Possible unsafe locking scenario:
> [ 82.890906]
> [ 82.890906]        CPU0
> [ 82.890906]        ----
> [ 82.890906]   lock(&sch->q.lock);
> [ 82.890906]   lock(&sch->q.lock);
> [ 82.890906]
> [ 82.890906]  *** DEADLOCK ***
> [ 82.890906]
> [..... other info removed for brevity....]
>
> [...]

Here is the summary with links:
  - [net] net/sched: Fix mirred deadlock on device recursion
    https://git.kernel.org/netdev/net/c/0f022d32c3ec

You are awesome, thank you!
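As background for the discussion that follows: the applied fix (full diff at the end of this page) adds an owner field to struct Qdisc, records the CPU currently enqueuing into it, and drops any packet that re-enters the same qdisc from the same CPU instead of deadlocking on the non-recursive sch->q.lock. The sketch below is a minimal userspace model of that guard pattern, with a thread-local id standing in for smp_processor_id(); all names here are illustrative, not kernel APIs, and the placement of the owner writes is simplified relative to the real __dev_xmit_skb().

/* Userspace model of the owner-based recursion guard: record the
 * current execution context in q->owner before enqueuing; a
 * re-entrant call from the same context sees its own id and drops
 * the packet instead of deadlocking on the qdisc lock. */
#include <pthread.h>
#include <stdatomic.h>
#include <stdio.h>

struct qdisc_model {
	atomic_int owner;	/* -1: unowned, else id of the enqueuing context */
	pthread_mutex_t lock;	/* stands in for sch->q.lock */
};

static struct qdisc_model q = {
	.owner = -1,
	.lock = PTHREAD_MUTEX_INITIALIZER,
};

static _Thread_local int self_id;	/* stands in for smp_processor_id() */

static int model_enqueue(int depth)
{
	/* The guard added by the fix: already enqueuing on this qdisc? */
	if (atomic_load(&q.owner) == self_id) {
		printf("depth %d: recursion detected, packet dropped\n", depth);
		return -1;			/* NET_XMIT_DROP */
	}

	pthread_mutex_lock(&q.lock);
	atomic_store(&q.owner, self_id);

	/* A mirred redirect-to-self re-enters the transmit path here;
	 * without the guard the nested call would block forever on
	 * q.lock, which this context already holds. */
	if (depth == 0)
		model_enqueue(depth + 1);

	atomic_store(&q.owner, -1);
	pthread_mutex_unlock(&q.lock);
	return 0;				/* NET_XMIT_SUCCESS */
}

int main(void)
{
	self_id = 0;
	model_enqueue(0);	/* depth 1 hits the guard, no deadlock */
	return 0;
}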
Hi all,

I noticed today that this causes a userspace visible change in behaviour
(and a regression in some of our tests) for transmitting to a device
when it has no carrier, when noop_qdisc is assigned to it. Instead of
silently dropping the packets, -ENOBUFS will be returned if the socket
opted in to RECVERR.

The reason for this is that the static noop_qdisc:

struct Qdisc noop_qdisc = {
	.enqueue	=	noop_enqueue,
	.dequeue	=	noop_dequeue,
	.flags		=	TCQ_F_BUILTIN,
	.ops		=	&noop_qdisc_ops,
	.q.lock		=	__SPIN_LOCK_UNLOCKED(noop_qdisc.q.lock),
	.dev_queue	=	&noop_netdev_queue,
	.busylock	=	__SPIN_LOCK_UNLOCKED(noop_qdisc.busylock),
	.gso_skb = {
		.next = (struct sk_buff *)&noop_qdisc.gso_skb,
		.prev = (struct sk_buff *)&noop_qdisc.gso_skb,
		.qlen = 0,
		.lock = __SPIN_LOCK_UNLOCKED(noop_qdisc.gso_skb.lock),
	},
	.skb_bad_txq = {
		.next = (struct sk_buff *)&noop_qdisc.skb_bad_txq,
		.prev = (struct sk_buff *)&noop_qdisc.skb_bad_txq,
		.qlen = 0,
		.lock = __SPIN_LOCK_UNLOCKED(noop_qdisc.skb_bad_txq.lock),
	},
};

doesn't have an owner set, and it's obviously not allocated via
qdisc_alloc(). Thus, it defaults to 0, so if you get to it on CPU 0 (I
was using ARCH=um which isn't even SMP) then it will just always run
into the

> +	if (unlikely(READ_ONCE(q->owner) == smp_processor_id())) {
> +		kfree_skb_reason(skb, SKB_DROP_REASON_TC_RECLASSIFY_LOOP);
> +		return NET_XMIT_DROP;
> +	}

case.

I'm not sure I understand the busylock logic well enough, so almost
seems to me we shouldn't do this whole thing on the noop_qdisc at all,
e.g. via tagging owner with -2 to say don't do it:

--- a/net/core/dev.c
+++ b/net/core/dev.c
@@ -3865,9 +3865,11 @@ static inline int __dev_xmit_skb(struct sk_buff *skb, struct Qdisc *q,
 		qdisc_run_end(q);
 		rc = NET_XMIT_SUCCESS;
 	} else {
-		WRITE_ONCE(q->owner, smp_processor_id());
+		if (q->owner != -2)
+			WRITE_ONCE(q->owner, smp_processor_id());
 		rc = dev_qdisc_enqueue(skb, q, &to_free, txq);
-		WRITE_ONCE(q->owner, -1);
+		if (q->owner != -2)
+			WRITE_ONCE(q->owner, -1);
 		if (qdisc_run_begin(q)) {
 			if (unlikely(contended)) {
 				spin_unlock(&q->busylock);
diff --git a/net/sched/sch_generic.c b/net/sched/sch_generic.c
index 2a637a17061b..e857e4638671 100644
--- a/net/sched/sch_generic.c
+++ b/net/sched/sch_generic.c
@@ -657,6 +657,7 @@ static struct netdev_queue noop_netdev_queue = {
 };
 
 struct Qdisc noop_qdisc = {
+	.owner = -2,
 	.enqueue	=	noop_enqueue,
 	.dequeue	=	noop_dequeue,
 	.flags		=	TCQ_F_BUILTIN,

(and yes, I believe it doesn't need to be READ_ONCE for the check
against -2 since that's mutually exclusive with all other values)

Or maybe simply ignoring the value for the noop_qdisc:

--- a/net/core/dev.c
+++ b/net/core/dev.c
@@ -3822,7 +3822,7 @@ static inline int __dev_xmit_skb(struct sk_buff *skb, struct Qdisc *q,
 		return rc;
 	}
 
-	if (unlikely(READ_ONCE(q->owner) == smp_processor_id())) {
+	if (unlikely(q != &noop_qdisc && READ_ONCE(q->owner) == smp_processor_id())) {
 		kfree_skb_reason(skb, SKB_DROP_REASON_TC_RECLASSIFY_LOOP);
 		return NET_XMIT_DROP;
 	}

That's shorter, but I'm not sure if there might be other special
cases...

Or maybe someone can think of an even better fix?

Thanks,
johannes
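A quick aside on why owner ends up as 0 in the static definition above: C zero-initializes every member omitted from a designated initializer, so the build silently gives noop_qdisc an owner of 0, which is also a valid CPU number; hence the unconditional match on CPU 0. A standalone illustration of just that language rule (the struct here is an invented stand-in, not the kernel's):

#include <stdio.h>

struct qdisc_model {
	int owner;
	int flags;
};

/* Members not named in a designated initializer are zero-initialized,
 * so .owner becomes 0 -- indistinguishable from "owned by CPU 0". */
static struct qdisc_model noop = {
	.flags = 1,	/* only some fields set, as with the real noop_qdisc */
};

int main(void)
{
	printf("noop.owner = %d\n", noop.owner);	/* prints 0 */
	return 0;
}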
On Fri, Jun 7, 2024 at 4:40 PM Johannes Berg <johannes@sipsolutions.net> wrote:
>
> Hi all,
>
> I noticed today that this causes a userspace visible change in behaviour
> (and a regression in some of our tests) for transmitting to a device
> when it has no carrier, when noop_qdisc is assigned to it. Instead of
> silently dropping the packets, -ENOBUFS will be returned if the socket
> opted in to RECVERR.
>
> [...]
>
> That's shorter, but I'm not sure if there might be other special
> cases...
>
> Or maybe someone can think of an even better fix?

Why not simply initialize noop_qdisc.owner to -1 ?
On Fri, 2024-06-07 at 16:54 +0200, Eric Dumazet wrote:
> >
> > I'm not sure I understand the busylock logic well enough
>
> Why not simply initialize noop_qdisc.owner to -1 ?

I didn't understand the locking logic, so I was worried you could still
have it be used in parallel since it can be assigned to any number of
devices.

johannes
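To make the trade-off concrete: a static .owner = -1 (Eric's suggestion) can never equal a processor id, so an idle noop_qdisc cannot trip the guard; the residual question Johannes raises is that owner is still written during every enqueue through the shared noop_qdisc, so a second transmit on the same CPU through another device using it would, if it could interleave inside that window at all (exactly the locking question left open above), see its own id and take the drop path. A sketch of both halves, purely as a model of that worry, not a claim that the interleaving is reachable:

#include <stdio.h>

struct qdisc_model {
	int owner;
};

/* Eric's suggestion modeled: -1 is not a valid CPU id, so the
 * static qdisc no longer matches on CPU 0 at rest. */
static struct qdisc_model noop = {
	.owner = -1,
};

/* models: READ_ONCE(q->owner) == smp_processor_id() */
static int guard_fires(const struct qdisc_model *q, int cpu)
{
	return q->owner == cpu;
}

int main(void)
{
	/* The CPU-0 false positive from the earlier mail is gone. */
	printf("idle qdisc, cpu 0: %d\n", guard_fires(&noop, 0));	/* 0 */

	/* Johannes' worry: owner is shared by every device that has
	 * noop_qdisc assigned, so an in-flight enqueue publishes a
	 * real CPU id that a same-CPU transmit could observe. */
	noop.owner = 0;	/* another device's in-flight enqueue on CPU 0 */
	printf("in-flight, cpu 0: %d\n", guard_fires(&noop, 0));	/* 1 */
	return 0;
}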
diff --git a/include/net/sch_generic.h b/include/net/sch_generic.h
index 76db6be16083..f561dfb79743 100644
--- a/include/net/sch_generic.h
+++ b/include/net/sch_generic.h
@@ -117,6 +117,7 @@ struct Qdisc {
 	struct qdisc_skb_head	q;
 	struct gnet_stats_basic_sync bstats;
 	struct gnet_stats_queue	qstats;
+	int			owner;
 	unsigned long		state;
 	unsigned long		state2; /* must be written under qdisc spinlock */
 	struct Qdisc		*next_sched;
diff --git a/net/core/dev.c b/net/core/dev.c
index 854a3a28a8d8..f6c6e494f0a9 100644
--- a/net/core/dev.c
+++ b/net/core/dev.c
@@ -3808,6 +3808,10 @@ static inline int __dev_xmit_skb(struct sk_buff *skb, struct Qdisc *q,
 		return rc;
 	}
 
+	if (unlikely(READ_ONCE(q->owner) == smp_processor_id())) {
+		kfree_skb_reason(skb, SKB_DROP_REASON_TC_RECLASSIFY_LOOP);
+		return NET_XMIT_DROP;
+	}
 	/*
 	 * Heuristic to force contended enqueues to serialize on a
 	 * separate lock before trying to get qdisc main lock.
@@ -3847,7 +3851,9 @@ static inline int __dev_xmit_skb(struct sk_buff *skb, struct Qdisc *q,
 		qdisc_run_end(q);
 		rc = NET_XMIT_SUCCESS;
 	} else {
+		WRITE_ONCE(q->owner, smp_processor_id());
 		rc = dev_qdisc_enqueue(skb, q, &to_free, txq);
+		WRITE_ONCE(q->owner, -1);
 		if (qdisc_run_begin(q)) {
 			if (unlikely(contended)) {
 				spin_unlock(&q->busylock);
diff --git a/net/sched/sch_generic.c b/net/sched/sch_generic.c
index ff5336493777..4a2c763e2d11 100644
--- a/net/sched/sch_generic.c
+++ b/net/sched/sch_generic.c
@@ -974,6 +974,7 @@ struct Qdisc *qdisc_alloc(struct netdev_queue *dev_queue,
 	sch->enqueue = ops->enqueue;
 	sch->dequeue = ops->dequeue;
 	sch->dev_queue = dev_queue;
+	sch->owner = -1;
 	netdev_hold(dev, &sch->dev_tracker, GFP_KERNEL);
 	refcount_set(&sch->refcnt, 1);