Message ID | 20240513100246.85173-1-jianbol@nvidia.com (mailing list archive) |
---|---|
State | Changes Requested |
Delegated to: | Netdev Maintainers |
Headers | show |
Series | [net] net: drop secpath extension before skb deferral free | expand |
On Mon, May 13, 2024 at 12:04 PM Jianbo Liu <jianbol@nvidia.com> wrote: > > In commit 68822bdf76f1 ("net: generalize skb freeing deferral to > per-cpu lists"), skb can be queued on remote cpu list for deferral > free. > > The remote cpu is kicked if the queue reaches half capacity. As > mentioned in the patch, this seems very unlikely to trigger > NET_RX_SOFTIRQ on the remote CPU in this way. But that seems not true, > we actually saw something that indicates this: skb is not freed > immediately, or even kept for a long time. And the possibility is > increased if there are more cpu cores. > > As skb is not freed, its extension is not freed as well. An error > occurred while unloading the driver after running TCP traffic with > IPsec, where both crypto and packet were offloaded. However, in the > case of crypto offload, this failure was rare and significantly more > challenging to replicate. > > unregister_netdevice: waiting for eth2 to become free. Usage count = 2 > ref_tracker: eth%d@000000007421424b has 1/1 users at > xfrm_dev_state_add+0xe5/0x4d0 > xfrm_add_sa+0xc5c/0x11e0 > xfrm_user_rcv_msg+0xfa/0x240 > netlink_rcv_skb+0x54/0x100 > xfrm_netlink_rcv+0x31/0x40 > netlink_unicast+0x1fc/0x2c0 > netlink_sendmsg+0x232/0x4a0 > __sock_sendmsg+0x38/0x60 > ____sys_sendmsg+0x1e3/0x200 > ___sys_sendmsg+0x80/0xc0 > __sys_sendmsg+0x51/0x90 > do_syscall_64+0x40/0xe0 > entry_SYSCALL_64_after_hwframe+0x46/0x4e > > The ref_tracker shows the netdev is hold when the offloading xfrm > state is first added to hardware. When receiving packet, the secpath > extension, which saves xfrm state, is added to skb by ipsec offload, > and the xfrm state is hence hold by the received skb. It can't be > flushed till skb is dequeued from the defer list, then skb and its > extension are really freed. Also, the netdev can't be unregistered > because it still referred by xfrm state. > > To fix this issue, drop this extension before skb is queued to the > defer list, so xfrm state destruction is not blocked. > > Fixes: 68822bdf76f1 ("net: generalize skb freeing deferral to per-cpu lists") > Signed-off-by: Jianbo Liu <jianbol@nvidia.com> > Reviewed-by: Leon Romanovsky <leonro@nvidia.com> > --- This attribution and patch seem wrong. Also you should CC XFRM maintainers. Before being freed from tcp_recvmsg() path, packets can sit in TCP receive queues for arbitrary amounts of time. secpath_reset() should be called much earlier than in the code you tried to change. If XFRM state can stick to packets stored in protocol queues, we have a much bigger problem. I suspect all callers of nf_reset_ct() need a fix of some kind.
On Mon, 2024-05-13 at 12:29 +0200, Eric Dumazet wrote: > On Mon, May 13, 2024 at 12:04 PM Jianbo Liu <jianbol@nvidia.com> wrote: > > > > In commit 68822bdf76f1 ("net: generalize skb freeing deferral to > > per-cpu lists"), skb can be queued on remote cpu list for deferral > > free. > > > > The remote cpu is kicked if the queue reaches half capacity. As > > mentioned in the patch, this seems very unlikely to trigger > > NET_RX_SOFTIRQ on the remote CPU in this way. But that seems not > > true, > > we actually saw something that indicates this: skb is not freed > > immediately, or even kept for a long time. And the possibility is > > increased if there are more cpu cores. > > > > As skb is not freed, its extension is not freed as well. An error > > occurred while unloading the driver after running TCP traffic with > > IPsec, where both crypto and packet were offloaded. However, in the > > case of crypto offload, this failure was rare and significantly more > > challenging to replicate. > > > > unregister_netdevice: waiting for eth2 to become free. Usage count = > > 2 > > ref_tracker: eth%d@000000007421424b has 1/1 users at > > xfrm_dev_state_add+0xe5/0x4d0 > > xfrm_add_sa+0xc5c/0x11e0 > > xfrm_user_rcv_msg+0xfa/0x240 > > netlink_rcv_skb+0x54/0x100 > > xfrm_netlink_rcv+0x31/0x40 > > netlink_unicast+0x1fc/0x2c0 > > netlink_sendmsg+0x232/0x4a0 > > __sock_sendmsg+0x38/0x60 > > ____sys_sendmsg+0x1e3/0x200 > > ___sys_sendmsg+0x80/0xc0 > > __sys_sendmsg+0x51/0x90 > > do_syscall_64+0x40/0xe0 > > entry_SYSCALL_64_after_hwframe+0x46/0x4e > > > > The ref_tracker shows the netdev is hold when the offloading xfrm > > state is first added to hardware. When receiving packet, the secpath > > extension, which saves xfrm state, is added to skb by ipsec offload, > > and the xfrm state is hence hold by the received skb. It can't be > > flushed till skb is dequeued from the defer list, then skb and its > > extension are really freed. Also, the netdev can't be unregistered > > because it still referred by xfrm state. > > > > To fix this issue, drop this extension before skb is queued to the > > defer list, so xfrm state destruction is not blocked. > > > > Fixes: 68822bdf76f1 ("net: generalize skb freeing deferral to per-cpu > > lists") > > Signed-off-by: Jianbo Liu <jianbol@nvidia.com> > > Reviewed-by: Leon Romanovsky <leonro@nvidia.com> > > --- > > > This attribution and patch seem wrong. Also you should CC XFRM > maintainers. > > Before being freed from tcp_recvmsg() path, packets can sit in TCP > receive queues for arbitrary amounts of time. > > secpath_reset() should be called much earlier than in the code you > tried to change. Yes, this also fixed the issue if I moved secpatch_reset() before tcp_v4_do_rcv(). --- a/net/ipv4/tcp_ipv4.c +++ b/net/ipv4/tcp_ipv4.c @@ -2314,6 +2314,7 @@ int tcp_v4_rcv(struct sk_buff *skb) tcp_v4_fill_cb(skb, iph, th); skb->dev = NULL; + secpath_reset(skb); if (sk->sk_state == TCP_LISTEN) { ret = tcp_v4_do_rcv(sk, skb); Do you want me to send v2, or push a new one if you agree with this change? Thanks! Jianbo > > If XFRM state can stick to packets stored in protocol queues, we have > a much bigger problem. > > I suspect all callers of nf_reset_ct() need a fix of some kind.
On Tue, May 14, 2024 at 9:37 AM Jianbo Liu <jianbol@nvidia.com> wrote: > > On Mon, 2024-05-13 at 12:29 +0200, Eric Dumazet wrote: > > On Mon, May 13, 2024 at 12:04 PM Jianbo Liu <jianbol@nvidia.com> wrote: > > > > > > In commit 68822bdf76f1 ("net: generalize skb freeing deferral to > > > per-cpu lists"), skb can be queued on remote cpu list for deferral > > > free. > > > > > > The remote cpu is kicked if the queue reaches half capacity. As > > > mentioned in the patch, this seems very unlikely to trigger > > > NET_RX_SOFTIRQ on the remote CPU in this way. But that seems not > > > true, > > > we actually saw something that indicates this: skb is not freed > > > immediately, or even kept for a long time. And the possibility is > > > increased if there are more cpu cores. > > > > > > As skb is not freed, its extension is not freed as well. An error > > > occurred while unloading the driver after running TCP traffic with > > > IPsec, where both crypto and packet were offloaded. However, in the > > > case of crypto offload, this failure was rare and significantly more > > > challenging to replicate. > > > > > > unregister_netdevice: waiting for eth2 to become free. Usage count = > > > 2 > > > ref_tracker: eth%d@000000007421424b has 1/1 users at > > > xfrm_dev_state_add+0xe5/0x4d0 > > > xfrm_add_sa+0xc5c/0x11e0 > > > xfrm_user_rcv_msg+0xfa/0x240 > > > netlink_rcv_skb+0x54/0x100 > > > xfrm_netlink_rcv+0x31/0x40 > > > netlink_unicast+0x1fc/0x2c0 > > > netlink_sendmsg+0x232/0x4a0 > > > __sock_sendmsg+0x38/0x60 > > > ____sys_sendmsg+0x1e3/0x200 > > > ___sys_sendmsg+0x80/0xc0 > > > __sys_sendmsg+0x51/0x90 > > > do_syscall_64+0x40/0xe0 > > > entry_SYSCALL_64_after_hwframe+0x46/0x4e > > > > > > The ref_tracker shows the netdev is hold when the offloading xfrm > > > state is first added to hardware. When receiving packet, the secpath > > > extension, which saves xfrm state, is added to skb by ipsec offload, > > > and the xfrm state is hence hold by the received skb. It can't be > > > flushed till skb is dequeued from the defer list, then skb and its > > > extension are really freed. Also, the netdev can't be unregistered > > > because it still referred by xfrm state. > > > > > > To fix this issue, drop this extension before skb is queued to the > > > defer list, so xfrm state destruction is not blocked. > > > > > > Fixes: 68822bdf76f1 ("net: generalize skb freeing deferral to per-cpu > > > lists") > > > Signed-off-by: Jianbo Liu <jianbol@nvidia.com> > > > Reviewed-by: Leon Romanovsky <leonro@nvidia.com> > > > --- > > > > > > This attribution and patch seem wrong. Also you should CC XFRM > > maintainers. > > > > Before being freed from tcp_recvmsg() path, packets can sit in TCP > > receive queues for arbitrary amounts of time. > > > > secpath_reset() should be called much earlier than in the code you > > tried to change. > > Yes, this also fixed the issue if I moved secpatch_reset() before > tcp_v4_do_rcv(). > > --- a/net/ipv4/tcp_ipv4.c > +++ b/net/ipv4/tcp_ipv4.c > @@ -2314,6 +2314,7 @@ int tcp_v4_rcv(struct sk_buff *skb) > tcp_v4_fill_cb(skb, iph, th); > > skb->dev = NULL; > + secpath_reset(skb); > > if (sk->sk_state == TCP_LISTEN) { > ret = tcp_v4_do_rcv(sk, skb); > > Do you want me to send v2, or push a new one if you agree with this > change? That would only care about TCP and IPv4. I think we need a full fix, not a partial work around to an immediate problem. Can we have some feedback from Steffen, I wonder if we missed something really obvious. It is hard to believe this has been broken for such a long time. I think the issue came with commit d77e38e612a017480157fe6d2c1422f42cb5b7e3 Author: Steffen Klassert <steffen.klassert@secunet.com> Date: Fri Apr 14 10:06:10 2017 +0200 xfrm: Add an IPsec hardware offloading API This patch adds all the bits that are needed to do IPsec hardware offload for IPsec states and ESP packets. We add xfrmdev_ops to the net_device. xfrmdev_ops has function pointers that are needed to manage the xfrm states in the hardware and to do a per packet offloading decision. Joint work with: Ilan Tayari <ilant@mellanox.com> Guy Shapiro <guysh@mellanox.com> Yossi Kuperman <yossiku@mellanox.com> Signed-off-by: Guy Shapiro <guysh@mellanox.com> Signed-off-by: Ilan Tayari <ilant@mellanox.com> Signed-off-by: Yossi Kuperman <yossiku@mellanox.com> Signed-off-by: Steffen Klassert <steffen.klassert@secunet.com> We should probably handle NETDEV_DOWN/NETDEV_UNREGISTER better, instead of adding secpath_reset(skb) there and there.
On Tue, 2024-05-14 at 10:51 +0200, Eric Dumazet wrote: > On Tue, May 14, 2024 at 9:37 AM Jianbo Liu <jianbol@nvidia.com> > wrote: > > > > On Mon, 2024-05-13 at 12:29 +0200, Eric Dumazet wrote: > > > On Mon, May 13, 2024 at 12:04 PM Jianbo Liu <jianbol@nvidia.com> > > > wrote: > > > > > > > > ... > > > > > > > > > This attribution and patch seem wrong. Also you should CC XFRM > > > maintainers. > > > > > > Before being freed from tcp_recvmsg() path, packets can sit in > > > TCP > > > receive queues for arbitrary amounts of time. > > > > > > secpath_reset() should be called much earlier than in the code > > > you > > > tried to change. > > > > Yes, this also fixed the issue if I moved secpatch_reset() before > > tcp_v4_do_rcv(). > > > > --- a/net/ipv4/tcp_ipv4.c > > +++ b/net/ipv4/tcp_ipv4.c > > @@ -2314,6 +2314,7 @@ int tcp_v4_rcv(struct sk_buff *skb) > > tcp_v4_fill_cb(skb, iph, th); > > > > skb->dev = NULL; > > + secpath_reset(skb); > > > > if (sk->sk_state == TCP_LISTEN) { > > ret = tcp_v4_do_rcv(sk, skb); > > > > Do you want me to send v2, or push a new one if you agree with this > > change? > > That would only care about TCP and IPv4. > > I think we need a full fix, not a partial work around to an immediate > problem. > Do you want to fix the issue if skb with secpath extension is hold in protocol queues? But, is that possible? > Can we have some feedback from Steffen, I wonder if we missed > something really obvious. > > It is hard to believe this has been broken for such a long time. > > I think the issue came with > > commit d77e38e612a017480157fe6d2c1422f42cb5b7e3 > Author: Steffen Klassert <steffen.klassert@secunet.com> > Date: Fri Apr 14 10:06:10 2017 +0200 > > xfrm: Add an IPsec hardware offloading API > > This patch adds all the bits that are needed to do > IPsec hardware offload for IPsec states and ESP packets. > We add xfrmdev_ops to the net_device. xfrmdev_ops has > function pointers that are needed to manage the xfrm > states in the hardware and to do a per packet > offloading decision. > > Joint work with: > Ilan Tayari <ilant@mellanox.com> > Guy Shapiro <guysh@mellanox.com> > Yossi Kuperman <yossiku@mellanox.com> > > Signed-off-by: Guy Shapiro <guysh@mellanox.com> > Signed-off-by: Ilan Tayari <ilant@mellanox.com> > Signed-off-by: Yossi Kuperman <yossiku@mellanox.com> > Signed-off-by: Steffen Klassert <steffen.klassert@secunet.com> > > We should probably handle NETDEV_DOWN/NETDEV_UNREGISTER better, > And I think we also need to replace xfrm_state_put() with xfrm_state_put_sync() in xfrm_dev_state_flush(). > instead of adding secpath_reset(skb) there and there.
On Tue, 2024-05-14 at 10:51 +0200, Eric Dumazet wrote: > On Tue, May 14, 2024 at 9:37 AM Jianbo Liu <jianbol@nvidia.com> > wrote: > > > > On Mon, 2024-05-13 at 12:29 +0200, Eric Dumazet wrote: > > > On Mon, May 13, 2024 at 12:04 PM Jianbo Liu <jianbol@nvidia.com> > > > wrote: > > > > > > ... > > > This attribution and patch seem wrong. Also you should CC XFRM > > > maintainers. > > > > > > Before being freed from tcp_recvmsg() path, packets can sit in > > > TCP > > > receive queues for arbitrary amounts of time. > > > > > > secpath_reset() should be called much earlier than in the code > > > you > > > tried to change. > > > > Yes, this also fixed the issue if I moved secpatch_reset() before > > tcp_v4_do_rcv(). > > > > --- a/net/ipv4/tcp_ipv4.c > > +++ b/net/ipv4/tcp_ipv4.c > > @@ -2314,6 +2314,7 @@ int tcp_v4_rcv(struct sk_buff *skb) > > tcp_v4_fill_cb(skb, iph, th); > > > > skb->dev = NULL; > > + secpath_reset(skb); > > > > if (sk->sk_state == TCP_LISTEN) { > > ret = tcp_v4_do_rcv(sk, skb); > > > > Do you want me to send v2, or push a new one if you agree with this > > change? > > That would only care about TCP and IPv4. > > I think we need a full fix, not a partial work around to an immediate > problem. > > Can we have some feedback from Steffen, I wonder if we missed > something really obvious. > > It is hard to believe this has been broken for such a long time. Could you please give me some suggestions? Should I add new function to reset both ct and secpath, and replace nf_reset_ct() where necessary on receive flow? > > I think the issue came with > > commit d77e38e612a017480157fe6d2c1422f42cb5b7e3 > Author: Steffen Klassert <steffen.klassert@secunet.com> > Date: Fri Apr 14 10:06:10 2017 +0200 > > xfrm: Add an IPsec hardware offloading API > > This patch adds all the bits that are needed to do > IPsec hardware offload for IPsec states and ESP packets. > We add xfrmdev_ops to the net_device. xfrmdev_ops has > function pointers that are needed to manage the xfrm > states in the hardware and to do a per packet > offloading decision. > > Joint work with: > Ilan Tayari <ilant@mellanox.com> > Guy Shapiro <guysh@mellanox.com> > Yossi Kuperman <yossiku@mellanox.com> > > Signed-off-by: Guy Shapiro <guysh@mellanox.com> > Signed-off-by: Ilan Tayari <ilant@mellanox.com> > Signed-off-by: Yossi Kuperman <yossiku@mellanox.com> > Signed-off-by: Steffen Klassert <steffen.klassert@secunet.com> > > We should probably handle NETDEV_DOWN/NETDEV_UNREGISTER better, > instead of adding secpath_reset(skb) there and there.
On Mon, May 20, 2024 at 10:06:24AM +0000, Jianbo Liu wrote: > On Tue, 2024-05-14 at 10:51 +0200, Eric Dumazet wrote: > > > > I think we need a full fix, not a partial work around to an immediate > > problem. > > > > Can we have some feedback from Steffen, I wonder if we missed > > something really obvious. > > > > It is hard to believe this has been broken for such a long time. > > Could you please give me some suggestions? Sorry for the delay, I've just returned from vacation. I'll have a look at it.
On Mon, May 20, 2024 at 10:06:24AM +0000, Jianbo Liu wrote: > On Tue, 2024-05-14 at 10:51 +0200, Eric Dumazet wrote: > > On Tue, May 14, 2024 at 9:37 AM Jianbo Liu <jianbol@nvidia.com> > > wrote: > > > > > > On Mon, 2024-05-13 at 12:29 +0200, Eric Dumazet wrote: > > > > On Mon, May 13, 2024 at 12:04 PM Jianbo Liu <jianbol@nvidia.com> > > > > wrote: > > > > > > > > > ... > > > > This attribution and patch seem wrong. Also you should CC XFRM > > > > maintainers. > > > > > > > > Before being freed from tcp_recvmsg() path, packets can sit in > > > > TCP > > > > receive queues for arbitrary amounts of time. > > > > > > > > secpath_reset() should be called much earlier than in the code > > > > you > > > > tried to change. > > > > > > Yes, this also fixed the issue if I moved secpatch_reset() before > > > tcp_v4_do_rcv(). > > > > > > --- a/net/ipv4/tcp_ipv4.c > > > +++ b/net/ipv4/tcp_ipv4.c > > > @@ -2314,6 +2314,7 @@ int tcp_v4_rcv(struct sk_buff *skb) > > > tcp_v4_fill_cb(skb, iph, th); > > > > > > skb->dev = NULL; > > > + secpath_reset(skb); > > > > > > if (sk->sk_state == TCP_LISTEN) { > > > ret = tcp_v4_do_rcv(sk, skb); > > > > > > Do you want me to send v2, or push a new one if you agree with this > > > change? > > > > That would only care about TCP and IPv4. > > > > I think we need a full fix, not a partial work around to an immediate > > problem. > > > > Can we have some feedback from Steffen, I wonder if we missed > > something really obvious. > > > > It is hard to believe this has been broken for such a long time. > > Could you please give me some suggestions? > Should I add new function to reset both ct and secpath, and replace > nf_reset_ct() where necessary on receive flow? Maybe we should directly remove the device from the xfrm_state when the decice goes down, this should catch all the cases. I think about something like this (untested) patch: diff --git a/net/xfrm/xfrm_state.c b/net/xfrm/xfrm_state.c index 0c306473a79d..ba402275ab57 100644 --- a/net/xfrm/xfrm_state.c +++ b/net/xfrm/xfrm_state.c @@ -867,7 +867,11 @@ int xfrm_dev_state_flush(struct net *net, struct net_device *dev, bool task_vali xfrm_state_hold(x); spin_unlock_bh(&net->xfrm.xfrm_state_lock); - err = xfrm_state_delete(x); + spin_lock_bh(&x->lock); + err = __xfrm_state_delete(x); + xfrm_dev_state_free(x); + spin_unlock_bh(&x->lock); + xfrm_audit_state_delete(x, err ? 0 : 1, task_valid); xfrm_state_put(x); The secpath is still attached to all skbs, but the hang on device unregister should go away.
On Wed, May 22, 2024 at 11:34 AM Steffen Klassert <steffen.klassert@secunet.com> wrote: > > On Mon, May 20, 2024 at 10:06:24AM +0000, Jianbo Liu wrote: > > On Tue, 2024-05-14 at 10:51 +0200, Eric Dumazet wrote: > > > On Tue, May 14, 2024 at 9:37 AM Jianbo Liu <jianbol@nvidia.com> > > > wrote: > > > > > > > > On Mon, 2024-05-13 at 12:29 +0200, Eric Dumazet wrote: > > > > > On Mon, May 13, 2024 at 12:04 PM Jianbo Liu <jianbol@nvidia.com> > > > > > wrote: > > > > > > > > > > > > ... > > > > > This attribution and patch seem wrong. Also you should CC XFRM > > > > > maintainers. > > > > > > > > > > Before being freed from tcp_recvmsg() path, packets can sit in > > > > > TCP > > > > > receive queues for arbitrary amounts of time. > > > > > > > > > > secpath_reset() should be called much earlier than in the code > > > > > you > > > > > tried to change. > > > > > > > > Yes, this also fixed the issue if I moved secpatch_reset() before > > > > tcp_v4_do_rcv(). > > > > > > > > --- a/net/ipv4/tcp_ipv4.c > > > > +++ b/net/ipv4/tcp_ipv4.c > > > > @@ -2314,6 +2314,7 @@ int tcp_v4_rcv(struct sk_buff *skb) > > > > tcp_v4_fill_cb(skb, iph, th); > > > > > > > > skb->dev = NULL; > > > > + secpath_reset(skb); > > > > > > > > if (sk->sk_state == TCP_LISTEN) { > > > > ret = tcp_v4_do_rcv(sk, skb); > > > > > > > > Do you want me to send v2, or push a new one if you agree with this > > > > change? > > > > > > That would only care about TCP and IPv4. > > > > > > I think we need a full fix, not a partial work around to an immediate > > > problem. > > > > > > Can we have some feedback from Steffen, I wonder if we missed > > > something really obvious. > > > > > > It is hard to believe this has been broken for such a long time. > > > > Could you please give me some suggestions? > > Should I add new function to reset both ct and secpath, and replace > > nf_reset_ct() where necessary on receive flow? > > Maybe we should directly remove the device from the xfrm_state > when the decice goes down, this should catch all the cases. > > I think about something like this (untested) patch: > > diff --git a/net/xfrm/xfrm_state.c b/net/xfrm/xfrm_state.c > index 0c306473a79d..ba402275ab57 100644 > --- a/net/xfrm/xfrm_state.c > +++ b/net/xfrm/xfrm_state.c > @@ -867,7 +867,11 @@ int xfrm_dev_state_flush(struct net *net, struct net_device *dev, bool task_vali > xfrm_state_hold(x); > spin_unlock_bh(&net->xfrm.xfrm_state_lock); > > - err = xfrm_state_delete(x); > + spin_lock_bh(&x->lock); > + err = __xfrm_state_delete(x); > + xfrm_dev_state_free(x); > + spin_unlock_bh(&x->lock); > + > xfrm_audit_state_delete(x, err ? 0 : 1, > task_valid); > xfrm_state_put(x); > > The secpath is still attached to all skbs, but the hang on device > unregister should go away. Seems fine, but I wonder if we need some READ_ONCE(xso->dev) to avoid races then ? diff --git a/include/net/xfrm.h b/include/net/xfrm.h index 77ebf5bcf0b901b7b70875b3ccb5cead14ab1602..55eb5898e9a4263048b70d5a0d5dc274ab784c0f 100644 --- a/include/net/xfrm.h +++ b/include/net/xfrm.h @@ -1950,9 +1950,10 @@ bool xfrm_dev_offload_ok(struct sk_buff *skb, struct xfrm_state *x); static inline void xfrm_dev_state_advance_esn(struct xfrm_state *x) { struct xfrm_dev_offload *xso = &x->xso; + struct net_device *dev = READ_ONCE(xso->dev); - if (xso->dev && xso->dev->xfrmdev_ops->xdo_dev_state_advance_esn) - xso->dev->xfrmdev_ops->xdo_dev_state_advance_esn(x); + if (dev && dev->xfrmdev_ops->xdo_dev_state_advance_esn) + dev->xfrmdev_ops->xdo_dev_state_advance_esn(x); } static inline bool xfrm_dst_offload_ok(struct dst_entry *dst) @@ -1976,20 +1977,21 @@ static inline bool xfrm_dst_offload_ok(struct dst_entry *dst) static inline void xfrm_dev_state_delete(struct xfrm_state *x) { struct xfrm_dev_offload *xso = &x->xso; + struct net_device *dev = READ_ONCE(xso->dev); - if (xso->dev) - xso->dev->xfrmdev_ops->xdo_dev_state_delete(x); + if (dev) + dev->xfrmdev_ops->xdo_dev_state_delete(x); } static inline void xfrm_dev_state_free(struct xfrm_state *x) { struct xfrm_dev_offload *xso = &x->xso; - struct net_device *dev = xso->dev; + struct net_device *dev = READ_ONCE(xso->dev); if (dev && dev->xfrmdev_ops) { if (dev->xfrmdev_ops->xdo_dev_state_free) dev->xfrmdev_ops->xdo_dev_state_free(x); - xso->dev = NULL; + WRITE_ONCE(xso->dev, NULL); xso->type = XFRM_DEV_OFFLOAD_UNSPECIFIED; netdev_put(dev, &xso->dev_tracker); }
On Wed, 2024-05-22 at 11:34 +0200, Steffen Klassert wrote: > On Mon, May 20, 2024 at 10:06:24AM +0000, Jianbo Liu wrote: > > On Tue, 2024-05-14 at 10:51 +0200, Eric Dumazet wrote: > > > On Tue, May 14, 2024 at 9:37 AM Jianbo Liu <jianbol@nvidia.com> > > > wrote: > > > > > > > > On Mon, 2024-05-13 at 12:29 +0200, Eric Dumazet wrote: > > > > > On Mon, May 13, 2024 at 12:04 PM Jianbo Liu < > > > > > jianbol@nvidia.com> > > > > > wrote: > > > > > > > > > > > > ... > > > > > This attribution and patch seem wrong. Also you should CC > > > > > XFRM > > > > > maintainers. > > > > > > > > > > Before being freed from tcp_recvmsg() path, packets can sit > > > > > in > > > > > TCP > > > > > receive queues for arbitrary amounts of time. > > > > > > > > > > secpath_reset() should be called much earlier than in the > > > > > code > > > > > you > > > > > tried to change. > > > > > > > > Yes, this also fixed the issue if I moved secpatch_reset() > > > > before > > > > tcp_v4_do_rcv(). > > > > > > > > --- a/net/ipv4/tcp_ipv4.c > > > > +++ b/net/ipv4/tcp_ipv4.c > > > > @@ -2314,6 +2314,7 @@ int tcp_v4_rcv(struct sk_buff *skb) > > > > tcp_v4_fill_cb(skb, iph, th); > > > > > > > > skb->dev = NULL; > > > > + secpath_reset(skb); > > > > > > > > if (sk->sk_state == TCP_LISTEN) { > > > > ret = tcp_v4_do_rcv(sk, skb); > > > > > > > > Do you want me to send v2, or push a new one if you agree with > > > > this > > > > change? > > > > > > That would only care about TCP and IPv4. > > > > > > I think we need a full fix, not a partial work around to an > > > immediate > > > problem. > > > > > > Can we have some feedback from Steffen, I wonder if we missed > > > something really obvious. > > > > > > It is hard to believe this has been broken for such a long time. > > > > Could you please give me some suggestions? > > Should I add new function to reset both ct and secpath, and replace > > nf_reset_ct() where necessary on receive flow? > > Maybe we should directly remove the device from the xfrm_state > when the decice goes down, this should catch all the cases. > > I think about something like this (untested) patch: > > diff --git a/net/xfrm/xfrm_state.c b/net/xfrm/xfrm_state.c > index 0c306473a79d..ba402275ab57 100644 > --- a/net/xfrm/xfrm_state.c > +++ b/net/xfrm/xfrm_state.c > @@ -867,7 +867,11 @@ int xfrm_dev_state_flush(struct net *net, struct > net_device *dev, bool task_vali > xfrm_state_hold(x); > spin_unlock_bh(&net- > >xfrm.xfrm_state_lock); > > - err = xfrm_state_delete(x); > + spin_lock_bh(&x->lock); > + err = __xfrm_state_delete(x); > + xfrm_dev_state_free(x); > + spin_unlock_bh(&x->lock); > + > xfrm_audit_state_delete(x, err ? 0 : > 1, > task_valid); > xfrm_state_put(x); > > The secpath is still attached to all skbs, but the hang on device > unregister should go away. It didn't fix the issue. I run these commands before unregister netdev: ip x s delall ip x p delall ip x s f ip x p f
On Thu, May 23, 2024 at 02:22:38AM +0000, Jianbo Liu wrote: > On Wed, 2024-05-22 at 11:34 +0200, Steffen Klassert wrote: > > > > Maybe we should directly remove the device from the xfrm_state > > when the decice goes down, this should catch all the cases. > > > > I think about something like this (untested) patch: > > > > diff --git a/net/xfrm/xfrm_state.c b/net/xfrm/xfrm_state.c > > index 0c306473a79d..ba402275ab57 100644 > > --- a/net/xfrm/xfrm_state.c > > +++ b/net/xfrm/xfrm_state.c > > @@ -867,7 +867,11 @@ int xfrm_dev_state_flush(struct net *net, struct > > net_device *dev, bool task_vali > > xfrm_state_hold(x); > > spin_unlock_bh(&net- > > >xfrm.xfrm_state_lock); > > > > - err = xfrm_state_delete(x); > > + spin_lock_bh(&x->lock); > > + err = __xfrm_state_delete(x); > > + xfrm_dev_state_free(x); > > + spin_unlock_bh(&x->lock); > > + > > xfrm_audit_state_delete(x, err ? 0 : > > 1, > > task_valid); > > xfrm_state_put(x); > > > > The secpath is still attached to all skbs, but the hang on device > > unregister should go away. > > It didn't fix the issue. Do you have a backtrace of the ref_tracker? Is that with packet offload? Looks like we need to remove the device from the xfrm_policy too if packet offload is used.
On Thu, 2024-05-23 at 08:44 +0200, Steffen Klassert wrote: > On Thu, May 23, 2024 at 02:22:38AM +0000, Jianbo Liu wrote: > > On Wed, 2024-05-22 at 11:34 +0200, Steffen Klassert wrote: > > > > > > Maybe we should directly remove the device from the xfrm_state > > > when the decice goes down, this should catch all the cases. > > > > > > I think about something like this (untested) patch: > > > > > > diff --git a/net/xfrm/xfrm_state.c b/net/xfrm/xfrm_state.c > > > index 0c306473a79d..ba402275ab57 100644 > > > --- a/net/xfrm/xfrm_state.c > > > +++ b/net/xfrm/xfrm_state.c > > > @@ -867,7 +867,11 @@ int xfrm_dev_state_flush(struct net *net, > > > struct > > > net_device *dev, bool task_vali > > > xfrm_state_hold(x); > > > spin_unlock_bh(&net- > > > > xfrm.xfrm_state_lock); > > > > > > - err = xfrm_state_delete(x); > > > + spin_lock_bh(&x->lock); > > > + err = __xfrm_state_delete(x); > > > + xfrm_dev_state_free(x); > > > + spin_unlock_bh(&x->lock); > > > + > > > xfrm_audit_state_delete(x, err ? 0 > > > : > > > 1, > > > task_valid) > > > ; > > > xfrm_state_put(x); > > > > > > The secpath is still attached to all skbs, but the hang on device > > > unregister should go away. > > > > It didn't fix the issue. > > Do you have a backtrace of the ref_tracker? > > Is that with packet offload? > Yes. And it's the same trace I posted before. ref_tracker: eth%d@000000007421424b has 1/1 users at xfrm_dev_state_add+0xe5/0x4d0 xfrm_add_sa+0xc5c/0x11e0 xfrm_user_rcv_msg+0xfa/0x240 netlink_rcv_skb+0x54/0x100 xfrm_netlink_rcv+0x31/0x40 netlink_unicast+0x1fc/0x2c0 netlink_sendmsg+0x232/0x4a0 __sock_sendmsg+0x38/0x60 ____sys_sendmsg+0x1e3/0x200 ___sys_sendmsg+0x80/0xc0 __sys_sendmsg+0x51/0x90 do_syscall_64+0x40/0xe0 entry_SYSCALL_64_after_hwframe+0x46/0x4e > Looks like we need to remove the device from the xfrm_policy > too if packet offload is used.
On Thu, May 23, 2024 at 06:57:25AM +0000, Jianbo Liu wrote: > On Thu, 2024-05-23 at 08:44 +0200, Steffen Klassert wrote: > > On Thu, May 23, 2024 at 02:22:38AM +0000, Jianbo Liu wrote: > > > On Wed, 2024-05-22 at 11:34 +0200, Steffen Klassert wrote: > > > > > > > > Maybe we should directly remove the device from the xfrm_state > > > > when the decice goes down, this should catch all the cases. > > > > > > > > I think about something like this (untested) patch: > > > > > > > > diff --git a/net/xfrm/xfrm_state.c b/net/xfrm/xfrm_state.c > > > > index 0c306473a79d..ba402275ab57 100644 > > > > --- a/net/xfrm/xfrm_state.c > > > > +++ b/net/xfrm/xfrm_state.c > > > > @@ -867,7 +867,11 @@ int xfrm_dev_state_flush(struct net *net, > > > > struct > > > > net_device *dev, bool task_vali > > > > xfrm_state_hold(x); > > > > spin_unlock_bh(&net- > > > > > xfrm.xfrm_state_lock); > > > > > > > > - err = xfrm_state_delete(x); > > > > + spin_lock_bh(&x->lock); > > > > + err = __xfrm_state_delete(x); > > > > + xfrm_dev_state_free(x); > > > > + spin_unlock_bh(&x->lock); > > > > + > > > > xfrm_audit_state_delete(x, err ? 0 > > > > : > > > > 1, > > > > task_valid) > > > > ; > > > > xfrm_state_put(x); > > > > > > > > The secpath is still attached to all skbs, but the hang on device > > > > unregister should go away. > > > > > > It didn't fix the issue. > > > > Do you have a backtrace of the ref_tracker? > > > > Is that with packet offload? > > > > Yes. And it's the same trace I posted before. > > ref_tracker: eth%d@000000007421424b has 1/1 users at > xfrm_dev_state_add+0xe5/0x4d0 Hm, interesting. Can you check if xfrm_dev_state_free() is triggered in that codepath and if it actually removes the device from the states?
On Thu, 2024-05-23 at 12:00 +0200, Steffen Klassert wrote: > On Thu, May 23, 2024 at 06:57:25AM +0000, Jianbo Liu wrote: > > On Thu, 2024-05-23 at 08:44 +0200, Steffen Klassert wrote: > > > On Thu, May 23, 2024 at 02:22:38AM +0000, Jianbo Liu wrote: > > > > On Wed, 2024-05-22 at 11:34 +0200, Steffen Klassert wrote: > > > > > > > > > > Maybe we should directly remove the device from the > > > > > xfrm_state > > > > > when the decice goes down, this should catch all the cases. > > > > > > > > > > I think about something like this (untested) patch: > > > > > > > > > > diff --git a/net/xfrm/xfrm_state.c b/net/xfrm/xfrm_state.c > > > > > index 0c306473a79d..ba402275ab57 100644 > > > > > --- a/net/xfrm/xfrm_state.c > > > > > +++ b/net/xfrm/xfrm_state.c > > > > > @@ -867,7 +867,11 @@ int xfrm_dev_state_flush(struct net > > > > > *net, > > > > > struct > > > > > net_device *dev, bool task_vali > > > > > xfrm_state_hold(x); > > > > > spin_unlock_bh(&net- > > > > > > xfrm.xfrm_state_lock); > > > > > > > > > > - err = xfrm_state_delete(x); > > > > > + spin_lock_bh(&x->lock); > > > > > + err = __xfrm_state_delete(x); > > > > > + xfrm_dev_state_free(x); > > > > > + spin_unlock_bh(&x->lock); > > > > > + > > > > > xfrm_audit_state_delete(x, > > > > > err ? 0 > > > > > : > > > > > 1, > > > > > task_ > > > > > valid) > > > > > ; > > > > > xfrm_state_put(x); > > > > > > > > > > The secpath is still attached to all skbs, but the hang on > > > > > device > > > > > unregister should go away. > > > > > > > > It didn't fix the issue. > > > > > > Do you have a backtrace of the ref_tracker? > > > > > > Is that with packet offload? > > > > > > > Yes. And it's the same trace I posted before. > > > > ref_tracker: eth%d@000000007421424b has 1/1 users at > > xfrm_dev_state_add+0xe5/0x4d0 > > Hm, interesting. > > Can you check if xfrm_dev_state_free() is triggered in that codepath > and if it actually removes the device from the states? > xfrm_dev_state_free is not triggered. I think it's because I did "ip x s delall" before unregister netdev. Besides, as it's possible to sleep in dev's xdo_dev_state_free, there is a "scheduling while atomic" issue to call xfrm_dev_state_free in spin_lock.
On Thu, May 23, 2024 at 08:44:56AM +0200, Steffen Klassert wrote: > On Thu, May 23, 2024 at 02:22:38AM +0000, Jianbo Liu wrote: > > On Wed, 2024-05-22 at 11:34 +0200, Steffen Klassert wrote: > > > > > > Maybe we should directly remove the device from the xfrm_state > > > when the decice goes down, this should catch all the cases. > > > > > > I think about something like this (untested) patch: > > > > > > diff --git a/net/xfrm/xfrm_state.c b/net/xfrm/xfrm_state.c > > > index 0c306473a79d..ba402275ab57 100644 > > > --- a/net/xfrm/xfrm_state.c > > > +++ b/net/xfrm/xfrm_state.c > > > @@ -867,7 +867,11 @@ int xfrm_dev_state_flush(struct net *net, struct > > > net_device *dev, bool task_vali > > > xfrm_state_hold(x); > > > spin_unlock_bh(&net- > > > >xfrm.xfrm_state_lock); > > > > > > - err = xfrm_state_delete(x); > > > + spin_lock_bh(&x->lock); > > > + err = __xfrm_state_delete(x); > > > + xfrm_dev_state_free(x); > > > + spin_unlock_bh(&x->lock); > > > + > > > xfrm_audit_state_delete(x, err ? 0 : > > > 1, > > > task_valid); > > > xfrm_state_put(x); > > > > > > The secpath is still attached to all skbs, but the hang on device > > > unregister should go away. > > > > It didn't fix the issue. > > Do you have a backtrace of the ref_tracker? > > Is that with packet offload? We saw same failure with crypto and packet offloads. > > Looks like we need to remove the device from the xfrm_policy > too if packet offload is used. When we debugged it, we found that this line [1] was responsible for elevated reference count. XFRM policy cleaned properly. Thanks [1] https://elixir.bootlin.com/linux/latest/source/drivers/net/ethernet/mellanox/mlx5/core/en_accel/ipsec_rxtx.c#L332 >
On Thu, May 23, 2024 at 03:26:22PM +0000, Jianbo Liu wrote: > On Thu, 2024-05-23 at 12:00 +0200, Steffen Klassert wrote: > > > > Hm, interesting. > > > > Can you check if xfrm_dev_state_free() is triggered in that codepath > > and if it actually removes the device from the states? > > > > xfrm_dev_state_free is not triggered. I think it's because I did "ip x > s delall" before unregister netdev. Yes, likely. So we can't defer the device removal to the state free functions, we always need to do that on state delete. > Besides, as it's possible to sleep in dev's xdo_dev_state_free, there > is a "scheduling while atomic" issue to call xfrm_dev_state_free in > spin_lock. Thanks for the hint!
On Mon, May 27, 2024 at 09:40:23AM +0200, Steffen Klassert wrote: > On Thu, May 23, 2024 at 03:26:22PM +0000, Jianbo Liu wrote: > > On Thu, 2024-05-23 at 12:00 +0200, Steffen Klassert wrote: > > > > > > Hm, interesting. > > > > > > Can you check if xfrm_dev_state_free() is triggered in that codepath > > > and if it actually removes the device from the states? > > > > > > > xfrm_dev_state_free is not triggered. I think it's because I did "ip x > > s delall" before unregister netdev. > > Yes, likely. So we can't defer the device removal to the state free > functions, we always need to do that on state delete. The only (not too complicated) solution I see so far is to free the device early, along with the state delete function: diff --git a/net/xfrm/xfrm_state.c b/net/xfrm/xfrm_state.c index 649bb739df0d..bfc71d2daa6a 100644 --- a/net/xfrm/xfrm_state.c +++ b/net/xfrm/xfrm_state.c @@ -721,6 +721,7 @@ int __xfrm_state_delete(struct xfrm_state *x) sock_put(rcu_dereference_raw(x->encap_sk)); xfrm_dev_state_delete(x); + xfrm_dev_state_free(x); /* All xfrm_state objects are created by xfrm_state_alloc. * The xfrm_state_alloc call gives a reference, and that
On Tue, 2024-05-28 at 10:44 +0200, Steffen Klassert wrote: > On Mon, May 27, 2024 at 09:40:23AM +0200, Steffen Klassert wrote: > > On Thu, May 23, 2024 at 03:26:22PM +0000, Jianbo Liu wrote: > > > On Thu, 2024-05-23 at 12:00 +0200, Steffen Klassert wrote: > > > > > > > > Hm, interesting. > > > > > > > > Can you check if xfrm_dev_state_free() is triggered in that > > > > codepath > > > > and if it actually removes the device from the states? > > > > > > > > > > xfrm_dev_state_free is not triggered. I think it's because I did > > > "ip x > > > s delall" before unregister netdev. > > > > Yes, likely. So we can't defer the device removal to the state free > > functions, we always need to do that on state delete. > > The only (not too complicated) solution I see so far is to > free the device early, along with the state delete function: > > diff --git a/net/xfrm/xfrm_state.c b/net/xfrm/xfrm_state.c > index 649bb739df0d..bfc71d2daa6a 100644 > --- a/net/xfrm/xfrm_state.c > +++ b/net/xfrm/xfrm_state.c > @@ -721,6 +721,7 @@ int __xfrm_state_delete(struct xfrm_state *x) > sock_put(rcu_dereference_raw(x->encap_sk)); > > xfrm_dev_state_delete(x); > + xfrm_dev_state_free(x); > Still hit "scheduling while atomic" issue because __xfrm_state_delete is called in state's spin lock. > /* All xfrm_state objects are created by > xfrm_state_alloc. > * The xfrm_state_alloc call gives a reference, and > that
On Tue, May 28, 2024 at 09:02:49AM +0000, Jianbo Liu wrote: > On Tue, 2024-05-28 at 10:44 +0200, Steffen Klassert wrote: > > On Mon, May 27, 2024 at 09:40:23AM +0200, Steffen Klassert wrote: > > > On Thu, May 23, 2024 at 03:26:22PM +0000, Jianbo Liu wrote: > > > > On Thu, 2024-05-23 at 12:00 +0200, Steffen Klassert wrote: > > > > > > > > > > Hm, interesting. > > > > > > > > > > Can you check if xfrm_dev_state_free() is triggered in that > > > > > codepath > > > > > and if it actually removes the device from the states? > > > > > > > > > > > > > xfrm_dev_state_free is not triggered. I think it's because I did > > > > "ip x > > > > s delall" before unregister netdev. > > > > > > Yes, likely. So we can't defer the device removal to the state free > > > functions, we always need to do that on state delete. > > > > The only (not too complicated) solution I see so far is to > > free the device early, along with the state delete function: > > > > diff --git a/net/xfrm/xfrm_state.c b/net/xfrm/xfrm_state.c > > index 649bb739df0d..bfc71d2daa6a 100644 > > --- a/net/xfrm/xfrm_state.c > > +++ b/net/xfrm/xfrm_state.c > > @@ -721,6 +721,7 @@ int __xfrm_state_delete(struct xfrm_state *x) > > sock_put(rcu_dereference_raw(x->encap_sk)); > > > > xfrm_dev_state_delete(x); > > + xfrm_dev_state_free(x); > > > > Still hit "scheduling while atomic" issue because __xfrm_state_delete > is called in state's spin lock. Grmpf. Yes, apparently. Unfortunately I don't have a NIC to test installed currently.
diff --git a/net/core/skbuff.c b/net/core/skbuff.c index b99127712e67..d7f5024f3c08 100644 --- a/net/core/skbuff.c +++ b/net/core/skbuff.c @@ -7025,6 +7025,10 @@ nodefer: __kfree_skb(skb); if (READ_ONCE(sd->defer_count) >= defer_max) goto nodefer; +#ifdef CONFIG_XFRM + skb_ext_del(skb, SKB_EXT_SEC_PATH); +#endif + spin_lock_bh(&sd->defer_lock); /* Send an IPI every time queue reaches half capacity. */ kick = sd->defer_count == (defer_max >> 1);