| Message ID | 20201210153618.21226-1-magnus.karlsson@gmail.com (mailing list archive) |
|---|---|
| State | Changes Requested |
| Delegated to: | BPF |
| Series | [bpf] xsk: fix race in SKB mode transmit with shared cq |
| Context | Check | Description |
|---|---|---|
| netdev/cover_letter | success | Link |
| netdev/fixes_present | success | Link |
| netdev/patch_count | success | Link |
| netdev/tree_selection | success | Clearly marked for bpf |
| netdev/subject_prefix | success | Link |
| netdev/source_inline | success | Was 0 now: 0 |
| netdev/verify_signedoff | success | Link |
| netdev/module_param | success | Was 0 now: 0 |
| netdev/build_32bit | success | Errors and warnings before: 53 this patch: 53 |
| netdev/kdoc | success | Errors and warnings before: 0 this patch: 0 |
| netdev/verify_fixes | success | Link |
| netdev/checkpatch | warning | WARNING: line length of 81 exceeds 80 columns; WARNING: line length of 89 exceeds 80 columns |
| netdev/build_allmodconfig_warn | success | Errors and warnings before: 53 this patch: 53 |
| netdev/header_inline | success | Link |
| netdev/stable | success | Stable not CCed |
On Thu, Dec 10, 2020 at 7:36 AM Magnus Karlsson <magnus.karlsson@gmail.com> wrote:
>
> From: Magnus Karlsson <magnus.karlsson@intel.com>
>
> Fix a race when multiple sockets are simultaneously calling sendto()
> when the completion ring is shared in the SKB case. This is the case
> when you share the same netdev and queue id through the
> XDP_SHARED_UMEM bind flag. The problem is that multiple processes can
> be in xsk_generic_xmit() and call the backpressure mechanism in
> xskq_prod_reserve(xs->pool->cq). As this is a shared resource in this
> specific scenario, a race might occur since the rings are
> single-producer single-consumer.
>
> Fix this by moving the tx_completion_lock from the socket to the pool
> as the pool is shared between the sockets that share the completion
> ring. (The pool is not shared when this is not the case.) And then
> protect the accesses to xskq_prod_reserve() with this lock. The
> tx_completion_lock is renamed cq_lock to better reflect that it
> protects accesses to the potentially shared completion ring.
>
> Fixes: 35fcde7f8deb ("xsk: support for Tx")
> Fixes: a9744f7ca200 ("xsk: fix potential race in SKB TX completion code")
> Signed-off-by: Magnus Karlsson <magnus.karlsson@intel.com>
> Reported-by: Xuan Zhuo <xuanzhuo@linux.alibaba.com>
> ---
>  include/net/xdp_sock.h      | 4 ----
>  include/net/xsk_buff_pool.h | 5 +++++
>  net/xdp/xsk.c               | 9 ++++++---
>  net/xdp/xsk_buff_pool.c     | 1 +
>  4 files changed, 12 insertions(+), 7 deletions(-)
>
> [...]
>
> @@ -409,10 +410,13 @@ static int xsk_generic_xmit(struct sock *sk)
>  		 * if there is space in it. This avoids having to implement
>  		 * any buffering in the Tx path.
>  		 */
> +		spin_lock_irqsave(&xs->pool->cq_lock, flags);
>  		if (unlikely(err) || xskq_prod_reserve(xs->pool->cq)) {
> +			spin_unlock_irqrestore(&xs->pool->cq_lock, flags);
>  			kfree_skb(skb);
>  			goto out;
>  		}
> +		spin_unlock_irqrestore(&xs->pool->cq_lock, flags);

Lock/unlock for every packet?
Do you have any performance concerns?
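To make the race the commit message describes concrete: the rings are single-producer/single-consumer, so the producer-side reserve is a plain, unlocked read-modify-write of a cached index. The sketch below is a simplified userspace illustration of that step, not the kernel's actual xskq_prod_reserve(); the struct and field names are made up for the example.

```c
#include <errno.h>

/* Simplified stand-in for the completion ring's producer-side state. The
 * real struct xsk_queue keeps cached producer/consumer indices precisely
 * because each ring is meant to have one producer and one consumer.
 */
struct cq_sketch {
	unsigned int ring_size;   /* number of slots */
	unsigned int cached_prod; /* next slot this producer will fill */
	unsigned int cached_cons; /* last known consumer position */
};

/* Backpressure check plus reservation as one unlocked read-modify-write.
 * If two tasks in xsk_generic_xmit() share the completion ring, both can
 * read the same cached_prod, pass the full check, and claim the same slot.
 * This is the step that pool->cq_lock serializes in the patch.
 */
static int cq_sketch_reserve(struct cq_sketch *q)
{
	if (q->cached_prod - q->cached_cons == q->ring_size)
		return -ENOSPC;	/* ring full: sendto() backs off */

	q->cached_prod++;	/* not atomic, hence the race */
	return 0;
}
```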
On Thu, Dec 10, 2020 at 10:53 PM Alexei Starovoitov <alexei.starovoitov@gmail.com> wrote:
>
> On Thu, Dec 10, 2020 at 7:36 AM Magnus Karlsson
> <magnus.karlsson@gmail.com> wrote:
> >
> > [...]
> >
> > +		spin_lock_irqsave(&xs->pool->cq_lock, flags);
> >  		if (unlikely(err) || xskq_prod_reserve(xs->pool->cq)) {
> > +			spin_unlock_irqrestore(&xs->pool->cq_lock, flags);
> >  			kfree_skb(skb);
> >  			goto out;
> >  		}
> > +		spin_unlock_irqrestore(&xs->pool->cq_lock, flags);
>
> Lock/unlock for every packet?
> Do you have any performance concerns?

I have measured and the performance impact of this is in the noise.
There is unfortunately already a lot of code being executed per packet
in this path. On the positive side, once this bug fix trickles down to
bpf-next, I will submit a rather simple patch to this function that
improves throughput by 15% for the txpush microbenchmark. This will
more than compensate for any locking introduced.
On Fri, Dec 11, 2020 at 8:47 AM Magnus Karlsson <magnus.karlsson@gmail.com> wrote:
>
> On Thu, Dec 10, 2020 at 10:53 PM Alexei Starovoitov
> <alexei.starovoitov@gmail.com> wrote:
> >
> > [...]
> >
> > Lock/unlock for every packet?
> > Do you have any performance concerns?
>
> I have measured and the performance impact of this is in the noise.
> There is unfortunately already a lot of code being executed per packet
> in this path. On the positive side, once this bug fix trickles down to
> bpf-next, I will submit a rather simple patch to this function that
> improves throughput by 15% for the txpush microbenchmark. This will
> more than compensate for any locking introduced.

Please ignore/drop this patch. I will include this one as patch 1 in a
patch set of 2 that I will submit. The second patch will be a fix to the
reservation problem spotted by Xuan and it needs the locking of the
first patch to be able to work.
```diff
diff --git a/include/net/xdp_sock.h b/include/net/xdp_sock.h
index 4f4e93bf814c..cc17bc957548 100644
--- a/include/net/xdp_sock.h
+++ b/include/net/xdp_sock.h
@@ -58,10 +58,6 @@ struct xdp_sock {
 
 	struct xsk_queue *tx ____cacheline_aligned_in_smp;
 	struct list_head tx_list;
-	/* Mutual exclusion of NAPI TX thread and sendmsg error paths
-	 * in the SKB destructor callback.
-	 */
-	spinlock_t tx_completion_lock;
 	/* Protects generic receive. */
 	spinlock_t rx_lock;
 
diff --git a/include/net/xsk_buff_pool.h b/include/net/xsk_buff_pool.h
index 01755b838c74..eaa8386dbc63 100644
--- a/include/net/xsk_buff_pool.h
+++ b/include/net/xsk_buff_pool.h
@@ -73,6 +73,11 @@ struct xsk_buff_pool {
 	bool dma_need_sync;
 	bool unaligned;
 	void *addrs;
+	/* Mutual exclusion of the completion ring in the SKB mode. Two cases to protect:
+	 * NAPI TX thread and sendmsg error paths in the SKB destructor callback and when
+	 * sockets share a single cq when the same netdev and queue id is shared.
+	 */
+	spinlock_t cq_lock;
 	struct xdp_buff_xsk *free_heads[];
 };
 
diff --git a/net/xdp/xsk.c b/net/xdp/xsk.c
index 62504471fd20..42cb5f94d49e 100644
--- a/net/xdp/xsk.c
+++ b/net/xdp/xsk.c
@@ -364,9 +364,9 @@ static void xsk_destruct_skb(struct sk_buff *skb)
 	struct xdp_sock *xs = xdp_sk(skb->sk);
 	unsigned long flags;
 
-	spin_lock_irqsave(&xs->tx_completion_lock, flags);
+	spin_lock_irqsave(&xs->pool->cq_lock, flags);
 	xskq_prod_submit_addr(xs->pool->cq, addr);
-	spin_unlock_irqrestore(&xs->tx_completion_lock, flags);
+	spin_unlock_irqrestore(&xs->pool->cq_lock, flags);
 
 	sock_wfree(skb);
 }
@@ -378,6 +378,7 @@ static int xsk_generic_xmit(struct sock *sk)
 	bool sent_frame = false;
 	struct xdp_desc desc;
 	struct sk_buff *skb;
+	unsigned long flags;
 	int err = 0;
 
 	mutex_lock(&xs->mutex);
@@ -409,10 +410,13 @@ static int xsk_generic_xmit(struct sock *sk)
 		 * if there is space in it. This avoids having to implement
 		 * any buffering in the Tx path.
 		 */
+		spin_lock_irqsave(&xs->pool->cq_lock, flags);
 		if (unlikely(err) || xskq_prod_reserve(xs->pool->cq)) {
+			spin_unlock_irqrestore(&xs->pool->cq_lock, flags);
 			kfree_skb(skb);
 			goto out;
 		}
+		spin_unlock_irqrestore(&xs->pool->cq_lock, flags);
 
 		skb->dev = xs->dev;
 		skb->priority = sk->sk_priority;
@@ -1193,7 +1197,6 @@ static int xsk_create(struct net *net, struct socket *sock, int protocol,
 	xs->state = XSK_READY;
 	mutex_init(&xs->mutex);
 	spin_lock_init(&xs->rx_lock);
-	spin_lock_init(&xs->tx_completion_lock);
 	INIT_LIST_HEAD(&xs->map_list);
 	spin_lock_init(&xs->map_list_lock);
 
diff --git a/net/xdp/xsk_buff_pool.c b/net/xdp/xsk_buff_pool.c
index d5adeee9d5d9..7da28566ac11 100644
--- a/net/xdp/xsk_buff_pool.c
+++ b/net/xdp/xsk_buff_pool.c
@@ -71,6 +71,7 @@ struct xsk_buff_pool *xp_create_and_assign_umem(struct xdp_sock *xs,
 	INIT_LIST_HEAD(&pool->free_list);
 	INIT_LIST_HEAD(&pool->xsk_tx_list);
 	spin_lock_init(&pool->xsk_tx_list_lock);
+	spin_lock_init(&pool->cq_lock);
 	refcount_set(&pool->users, 1);
 	pool->fq = xs->fq_tmp;
```
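For context on when this fix matters in practice, below is a hedged userspace sketch of the setup the commit message refers to: a second AF_XDP socket bound to the same netdev and queue id with XDP_SHARED_UMEM, so both sockets share the first socket's fill and completion rings. The bind_xsk() helper name is made up for the example, umem registration, ring mmap and error handling are omitted, and the ifindex/queue id values are placeholders; whether the kernel actually takes the generic (SKB) transmit path also depends on the driver, not only on these flags.

```c
#include <linux/if_xdp.h>
#include <sys/socket.h>
#include <string.h>

#ifndef AF_XDP
#define AF_XDP 44	/* may be missing from older libc headers */
#endif

/* Bind an AF_XDP socket to ifindex/queue_id. If shared_umem_fd >= 0, the
 * socket is bound with XDP_SHARED_UMEM and reuses that socket's umem,
 * fill ring and completion ring -- the shared-cq case this patch locks.
 */
static int bind_xsk(int fd, unsigned int ifindex, unsigned int queue_id,
		    int shared_umem_fd)
{
	struct sockaddr_xdp sxdp;

	memset(&sxdp, 0, sizeof(sxdp));
	sxdp.sxdp_family = AF_XDP;
	sxdp.sxdp_ifindex = ifindex;
	sxdp.sxdp_queue_id = queue_id;

	if (shared_umem_fd >= 0) {
		sxdp.sxdp_flags = XDP_SHARED_UMEM;
		sxdp.sxdp_shared_umem_fd = shared_umem_fd;
	} else {
		sxdp.sxdp_flags = XDP_COPY;	/* forbid zero-copy */
	}

	return bind(fd, (struct sockaddr *)&sxdp, sizeof(sxdp));
}

/* Usage sketch (umem registration and ring setup omitted, so these binds
 * would fail as-is):
 *
 *	int xsk1 = socket(AF_XDP, SOCK_RAW, 0);
 *	int xsk2 = socket(AF_XDP, SOCK_RAW, 0);
 *	bind_xsk(xsk1, ifindex, qid, -1);
 *	bind_xsk(xsk2, ifindex, qid, xsk1);	// shares xsk1's cq
 *
 * With both sockets calling sendto() from different threads, the two
 * producers on the shared completion ring are serialized by pool->cq_lock
 * in xsk_generic_xmit() and xsk_destruct_skb() after this patch.
 */
```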