diff mbox series

[net] xsk: correct tx_ring_empty_descs count statistics

Message ID 20250329061548.1357925-1-wangliang74@huawei.com (mailing list archive)
State Changes Requested
Delegated to: Netdev Maintainers
Headers show
Series [net] xsk: correct tx_ring_empty_descs count statistics | expand

Checks

Context Check Description
netdev/series_format success Single patches do not need cover letters
netdev/tree_selection success Clearly marked for net
netdev/ynl success Generated files up to date; no warnings/errors; no diff in generated;
netdev/fixes_present success Fixes tag present in non-next series
netdev/header_inline success No static functions without inline keyword in header files
netdev/build_32bit success Errors and warnings before: 0 this patch: 0
netdev/build_tools success No tools touched, skip
netdev/cc_maintainers success CCed 14 of 14 maintainers
netdev/build_clang success Errors and warnings before: 0 this patch: 0
netdev/verify_signedoff success Signed-off-by tag matches author and committer
netdev/deprecated_api success None detected
netdev/check_selftest success No net selftest shell script
netdev/verify_fixes success Fixes tag looks correct
netdev/build_allmodconfig_warn success Errors and warnings before: 0 this patch: 0
netdev/checkpatch success total: 0 errors, 0 warnings, 0 checks, 8 lines checked
netdev/build_clang_rust success No Rust files in patch. Skipping build
netdev/kdoc success Errors and warnings before: 0 this patch: 0
netdev/source_inline success Was 0 now: 0
netdev/contest success net-next-2025-04-01--00-00 (tests: 902)

Commit Message

Wang Liang March 29, 2025, 6:15 a.m. UTC
The tx_ring_empty_descs count may be incorrect, when set the XDP_TX_RING
option but do not reserve tx ring. Because xsk_poll() try to wakeup the
driver by calling xsk_generic_xmit() for non-zero-copy mode. So the
tx_ring_empty_descs count increases once the xsk_poll()is called:

  xsk_poll
    xsk_generic_xmit
      __xsk_generic_xmit
        xskq_cons_peek_desc
          xskq_cons_read_desc
            q->queue_empty_descs++;

To avoid this count error, add check for tx descs before send msg in poll.

Fixes: df551058f7a3 ("xsk: Fix crash in poll when device does not support ndo_xsk_wakeup")
Signed-off-by: Wang Liang <wangliang74@huawei.com>
---
 net/xdp/xsk.c | 2 +-
 1 file changed, 1 insertion(+), 1 deletion(-)

Comments

Stanislav Fomichev March 31, 2025, 3:22 p.m. UTC | #1
On 03/29, Wang Liang wrote:
> The tx_ring_empty_descs count may be incorrect, when set the XDP_TX_RING
> option but do not reserve tx ring. Because xsk_poll() try to wakeup the
> driver by calling xsk_generic_xmit() for non-zero-copy mode. So the
> tx_ring_empty_descs count increases once the xsk_poll()is called:
> 
>   xsk_poll
>     xsk_generic_xmit
>       __xsk_generic_xmit
>         xskq_cons_peek_desc
>           xskq_cons_read_desc
>             q->queue_empty_descs++;
> 
> To avoid this count error, add check for tx descs before send msg in poll.
> 
> Fixes: df551058f7a3 ("xsk: Fix crash in poll when device does not support ndo_xsk_wakeup")
> Signed-off-by: Wang Liang <wangliang74@huawei.com>

Acked-by: Stanislav Fomichev <sdf@fomichev.me>
Stanislav Fomichev March 31, 2025, 10:03 p.m. UTC | #2
On 03/31, Stanislav Fomichev wrote:
> On 03/29, Wang Liang wrote:
> > The tx_ring_empty_descs count may be incorrect, when set the XDP_TX_RING
> > option but do not reserve tx ring. Because xsk_poll() try to wakeup the
> > driver by calling xsk_generic_xmit() for non-zero-copy mode. So the
> > tx_ring_empty_descs count increases once the xsk_poll()is called:
> > 
> >   xsk_poll
> >     xsk_generic_xmit
> >       __xsk_generic_xmit
> >         xskq_cons_peek_desc
> >           xskq_cons_read_desc
> >             q->queue_empty_descs++;
> > 
> > To avoid this count error, add check for tx descs before send msg in poll.
> > 
> > Fixes: df551058f7a3 ("xsk: Fix crash in poll when device does not support ndo_xsk_wakeup")
> > Signed-off-by: Wang Liang <wangliang74@huawei.com>
> 
> Acked-by: Stanislav Fomichev <sdf@fomichev.me>

Hmm, wait, I stumbled upon xskq_has_descs again and it looks only at
cached prod/cons. How is it supposed to work when the actual tx
descriptor is posted? Is there anything besides xskq_cons_peek_desc from
__xsk_generic_xmit that refreshes cached_prod?
Wang Liang April 1, 2025, 2:35 a.m. UTC | #3
在 2025/4/1 6:03, Stanislav Fomichev 写道:
> On 03/31, Stanislav Fomichev wrote:
>> On 03/29, Wang Liang wrote:
>>> The tx_ring_empty_descs count may be incorrect, when set the XDP_TX_RING
>>> option but do not reserve tx ring. Because xsk_poll() try to wakeup the
>>> driver by calling xsk_generic_xmit() for non-zero-copy mode. So the
>>> tx_ring_empty_descs count increases once the xsk_poll()is called:
>>>
>>>    xsk_poll
>>>      xsk_generic_xmit
>>>        __xsk_generic_xmit
>>>          xskq_cons_peek_desc
>>>            xskq_cons_read_desc
>>>              q->queue_empty_descs++;
>>>
>>> To avoid this count error, add check for tx descs before send msg in poll.
>>>
>>> Fixes: df551058f7a3 ("xsk: Fix crash in poll when device does not support ndo_xsk_wakeup")
>>> Signed-off-by: Wang Liang <wangliang74@huawei.com>
>> Acked-by: Stanislav Fomichev <sdf@fomichev.me>
> Hmm, wait, I stumbled upon xskq_has_descs again and it looks only at
> cached prod/cons. How is it supposed to work when the actual tx
> descriptor is posted? Is there anything besides xskq_cons_peek_desc from
> __xsk_generic_xmit that refreshes cached_prod?


Yes, you are right!

How about using xskq_cons_nb_entries() to check free descriptors?

Like this:


diff --git a/net/xdp/xsk.c b/net/xdp/xsk.c
index e5d104ce7b82..babb7928d335 100644
--- a/net/xdp/xsk.c
+++ b/net/xdp/xsk.c
@@ -993,7 +993,7 @@ static __poll_t xsk_poll(struct file *file, struct 
socket *sock,
         if (pool->cached_need_wakeup) {
                 if (xs->zc)
                         xsk_wakeup(xs, pool->cached_need_wakeup);
-               else if (xs->tx)
+               else if (xs->tx && xskq_cons_nb_entries(xs->tx, 1))
                         /* Poll needs to drive Tx also in copy mode */
                         xsk_generic_xmit(sk);
         }
Magnus Karlsson April 1, 2025, 6:57 a.m. UTC | #4
On Tue, 1 Apr 2025 at 04:36, Wang Liang <wangliang74@huawei.com> wrote:
>
>
> 在 2025/4/1 6:03, Stanislav Fomichev 写道:
> > On 03/31, Stanislav Fomichev wrote:
> >> On 03/29, Wang Liang wrote:
> >>> The tx_ring_empty_descs count may be incorrect, when set the XDP_TX_RING
> >>> option but do not reserve tx ring. Because xsk_poll() try to wakeup the
> >>> driver by calling xsk_generic_xmit() for non-zero-copy mode. So the
> >>> tx_ring_empty_descs count increases once the xsk_poll()is called:
> >>>
> >>>    xsk_poll
> >>>      xsk_generic_xmit
> >>>        __xsk_generic_xmit
> >>>          xskq_cons_peek_desc
> >>>            xskq_cons_read_desc
> >>>              q->queue_empty_descs++;

Sorry, but I do not understand how to reproduce this error. So you
first issue a setsockopt with the XDP_TX_RING option and then you do
not "reserve tx ring". What does that last "not reserve tx ring" mean?
No mmap() of that ring, or something else? I guess you have bound the
socket with a bind()? Some pseudo code on how to reproduce this would
be helpful. Just want to understand so I can help. Thank you.

> >>>
> >>> To avoid this count error, add check for tx descs before send msg in poll.
> >>>
> >>> Fixes: df551058f7a3 ("xsk: Fix crash in poll when device does not support ndo_xsk_wakeup")
> >>> Signed-off-by: Wang Liang <wangliang74@huawei.com>
> >> Acked-by: Stanislav Fomichev <sdf@fomichev.me>
> > Hmm, wait, I stumbled upon xskq_has_descs again and it looks only at
> > cached prod/cons. How is it supposed to work when the actual tx
> > descriptor is posted? Is there anything besides xskq_cons_peek_desc from
> > __xsk_generic_xmit that refreshes cached_prod?
>
>
> Yes, you are right!
>
> How about using xskq_cons_nb_entries() to check free descriptors?
>
> Like this:
>
>
> diff --git a/net/xdp/xsk.c b/net/xdp/xsk.c
> index e5d104ce7b82..babb7928d335 100644
> --- a/net/xdp/xsk.c
> +++ b/net/xdp/xsk.c
> @@ -993,7 +993,7 @@ static __poll_t xsk_poll(struct file *file, struct
> socket *sock,
>          if (pool->cached_need_wakeup) {
>                  if (xs->zc)
>                          xsk_wakeup(xs, pool->cached_need_wakeup);
> -               else if (xs->tx)
> +               else if (xs->tx && xskq_cons_nb_entries(xs->tx, 1))
>                          /* Poll needs to drive Tx also in copy mode */
>                          xsk_generic_xmit(sk);
>          }
>
>
Wang Liang April 1, 2025, 7:39 a.m. UTC | #5
在 2025/4/1 14:57, Magnus Karlsson 写道:
> On Tue, 1 Apr 2025 at 04:36, Wang Liang <wangliang74@huawei.com> wrote:
>>
>> 在 2025/4/1 6:03, Stanislav Fomichev 写道:
>>> On 03/31, Stanislav Fomichev wrote:
>>>> On 03/29, Wang Liang wrote:
>>>>> The tx_ring_empty_descs count may be incorrect, when set the XDP_TX_RING
>>>>> option but do not reserve tx ring. Because xsk_poll() try to wakeup the
>>>>> driver by calling xsk_generic_xmit() for non-zero-copy mode. So the
>>>>> tx_ring_empty_descs count increases once the xsk_poll()is called:
>>>>>
>>>>>     xsk_poll
>>>>>       xsk_generic_xmit
>>>>>         __xsk_generic_xmit
>>>>>           xskq_cons_peek_desc
>>>>>             xskq_cons_read_desc
>>>>>               q->queue_empty_descs++;
> Sorry, but I do not understand how to reproduce this error. So you
> first issue a setsockopt with the XDP_TX_RING option and then you do
> not "reserve tx ring". What does that last "not reserve tx ring" mean?
> No mmap() of that ring, or something else? I guess you have bound the
> socket with a bind()? Some pseudo code on how to reproduce this would
> be helpful. Just want to understand so I can help. Thank you.
>
Ok. Some pseudo code like below: fd = socket(AF_XDP, SOCK_RAW, 0); 
setsockopt(fd, SOL_XDP, XDP_UMEM_REG, &mr, sizeof(mr)); setsockopt(fd, 
SOL_XDP, XDP_UMEM_FILL_RING, &fill_size, sizeof(fill_size)); 
setsockopt(fd, SOL_XDP, XDP_UMEM_COMPLETION_RING, &comp_size, 
sizeof(comp_size)); mmap(NULL, off.fr.desc + fill_size * sizeof(__u64), 
..., XDP_UMEM_PGOFF_FILL_RING); mmap(NULL, off.cr.desc + comp_size * 
sizeof(__u64), ..., XDP_UMEM_PGOFF_COMPLETION_RING); setsockopt(fd, 
SOL_XDP, XDP_RX_RING, &rx_size, sizeof(rx_size)); setsockopt(fd, 
SOL_XDP, XDP_TX_RING, &tx_size, sizeof(tx_size)); mmap(NULL, off.rx.desc 
+ rx_size * sizeof(struct xdp_desc), ..., XDP_PGOFF_RX_RING); mmap(NULL, 
off.tx.desc + tx_size * sizeof(struct xdp_desc), ..., 
XDP_PGOFF_TX_RING); bind(fd, (struct sockaddr *)&sxdp, sizeof(sxdp)); 
bpf_map_update_elem(xsk_map_fd, &queue_id, &fd, 0); while(!global_exit) 
{ poll(fds, 1, -1); handle_receive_packets(...); } The xsk is created 
success, and xs->tx is initialized. The "not reserve tx ring" means user 
app do not update tx ring producer. Like: xsk_ring_prod__reserve(tx, 1, 
&tx_idx); xsk_ring_prod__tx_desc(tx, tx_idx)->addr = frame; 
xsk_ring_prod__tx_desc(tx, tx_idx)->len = pkg_length; 
xsk_ring_prod__submit(tx, 1); These functions (xsk_ring_prod__reserve, 
etc.) is provided by libxdp. The tx->producer is not updated, so the 
xs->tx->cached_cons and xs->tx->cached_prod are always zero. When 
receive packets and user app call poll(), xsk_generic_xmit() will be 
triggered by xsk_poll(), leading to this issue.
Wang Liang April 1, 2025, 7:43 a.m. UTC | #6
在 2025/4/1 14:57, Magnus Karlsson 写道:
> On Tue, 1 Apr 2025 at 04:36, Wang Liang <wangliang74@huawei.com> wrote:
>>
>> 在 2025/4/1 6:03, Stanislav Fomichev 写道:
>>> On 03/31, Stanislav Fomichev wrote:
>>>> On 03/29, Wang Liang wrote:
>>>>> The tx_ring_empty_descs count may be incorrect, when set the XDP_TX_RING
>>>>> option but do not reserve tx ring. Because xsk_poll() try to wakeup the
>>>>> driver by calling xsk_generic_xmit() for non-zero-copy mode. So the
>>>>> tx_ring_empty_descs count increases once the xsk_poll()is called:
>>>>>
>>>>>     xsk_poll
>>>>>       xsk_generic_xmit
>>>>>         __xsk_generic_xmit
>>>>>           xskq_cons_peek_desc
>>>>>             xskq_cons_read_desc
>>>>>               q->queue_empty_descs++;
> Sorry, but I do not understand how to reproduce this error. So you
> first issue a setsockopt with the XDP_TX_RING option and then you do
> not "reserve tx ring". What does that last "not reserve tx ring" mean?
> No mmap() of that ring, or something else? I guess you have bound the
> socket with a bind()? Some pseudo code on how to reproduce this would
> be helpful. Just want to understand so I can help. Thank you.
Sorry, the last email is garbled, and send again.

Ok. Some pseudo code like below:

     fd = socket(AF_XDP, SOCK_RAW, 0);
     setsockopt(fd, SOL_XDP, XDP_UMEM_REG, &mr, sizeof(mr));

     setsockopt(fd, SOL_XDP, XDP_UMEM_FILL_RING, &fill_size, 
sizeof(fill_size));
     setsockopt(fd, SOL_XDP, XDP_UMEM_COMPLETION_RING, &comp_size, 
sizeof(comp_size));
     mmap(NULL, off.fr.desc + fill_size * sizeof(__u64), ..., 
XDP_UMEM_PGOFF_FILL_RING);
     mmap(NULL, off.cr.desc + comp_size * sizeof(__u64), ..., 
XDP_UMEM_PGOFF_COMPLETION_RING);

     setsockopt(fd, SOL_XDP, XDP_RX_RING, &rx_size, sizeof(rx_size));
     setsockopt(fd, SOL_XDP, XDP_TX_RING, &tx_size, sizeof(tx_size));
     mmap(NULL, off.rx.desc + rx_size * sizeof(struct xdp_desc), ..., 
XDP_PGOFF_RX_RING);
     mmap(NULL, off.tx.desc + tx_size * sizeof(struct xdp_desc), ..., 
XDP_PGOFF_TX_RING);

     bind(fd, (struct sockaddr *)&sxdp, sizeof(sxdp));
     bpf_map_update_elem(xsk_map_fd, &queue_id, &fd, 0);

     while(!global_exit) {
         poll(fds, 1, -1);
         handle_receive_packets(...);
     }

The xsk is created success, and xs->tx is initialized.

The "not reserve tx ring" means user app do not update tx ring producer.
Like:

     xsk_ring_prod__reserve(tx, 1, &tx_idx);
     xsk_ring_prod__tx_desc(tx, tx_idx)->addr = frame;
     xsk_ring_prod__tx_desc(tx, tx_idx)->len = pkg_length;
     xsk_ring_prod__submit(tx, 1);

These functions (xsk_ring_prod__reserve, etc.) is provided by libxdp.

The tx->producer is not updated, so the xs->tx->cached_cons and
xs->tx->cached_prod are always zero.

When receive packets and user app call poll(), xsk_generic_xmit() will be
triggered by xsk_poll(), leading to this issue.
>>>>> To avoid this count error, add check for tx descs before send msg in poll.
>>>>>
>>>>> Fixes: df551058f7a3 ("xsk: Fix crash in poll when device does not support ndo_xsk_wakeup")
>>>>> Signed-off-by: Wang Liang <wangliang74@huawei.com>
>>>> Acked-by: Stanislav Fomichev <sdf@fomichev.me>
>>> Hmm, wait, I stumbled upon xskq_has_descs again and it looks only at
>>> cached prod/cons. How is it supposed to work when the actual tx
>>> descriptor is posted? Is there anything besides xskq_cons_peek_desc from
>>> __xsk_generic_xmit that refreshes cached_prod?
>>
>> Yes, you are right!
>>
>> How about using xskq_cons_nb_entries() to check free descriptors?
>>
>> Like this:
>>
>>
>> diff --git a/net/xdp/xsk.c b/net/xdp/xsk.c
>> index e5d104ce7b82..babb7928d335 100644
>> --- a/net/xdp/xsk.c
>> +++ b/net/xdp/xsk.c
>> @@ -993,7 +993,7 @@ static __poll_t xsk_poll(struct file *file, struct
>> socket *sock,
>>           if (pool->cached_need_wakeup) {
>>                   if (xs->zc)
>>                           xsk_wakeup(xs, pool->cached_need_wakeup);
>> -               else if (xs->tx)
>> +               else if (xs->tx && xskq_cons_nb_entries(xs->tx, 1))
>>                           /* Poll needs to drive Tx also in copy mode */
>>                           xsk_generic_xmit(sk);
>>           }
>>
>>
Magnus Karlsson April 1, 2025, 8:12 a.m. UTC | #7
On Tue, 1 Apr 2025 at 09:44, Wang Liang <wangliang74@huawei.com> wrote:
>
>
> 在 2025/4/1 14:57, Magnus Karlsson 写道:
> > On Tue, 1 Apr 2025 at 04:36, Wang Liang <wangliang74@huawei.com> wrote:
> >>
> >> 在 2025/4/1 6:03, Stanislav Fomichev 写道:
> >>> On 03/31, Stanislav Fomichev wrote:
> >>>> On 03/29, Wang Liang wrote:
> >>>>> The tx_ring_empty_descs count may be incorrect, when set the XDP_TX_RING
> >>>>> option but do not reserve tx ring. Because xsk_poll() try to wakeup the
> >>>>> driver by calling xsk_generic_xmit() for non-zero-copy mode. So the
> >>>>> tx_ring_empty_descs count increases once the xsk_poll()is called:
> >>>>>
> >>>>>     xsk_poll
> >>>>>       xsk_generic_xmit
> >>>>>         __xsk_generic_xmit
> >>>>>           xskq_cons_peek_desc
> >>>>>             xskq_cons_read_desc
> >>>>>               q->queue_empty_descs++;
> > Sorry, but I do not understand how to reproduce this error. So you
> > first issue a setsockopt with the XDP_TX_RING option and then you do
> > not "reserve tx ring". What does that last "not reserve tx ring" mean?
> > No mmap() of that ring, or something else? I guess you have bound the
> > socket with a bind()? Some pseudo code on how to reproduce this would
> > be helpful. Just want to understand so I can help. Thank you.
> Sorry, the last email is garbled, and send again.
>
> Ok. Some pseudo code like below:
>
>      fd = socket(AF_XDP, SOCK_RAW, 0);
>      setsockopt(fd, SOL_XDP, XDP_UMEM_REG, &mr, sizeof(mr));
>
>      setsockopt(fd, SOL_XDP, XDP_UMEM_FILL_RING, &fill_size,
> sizeof(fill_size));
>      setsockopt(fd, SOL_XDP, XDP_UMEM_COMPLETION_RING, &comp_size,
> sizeof(comp_size));
>      mmap(NULL, off.fr.desc + fill_size * sizeof(__u64), ...,
> XDP_UMEM_PGOFF_FILL_RING);
>      mmap(NULL, off.cr.desc + comp_size * sizeof(__u64), ...,
> XDP_UMEM_PGOFF_COMPLETION_RING);
>
>      setsockopt(fd, SOL_XDP, XDP_RX_RING, &rx_size, sizeof(rx_size));
>      setsockopt(fd, SOL_XDP, XDP_TX_RING, &tx_size, sizeof(tx_size));
>      mmap(NULL, off.rx.desc + rx_size * sizeof(struct xdp_desc), ...,
> XDP_PGOFF_RX_RING);
>      mmap(NULL, off.tx.desc + tx_size * sizeof(struct xdp_desc), ...,
> XDP_PGOFF_TX_RING);
>
>      bind(fd, (struct sockaddr *)&sxdp, sizeof(sxdp));
>      bpf_map_update_elem(xsk_map_fd, &queue_id, &fd, 0);
>
>      while(!global_exit) {
>          poll(fds, 1, -1);
>          handle_receive_packets(...);
>      }
>
> The xsk is created success, and xs->tx is initialized.
>
> The "not reserve tx ring" means user app do not update tx ring producer.
> Like:
>
>      xsk_ring_prod__reserve(tx, 1, &tx_idx);
>      xsk_ring_prod__tx_desc(tx, tx_idx)->addr = frame;
>      xsk_ring_prod__tx_desc(tx, tx_idx)->len = pkg_length;
>      xsk_ring_prod__submit(tx, 1);
>
> These functions (xsk_ring_prod__reserve, etc.) is provided by libxdp.
>
> The tx->producer is not updated, so the xs->tx->cached_cons and
> xs->tx->cached_prod are always zero.
>
> When receive packets and user app call poll(), xsk_generic_xmit() will be
> triggered by xsk_poll(), leading to this issue.

Thanks, that really helped. The problem here is that the kernel cannot
guess your intent. Since you created a socket with both Rx and Tx, it
thinks you will use it for both, so it should increase
queue_empty_descs in this case as you did not provide any Tx descs.
Your proposed patch will break this. Consider this Tx case with the
exact same init code as you have above but with this send loop:

while(!global_exit) {
       maybe_send_packets(...);
       poll(fds, 1, -1);
}

With your patch, the queue_empty_descs will never be increased in the
case when I do not submit any Tx descs, even though we would like it
to be so.

So in my mind, you have a couple of options:

* Create two sockets, one rx only and one tx only and use the
SHARED_UMEM mode to bind them to the same netdev and queue id. In your
loop above, you would use the Rx socket. This might have the drawback
that you need to call poll() twice if you are both sending and
receiving in the same loop. But the stats will be the way you want
them to be.

* Introduce a new variable in user space that you increase every time
you do poll() in your loop above. When displaying the statistics, just
deduct this variable from the queue_empty_descs that the kernel
reports using the XDP_STATISTICS getsockopt().

Hope this helps.

> >>>>> To avoid this count error, add check for tx descs before send msg in poll.
> >>>>>
> >>>>> Fixes: df551058f7a3 ("xsk: Fix crash in poll when device does not support ndo_xsk_wakeup")
> >>>>> Signed-off-by: Wang Liang <wangliang74@huawei.com>
> >>>> Acked-by: Stanislav Fomichev <sdf@fomichev.me>
> >>> Hmm, wait, I stumbled upon xskq_has_descs again and it looks only at
> >>> cached prod/cons. How is it supposed to work when the actual tx
> >>> descriptor is posted? Is there anything besides xskq_cons_peek_desc from
> >>> __xsk_generic_xmit that refreshes cached_prod?
> >>
> >> Yes, you are right!
> >>
> >> How about using xskq_cons_nb_entries() to check free descriptors?
> >>
> >> Like this:
> >>
> >>
> >> diff --git a/net/xdp/xsk.c b/net/xdp/xsk.c
> >> index e5d104ce7b82..babb7928d335 100644
> >> --- a/net/xdp/xsk.c
> >> +++ b/net/xdp/xsk.c
> >> @@ -993,7 +993,7 @@ static __poll_t xsk_poll(struct file *file, struct
> >> socket *sock,
> >>           if (pool->cached_need_wakeup) {
> >>                   if (xs->zc)
> >>                           xsk_wakeup(xs, pool->cached_need_wakeup);
> >> -               else if (xs->tx)
> >> +               else if (xs->tx && xskq_cons_nb_entries(xs->tx, 1))
> >>                           /* Poll needs to drive Tx also in copy mode */
> >>                           xsk_generic_xmit(sk);
> >>           }
> >>
> >>
Wang Liang April 1, 2025, 9:33 a.m. UTC | #8
在 2025/4/1 16:12, Magnus Karlsson 写道:
> On Tue, 1 Apr 2025 at 09:44, Wang Liang <wangliang74@huawei.com> wrote:
>>
>> 在 2025/4/1 14:57, Magnus Karlsson 写道:
>>> On Tue, 1 Apr 2025 at 04:36, Wang Liang <wangliang74@huawei.com> wrote:
>>>> 在 2025/4/1 6:03, Stanislav Fomichev 写道:
>>>>> On 03/31, Stanislav Fomichev wrote:
>>>>>> On 03/29, Wang Liang wrote:
>>>>>>> The tx_ring_empty_descs count may be incorrect, when set the XDP_TX_RING
>>>>>>> option but do not reserve tx ring. Because xsk_poll() try to wakeup the
>>>>>>> driver by calling xsk_generic_xmit() for non-zero-copy mode. So the
>>>>>>> tx_ring_empty_descs count increases once the xsk_poll()is called:
>>>>>>>
>>>>>>>      xsk_poll
>>>>>>>        xsk_generic_xmit
>>>>>>>          __xsk_generic_xmit
>>>>>>>            xskq_cons_peek_desc
>>>>>>>              xskq_cons_read_desc
>>>>>>>                q->queue_empty_descs++;
>>> Sorry, but I do not understand how to reproduce this error. So you
>>> first issue a setsockopt with the XDP_TX_RING option and then you do
>>> not "reserve tx ring". What does that last "not reserve tx ring" mean?
>>> No mmap() of that ring, or something else? I guess you have bound the
>>> socket with a bind()? Some pseudo code on how to reproduce this would
>>> be helpful. Just want to understand so I can help. Thank you.
>> Sorry, the last email is garbled, and send again.
>>
>> Ok. Some pseudo code like below:
>>
>>       fd = socket(AF_XDP, SOCK_RAW, 0);
>>       setsockopt(fd, SOL_XDP, XDP_UMEM_REG, &mr, sizeof(mr));
>>
>>       setsockopt(fd, SOL_XDP, XDP_UMEM_FILL_RING, &fill_size,
>> sizeof(fill_size));
>>       setsockopt(fd, SOL_XDP, XDP_UMEM_COMPLETION_RING, &comp_size,
>> sizeof(comp_size));
>>       mmap(NULL, off.fr.desc + fill_size * sizeof(__u64), ...,
>> XDP_UMEM_PGOFF_FILL_RING);
>>       mmap(NULL, off.cr.desc + comp_size * sizeof(__u64), ...,
>> XDP_UMEM_PGOFF_COMPLETION_RING);
>>
>>       setsockopt(fd, SOL_XDP, XDP_RX_RING, &rx_size, sizeof(rx_size));
>>       setsockopt(fd, SOL_XDP, XDP_TX_RING, &tx_size, sizeof(tx_size));
>>       mmap(NULL, off.rx.desc + rx_size * sizeof(struct xdp_desc), ...,
>> XDP_PGOFF_RX_RING);
>>       mmap(NULL, off.tx.desc + tx_size * sizeof(struct xdp_desc), ...,
>> XDP_PGOFF_TX_RING);
>>
>>       bind(fd, (struct sockaddr *)&sxdp, sizeof(sxdp));
>>       bpf_map_update_elem(xsk_map_fd, &queue_id, &fd, 0);
>>
>>       while(!global_exit) {
>>           poll(fds, 1, -1);
>>           handle_receive_packets(...);
>>       }
>>
>> The xsk is created success, and xs->tx is initialized.
>>
>> The "not reserve tx ring" means user app do not update tx ring producer.
>> Like:
>>
>>       xsk_ring_prod__reserve(tx, 1, &tx_idx);
>>       xsk_ring_prod__tx_desc(tx, tx_idx)->addr = frame;
>>       xsk_ring_prod__tx_desc(tx, tx_idx)->len = pkg_length;
>>       xsk_ring_prod__submit(tx, 1);
>>
>> These functions (xsk_ring_prod__reserve, etc.) is provided by libxdp.
>>
>> The tx->producer is not updated, so the xs->tx->cached_cons and
>> xs->tx->cached_prod are always zero.
>>
>> When receive packets and user app call poll(), xsk_generic_xmit() will be
>> triggered by xsk_poll(), leading to this issue.
> Thanks, that really helped. The problem here is that the kernel cannot
> guess your intent. Since you created a socket with both Rx and Tx, it
> thinks you will use it for both, so it should increase
> queue_empty_descs in this case as you did not provide any Tx descs.
> Your proposed patch will break this. Consider this Tx case with the
> exact same init code as you have above but with this send loop:
>
> while(!global_exit) {
>         maybe_send_packets(...);
>         poll(fds, 1, -1);
> }
>
> With your patch, the queue_empty_descs will never be increased in the
> case when I do not submit any Tx descs, even though we would like it
> to be so.
>
> So in my mind, you have a couple of options:
>
> * Create two sockets, one rx only and one tx only and use the
> SHARED_UMEM mode to bind them to the same netdev and queue id. In your
> loop above, you would use the Rx socket. This might have the drawback
> that you need to call poll() twice if you are both sending and
> receiving in the same loop. But the stats will be the way you want
> them to be.
>
> * Introduce a new variable in user space that you increase every time
> you do poll() in your loop above. When displaying the statistics, just
> deduct this variable from the queue_empty_descs that the kernel
> reports using the XDP_STATISTICS getsockopt().
>
> Hope this helps.


Thank you for the advices.

 From user view, queue_empty_descs increases when the app only receive 
packets and call poll(), it is some confusing.

In your Tx case, if user app use sendto() to send packets, the 
queue_empty_descs will increase. This is reasonably.

In linux manual, poll() waits for some event on a file descriptor. But 
in af_xdp, poll() has a side effect of send msg.

Previous commit 77cd0d7b3f25 ("xsk: add support for need_wakeup flag in 
AF_XDP rings") add need_wakeup flag. It mainly work in rx process. When 
the application and driver run on the same core, if the fill ring is 
empty, the driver can set the need_wakeup flag and return, so the 
application could be scheduled to produce entries of the fill ring.

The commit df551058f7a3 ("xsk: Fix crash in poll when device does not 
support ndo_xsk_wakeup") add sendmsg function in xsk_poll(), considering 
some devices donot define ndo_xsk_wakeup.

At present the value of need_wakeup & XDP_WAKEUP_TX is always true, 
except the mlx5 driver. If the mlx5 driver queued some packets for tx, 
it may clear XDP_WAKEUP_TX by xsk_clear_tx_need_wakeup(). So when user 
app use sendto() to send packets, __xsk_sendmsg() will return without 
calling xsk_generic_xmit().

So I have a bold suggestion, how about removing xsk_generic_xmit() in 
xsk_poll()?

>>>>>>> To avoid this count error, add check for tx descs before send msg in poll.
>>>>>>>
>>>>>>> Fixes: df551058f7a3 ("xsk: Fix crash in poll when device does not support ndo_xsk_wakeup")
>>>>>>> Signed-off-by: Wang Liang <wangliang74@huawei.com>
>>>>>> Acked-by: Stanislav Fomichev <sdf@fomichev.me>
>>>>> Hmm, wait, I stumbled upon xskq_has_descs again and it looks only at
>>>>> cached prod/cons. How is it supposed to work when the actual tx
>>>>> descriptor is posted? Is there anything besides xskq_cons_peek_desc from
>>>>> __xsk_generic_xmit that refreshes cached_prod?
>>>> Yes, you are right!
>>>>
>>>> How about using xskq_cons_nb_entries() to check free descriptors?
>>>>
>>>> Like this:
>>>>
>>>>
>>>> diff --git a/net/xdp/xsk.c b/net/xdp/xsk.c
>>>> index e5d104ce7b82..babb7928d335 100644
>>>> --- a/net/xdp/xsk.c
>>>> +++ b/net/xdp/xsk.c
>>>> @@ -993,7 +993,7 @@ static __poll_t xsk_poll(struct file *file, struct
>>>> socket *sock,
>>>>            if (pool->cached_need_wakeup) {
>>>>                    if (xs->zc)
>>>>                            xsk_wakeup(xs, pool->cached_need_wakeup);
>>>> -               else if (xs->tx)
>>>> +               else if (xs->tx && xskq_cons_nb_entries(xs->tx, 1))
>>>>                            /* Poll needs to drive Tx also in copy mode */
>>>>                            xsk_generic_xmit(sk);
>>>>            }
>>>>
>>>>
Magnus Karlsson April 1, 2025, 11 a.m. UTC | #9
On Tue, 1 Apr 2025 at 11:33, Wang Liang <wangliang74@huawei.com> wrote:
>
>
> 在 2025/4/1 16:12, Magnus Karlsson 写道:
> > On Tue, 1 Apr 2025 at 09:44, Wang Liang <wangliang74@huawei.com> wrote:
> >>
> >> 在 2025/4/1 14:57, Magnus Karlsson 写道:
> >>> On Tue, 1 Apr 2025 at 04:36, Wang Liang <wangliang74@huawei.com> wrote:
> >>>> 在 2025/4/1 6:03, Stanislav Fomichev 写道:
> >>>>> On 03/31, Stanislav Fomichev wrote:
> >>>>>> On 03/29, Wang Liang wrote:
> >>>>>>> The tx_ring_empty_descs count may be incorrect, when set the XDP_TX_RING
> >>>>>>> option but do not reserve tx ring. Because xsk_poll() try to wakeup the
> >>>>>>> driver by calling xsk_generic_xmit() for non-zero-copy mode. So the
> >>>>>>> tx_ring_empty_descs count increases once the xsk_poll()is called:
> >>>>>>>
> >>>>>>>      xsk_poll
> >>>>>>>        xsk_generic_xmit
> >>>>>>>          __xsk_generic_xmit
> >>>>>>>            xskq_cons_peek_desc
> >>>>>>>              xskq_cons_read_desc
> >>>>>>>                q->queue_empty_descs++;
> >>> Sorry, but I do not understand how to reproduce this error. So you
> >>> first issue a setsockopt with the XDP_TX_RING option and then you do
> >>> not "reserve tx ring". What does that last "not reserve tx ring" mean?
> >>> No mmap() of that ring, or something else? I guess you have bound the
> >>> socket with a bind()? Some pseudo code on how to reproduce this would
> >>> be helpful. Just want to understand so I can help. Thank you.
> >> Sorry, the last email is garbled, and send again.
> >>
> >> Ok. Some pseudo code like below:
> >>
> >>       fd = socket(AF_XDP, SOCK_RAW, 0);
> >>       setsockopt(fd, SOL_XDP, XDP_UMEM_REG, &mr, sizeof(mr));
> >>
> >>       setsockopt(fd, SOL_XDP, XDP_UMEM_FILL_RING, &fill_size,
> >> sizeof(fill_size));
> >>       setsockopt(fd, SOL_XDP, XDP_UMEM_COMPLETION_RING, &comp_size,
> >> sizeof(comp_size));
> >>       mmap(NULL, off.fr.desc + fill_size * sizeof(__u64), ...,
> >> XDP_UMEM_PGOFF_FILL_RING);
> >>       mmap(NULL, off.cr.desc + comp_size * sizeof(__u64), ...,
> >> XDP_UMEM_PGOFF_COMPLETION_RING);
> >>
> >>       setsockopt(fd, SOL_XDP, XDP_RX_RING, &rx_size, sizeof(rx_size));
> >>       setsockopt(fd, SOL_XDP, XDP_TX_RING, &tx_size, sizeof(tx_size));
> >>       mmap(NULL, off.rx.desc + rx_size * sizeof(struct xdp_desc), ...,
> >> XDP_PGOFF_RX_RING);
> >>       mmap(NULL, off.tx.desc + tx_size * sizeof(struct xdp_desc), ...,
> >> XDP_PGOFF_TX_RING);
> >>
> >>       bind(fd, (struct sockaddr *)&sxdp, sizeof(sxdp));
> >>       bpf_map_update_elem(xsk_map_fd, &queue_id, &fd, 0);
> >>
> >>       while(!global_exit) {
> >>           poll(fds, 1, -1);
> >>           handle_receive_packets(...);
> >>       }
> >>
> >> The xsk is created success, and xs->tx is initialized.
> >>
> >> The "not reserve tx ring" means user app do not update tx ring producer.
> >> Like:
> >>
> >>       xsk_ring_prod__reserve(tx, 1, &tx_idx);
> >>       xsk_ring_prod__tx_desc(tx, tx_idx)->addr = frame;
> >>       xsk_ring_prod__tx_desc(tx, tx_idx)->len = pkg_length;
> >>       xsk_ring_prod__submit(tx, 1);
> >>
> >> These functions (xsk_ring_prod__reserve, etc.) is provided by libxdp.
> >>
> >> The tx->producer is not updated, so the xs->tx->cached_cons and
> >> xs->tx->cached_prod are always zero.
> >>
> >> When receive packets and user app call poll(), xsk_generic_xmit() will be
> >> triggered by xsk_poll(), leading to this issue.
> > Thanks, that really helped. The problem here is that the kernel cannot
> > guess your intent. Since you created a socket with both Rx and Tx, it
> > thinks you will use it for both, so it should increase
> > queue_empty_descs in this case as you did not provide any Tx descs.
> > Your proposed patch will break this. Consider this Tx case with the
> > exact same init code as you have above but with this send loop:
> >
> > while(!global_exit) {
> >         maybe_send_packets(...);
> >         poll(fds, 1, -1);
> > }
> >
> > With your patch, the queue_empty_descs will never be increased in the
> > case when I do not submit any Tx descs, even though we would like it
> > to be so.
> >
> > So in my mind, you have a couple of options:
> >
> > * Create two sockets, one rx only and one tx only and use the
> > SHARED_UMEM mode to bind them to the same netdev and queue id. In your
> > loop above, you would use the Rx socket. This might have the drawback
> > that you need to call poll() twice if you are both sending and
> > receiving in the same loop. But the stats will be the way you want
> > them to be.
> >
> > * Introduce a new variable in user space that you increase every time
> > you do poll() in your loop above. When displaying the statistics, just
> > deduct this variable from the queue_empty_descs that the kernel
> > reports using the XDP_STATISTICS getsockopt().
> >
> > Hope this helps.
>
>
> Thank you for the advices.
>
>  From user view, queue_empty_descs increases when the app only receive
> packets and call poll(), it is some confusing.

But if the only thing you are doing in the app is receive packets, you
should create an Rx only socket.

> In your Tx case, if user app use sendto() to send packets, the
> queue_empty_descs will increase. This is reasonably.
>
> In linux manual, poll() waits for some event on a file descriptor. But
> in af_xdp, poll() has a side effect of send msg.

If I remember correctly, we did this so that zero-copy mode and copy
mode should have the same behavior from an app point of view. But I do
not remember why we implemented this behavior in zero-copy in the
first place. Maybe out of necessity since zero-copy Tx is driven by
the driver.

> Previous commit 77cd0d7b3f25 ("xsk: add support for need_wakeup flag in
> AF_XDP rings") add need_wakeup flag. It mainly work in rx process. When
> the application and driver run on the same core, if the fill ring is
> empty, the driver can set the need_wakeup flag and return, so the
> application could be scheduled to produce entries of the fill ring.
>
> The commit df551058f7a3 ("xsk: Fix crash in poll when device does not
> support ndo_xsk_wakeup") add sendmsg function in xsk_poll(), considering
> some devices donot define ndo_xsk_wakeup.
>
> At present the value of need_wakeup & XDP_WAKEUP_TX is always true,
> except the mlx5 driver. If the mlx5 driver queued some packets for tx,
> it may clear XDP_WAKEUP_TX by xsk_clear_tx_need_wakeup(). So when user
> app use sendto() to send packets, __xsk_sendmsg() will return without
> calling xsk_generic_xmit().

Mellanox is doing it in the correct way. The Intel drivers had it like
this too in the beginning, until we discovered a race condition in the
HW in which the interrupt was not armed at the same time as the
need_wakeup flag was cleared. This would then lead to packets not
being sent and user-space not knowing about it. As to why all other
drivers copied this unfortunate fall-back, I do not know.

> So I have a bold suggestion, how about removing xsk_generic_xmit() in
> xsk_poll()?

That would indeed be bold :-)! But we cannot do this since it would
break existing applications that rely on this. I think you have to use
one of the workarounds I pitched in the previous mail. If you are
really just receiving, you should create an Rx only socket.

> >>>>>>> To avoid this count error, add check for tx descs before send msg in poll.
> >>>>>>>
> >>>>>>> Fixes: df551058f7a3 ("xsk: Fix crash in poll when device does not support ndo_xsk_wakeup")
> >>>>>>> Signed-off-by: Wang Liang <wangliang74@huawei.com>
> >>>>>> Acked-by: Stanislav Fomichev <sdf@fomichev.me>
> >>>>> Hmm, wait, I stumbled upon xskq_has_descs again and it looks only at
> >>>>> cached prod/cons. How is it supposed to work when the actual tx
> >>>>> descriptor is posted? Is there anything besides xskq_cons_peek_desc from
> >>>>> __xsk_generic_xmit that refreshes cached_prod?
> >>>> Yes, you are right!
> >>>>
> >>>> How about using xskq_cons_nb_entries() to check free descriptors?
> >>>>
> >>>> Like this:
> >>>>
> >>>>
> >>>> diff --git a/net/xdp/xsk.c b/net/xdp/xsk.c
> >>>> index e5d104ce7b82..babb7928d335 100644
> >>>> --- a/net/xdp/xsk.c
> >>>> +++ b/net/xdp/xsk.c
> >>>> @@ -993,7 +993,7 @@ static __poll_t xsk_poll(struct file *file, struct
> >>>> socket *sock,
> >>>>            if (pool->cached_need_wakeup) {
> >>>>                    if (xs->zc)
> >>>>                            xsk_wakeup(xs, pool->cached_need_wakeup);
> >>>> -               else if (xs->tx)
> >>>> +               else if (xs->tx && xskq_cons_nb_entries(xs->tx, 1))
> >>>>                            /* Poll needs to drive Tx also in copy mode */
> >>>>                            xsk_generic_xmit(sk);
> >>>>            }
> >>>>
> >>>>
diff mbox series

Patch

diff --git a/net/xdp/xsk.c b/net/xdp/xsk.c
index 89d2bef96469..fb01e6736677 100644
--- a/net/xdp/xsk.c
+++ b/net/xdp/xsk.c
@@ -989,7 +989,7 @@  static __poll_t xsk_poll(struct file *file, struct socket *sock,
 	if (pool->cached_need_wakeup) {
 		if (xs->zc)
 			xsk_wakeup(xs, pool->cached_need_wakeup);
-		else if (xs->tx)
+		else if (xs->tx && xskq_has_descs(xs->tx))
 			/* Poll needs to drive Tx also in copy mode */
 			xsk_generic_xmit(sk);
 	}