diff mbox series

[net,v4,1/2] net: sched: fix packet stuck problem for lockless qdisc

Message ID 1618535809-11952-2-git-send-email-linyunsheng@huawei.com (mailing list archive)
State Changes Requested
Delegated to: Netdev Maintainers
Headers show
Series fix packet stuck problem for lockless qdisc | expand

Checks

Context Check Description
netdev/cover_letter success Link
netdev/fixes_present success Link
netdev/patch_count success Link
netdev/tree_selection success Clearly marked for net
netdev/subject_prefix success Link
netdev/cc_maintainers success CCed 15 of 15 maintainers
netdev/source_inline success Was 0 now: 0
netdev/verify_signedoff success Link
netdev/module_param success Was 0 now: 0
netdev/build_32bit success Errors and warnings before: 4504 this patch: 4504
netdev/kdoc success Errors and warnings before: 0 this patch: 0
netdev/verify_fixes success Link
netdev/checkpatch success total: 0 errors, 0 warnings, 0 checks, 84 lines checked
netdev/build_allmodconfig_warn success Errors and warnings before: 4744 this patch: 4744
netdev/header_inline success Link

Commit Message

Yunsheng Lin April 16, 2021, 1:16 a.m. UTC
Lockless qdisc has below concurrent problem:
    cpu0                 cpu1
     .                     .
q->enqueue                 .
     .                     .
qdisc_run_begin()          .
     .                     .
dequeue_skb()              .
     .                     .
sch_direct_xmit()          .
     .                     .
     .                q->enqueue
     .             qdisc_run_begin()
     .            return and do nothing
     .                     .
qdisc_run_end()            .

cpu1 enqueue a skb without calling __qdisc_run() because cpu0
has not released the lock yet and spin_trylock() return false
for cpu1 in qdisc_run_begin(), and cpu0 do not see the skb
enqueued by cpu1 when calling dequeue_skb() because cpu1 may
enqueue the skb after cpu0 calling dequeue_skb() and before
cpu0 calling qdisc_run_end().

Lockless qdisc has below another concurrent problem when
tx_action is involved:

cpu0(serving tx_action)     cpu1             cpu2
          .                   .                .
          .              q->enqueue            .
          .            qdisc_run_begin()       .
          .              dequeue_skb()         .
          .                   .            q->enqueue
          .                   .                .
          .             sch_direct_xmit()      .
          .                   .         qdisc_run_begin()
          .                   .       return and do nothing
          .                   .                .
 clear __QDISC_STATE_SCHED    .                .
 qdisc_run_begin()            .                .
 return and do nothing        .                .
          .                   .                .
          .            qdisc_run_end()         .

This patch fixes the above data race by:
1. Test STATE_MISSED before doing spin_trylock().
2. If the first spin_trylock() return false and STATE_MISSED is
   not set before the first spin_trylock(), Set STATE_MISSED and
   retry another spin_trylock() in case other CPU may not see
   STATE_MISSED after it releases the lock.
3. reschedule if STATE_MISSED is set after the lock is released
   at the end of qdisc_run_end().

For tx_action case, STATE_MISSED is also set when cpu1 is at the
end if qdisc_run_end(), so tx_action will be rescheduled again
to dequeue the skb enqueued by cpu2.

Clear STATE_MISSED before retrying a dequeuing when dequeuing
returns NULL in order to reduce the overhead of the above double
spin_trylock() and __netif_schedule() calling.

The performance impact of this patch, tested using pktgen and
dummy netdev with pfifo_fast qdisc attached:

 threads  without+this_patch   with+this_patch      delta
    1        2.61Mpps            2.60Mpps           -0.3%
    2        3.97Mpps            3.82Mpps           -3.7%
    4        5.62Mpps            5.59Mpps           -0.5%
    8        2.78Mpps            2.77Mpps           -0.3%
   16        2.22Mpps            2.22Mpps           -0.0%

Fixes: 6b3ba9146fe6 ("net: sched: allow qdiscs to handle locking")
Signed-off-by: Yunsheng Lin <linyunsheng@huawei.com>
Tested-by: Juergen Gross <jgross@suse.com>
---
V4: Change STATE_NEED_RESCHEDULE to STATE_MISSED mirroring
    NAPI's NAPIF_STATE_MISSED, and add Juergen's "Tested-by"
    tag for there is only renaming and typo fixing between
    V4 and V3.
V3: Fix a compile error and a few comment typo, remove the
    __QDISC_STATE_DEACTIVATED checking, and update the
    performance data.
V2: Avoid the overhead of fixing the data race as much as
    possible.
---
 include/net/sch_generic.h | 37 ++++++++++++++++++++++++++++++++++++-
 net/sched/sch_generic.c   | 12 ++++++++++++
 2 files changed, 48 insertions(+), 1 deletion(-)

Comments

Michal Kubecek April 19, 2021, 3:29 p.m. UTC | #1
On Fri, Apr 16, 2021 at 09:16:48AM +0800, Yunsheng Lin wrote:
> Lockless qdisc has below concurrent problem:
>     cpu0                 cpu1
>      .                     .
> q->enqueue                 .
>      .                     .
> qdisc_run_begin()          .
>      .                     .
> dequeue_skb()              .
>      .                     .
> sch_direct_xmit()          .
>      .                     .
>      .                q->enqueue
>      .             qdisc_run_begin()
>      .            return and do nothing
>      .                     .
> qdisc_run_end()            .
> 
> cpu1 enqueue a skb without calling __qdisc_run() because cpu0
> has not released the lock yet and spin_trylock() return false
> for cpu1 in qdisc_run_begin(), and cpu0 do not see the skb
> enqueued by cpu1 when calling dequeue_skb() because cpu1 may
> enqueue the skb after cpu0 calling dequeue_skb() and before
> cpu0 calling qdisc_run_end().
> 
> Lockless qdisc has below another concurrent problem when
> tx_action is involved:
> 
> cpu0(serving tx_action)     cpu1             cpu2
>           .                   .                .
>           .              q->enqueue            .
>           .            qdisc_run_begin()       .
>           .              dequeue_skb()         .
>           .                   .            q->enqueue
>           .                   .                .
>           .             sch_direct_xmit()      .
>           .                   .         qdisc_run_begin()
>           .                   .       return and do nothing
>           .                   .                .
>  clear __QDISC_STATE_SCHED    .                .
>  qdisc_run_begin()            .                .
>  return and do nothing        .                .
>           .                   .                .
>           .            qdisc_run_end()         .
> 
> This patch fixes the above data race by:
> 1. Test STATE_MISSED before doing spin_trylock().
> 2. If the first spin_trylock() return false and STATE_MISSED is
>    not set before the first spin_trylock(), Set STATE_MISSED and
>    retry another spin_trylock() in case other CPU may not see
>    STATE_MISSED after it releases the lock.
> 3. reschedule if STATE_MISSED is set after the lock is released
>    at the end of qdisc_run_end().
> 
> For tx_action case, STATE_MISSED is also set when cpu1 is at the
> end if qdisc_run_end(), so tx_action will be rescheduled again
> to dequeue the skb enqueued by cpu2.
> 
> Clear STATE_MISSED before retrying a dequeuing when dequeuing
> returns NULL in order to reduce the overhead of the above double
> spin_trylock() and __netif_schedule() calling.
> 
> The performance impact of this patch, tested using pktgen and
> dummy netdev with pfifo_fast qdisc attached:
> 
>  threads  without+this_patch   with+this_patch      delta
>     1        2.61Mpps            2.60Mpps           -0.3%
>     2        3.97Mpps            3.82Mpps           -3.7%
>     4        5.62Mpps            5.59Mpps           -0.5%
>     8        2.78Mpps            2.77Mpps           -0.3%
>    16        2.22Mpps            2.22Mpps           -0.0%
> 
> Fixes: 6b3ba9146fe6 ("net: sched: allow qdiscs to handle locking")
> Signed-off-by: Yunsheng Lin <linyunsheng@huawei.com>
> Tested-by: Juergen Gross <jgross@suse.com>
> ---
> V4: Change STATE_NEED_RESCHEDULE to STATE_MISSED mirroring
>     NAPI's NAPIF_STATE_MISSED, and add Juergen's "Tested-by"
>     tag for there is only renaming and typo fixing between
>     V4 and V3.
> V3: Fix a compile error and a few comment typo, remove the
>     __QDISC_STATE_DEACTIVATED checking, and update the
>     performance data.
> V2: Avoid the overhead of fixing the data race as much as
>     possible.

As pointed out in the discussion on v3, this patch may result in
significantly higher CPU consumption with multiple threads competing on
a saturated outgoing device. I missed this submission so that I haven't
checked it yet but given the description of v3->v4 changes above, it's
quite likely that it suffers from the same problem.

Michal
Michal Kubecek April 19, 2021, 11:55 p.m. UTC | #2
On Mon, Apr 19, 2021 at 05:29:46PM +0200, Michal Kubecek wrote:
> 
> As pointed out in the discussion on v3, this patch may result in
> significantly higher CPU consumption with multiple threads competing on
> a saturated outgoing device. I missed this submission so that I haven't
> checked it yet but given the description of v3->v4 changes above, it's
> quite likely that it suffers from the same problem.

And it indeed does. However, with the additional patch from the v3
discussion, the numbers are approximately the same as with an unpatched
mainline kernel.

As with v4, I tried this patch on top of 5.12-rc7 with real devices.
I used two machines with 10Gb/s Intel ixgbe NICs, sender has 16 CPUs
(2 8-core CPUs with HT disabled) and 16 Rx/Tx queues, receiver has
48 CPUs (2 12-core CPUs with HT enabled) and 48 Rx/Tx queues.

  threads    5.12-rc7    5.12-rc7 + v4    5.12-rc7 + v4 + stop
     1        25.1%          38.1%            22.9%
     8        66.2%         277.0%            74.1%
    16        90.1%         150.7%            91.0%
    32       107.2%         272.6%           108.3%
    64       116.3%         487.5%           118.1%
   128       126.1%         946.7%           126.9%

(The values are normalized to one core, i.e. 100% corresponds to one
fully used logical CPU.)

So it seems that repeated scheduling while the queue was stopped is
indeed the main performance issue and that other cases of the logic
being too pessimistic do not play significant role. There is an
exception with 8 connections/threads and the result with just this
series also looks abnormally high (e.g. much higher than with
16 threads). It might be worth investigating what happens there and
what do the results with other thread counts around 8 look like.

I'll run some more tests with other traffic patterns tomorrow and
I'm also going to take a closer look at the additional patch.

Michal
Yunsheng Lin April 20, 2021, 2:23 a.m. UTC | #3
On 2021/4/20 7:55, Michal Kubecek wrote:
> On Mon, Apr 19, 2021 at 05:29:46PM +0200, Michal Kubecek wrote:
>>
>> As pointed out in the discussion on v3, this patch may result in
>> significantly higher CPU consumption with multiple threads competing on
>> a saturated outgoing device. I missed this submission so that I haven't
>> checked it yet but given the description of v3->v4 changes above, it's
>> quite likely that it suffers from the same problem.
> 
> And it indeed does. However, with the additional patch from the v3
> discussion, the numbers are approximately the same as with an unpatched
> mainline kernel.
> 
> As with v4, I tried this patch on top of 5.12-rc7 with real devices.
> I used two machines with 10Gb/s Intel ixgbe NICs, sender has 16 CPUs
> (2 8-core CPUs with HT disabled) and 16 Rx/Tx queues, receiver has
> 48 CPUs (2 12-core CPUs with HT enabled) and 48 Rx/Tx queues.
> 
>   threads    5.12-rc7    5.12-rc7 + v4    5.12-rc7 + v4 + stop
>      1        25.1%          38.1%            22.9%
>      8        66.2%         277.0%            74.1%
>     16        90.1%         150.7%            91.0%
>     32       107.2%         272.6%           108.3%
>     64       116.3%         487.5%           118.1%
>    128       126.1%         946.7%           126.9%
> 
> (The values are normalized to one core, i.e. 100% corresponds to one
> fully used logical CPU.)
> 
> So it seems that repeated scheduling while the queue was stopped is
> indeed the main performance issue and that other cases of the logic
> being too pessimistic do not play significant role. There is an
> exception with 8 connections/threads and the result with just this
> series also looks abnormally high (e.g. much higher than with
> 16 threads). It might be worth investigating what happens there and
> what do the results with other thread counts around 8 look like.

Will try to investigate the 8 connections/threads case.

> 
> I'll run some more tests with other traffic patterns tomorrow and
> I'm also going to take a closer look at the additional patch.

Thanks for taking the detail testing and looking.

> 
> Michal
> 
> .
>
Michal Kubecek April 20, 2021, 8:34 p.m. UTC | #4
On Tue, Apr 20, 2021 at 01:55:03AM +0200, Michal Kubecek wrote:
> On Mon, Apr 19, 2021 at 05:29:46PM +0200, Michal Kubecek wrote:
> > 
> > As pointed out in the discussion on v3, this patch may result in
> > significantly higher CPU consumption with multiple threads competing on
> > a saturated outgoing device. I missed this submission so that I haven't
> > checked it yet but given the description of v3->v4 changes above, it's
> > quite likely that it suffers from the same problem.
> 
> And it indeed does. However, with the additional patch from the v3
> discussion, the numbers are approximately the same as with an unpatched
> mainline kernel.
> 
> As with v4, I tried this patch on top of 5.12-rc7 with real devices.
> I used two machines with 10Gb/s Intel ixgbe NICs, sender has 16 CPUs
> (2 8-core CPUs with HT disabled) and 16 Rx/Tx queues, receiver has
> 48 CPUs (2 12-core CPUs with HT enabled) and 48 Rx/Tx queues.
> 
>   threads    5.12-rc7    5.12-rc7 + v4    5.12-rc7 + v4 + stop
>      1        25.1%          38.1%            22.9%
>      8        66.2%         277.0%            74.1%
>     16        90.1%         150.7%            91.0%
>     32       107.2%         272.6%           108.3%
>     64       116.3%         487.5%           118.1%
>    128       126.1%         946.7%           126.9%
> 
> (The values are normalized to one core, i.e. 100% corresponds to one
> fully used logical CPU.)

I repeated the tests few more times and with more iterations and it
seems the problem rather was that the CPU utilization numbers are not
very stable, in particular with number of connections/threads close to
the number of CPUs and Tx queues. Refined results (and also other tests)
show that full 3-patch series performs similar to unpatched 5.12-rc7
(within the margin of statistical error).

However, I noticed something disturbing in the results of a simple
1-thread TCP_STREAM test (client sends data through a TCP connection to
server using long writes, we measure the amount of data received by the
server):

  server: 172.17.1.1, port 12543
  iterations: 20, threads: 1, test length: 30
  test: TCP_STREAM, message size: 1048576
  
  1     927403548.4 B/s,  avg   927403548.4 B/s, mdev           0.0 B/s (  0.0%)
  2    1176317172.1 B/s,  avg  1051860360.2 B/s, mdev   124456811.8 B/s ( 11.8%), confid. +/-  1581348251.3 B/s (150.3%)
  3     927335837.8 B/s,  avg  1010352186.1 B/s, mdev   117354970.3 B/s ( 11.6%), confid. +/-   357073677.2 B/s ( 35.3%)
  4    1176728045.1 B/s,  avg  1051946150.8 B/s, mdev   124576544.7 B/s ( 11.8%), confid. +/-   228863127.8 B/s ( 21.8%)
  5    1176788216.3 B/s,  avg  1076914563.9 B/s, mdev   122102985.3 B/s ( 11.3%), confid. +/-   169478943.5 B/s ( 15.7%)
  6    1158167055.1 B/s,  avg  1090456645.8 B/s, mdev   115504209.5 B/s ( 10.6%), confid. +/-   132805140.8 B/s ( 12.2%)
  7    1176243474.4 B/s,  avg  1102711907.0 B/s, mdev   111069717.1 B/s ( 10.1%), confid. +/-   110956822.2 B/s ( 10.1%)
  8    1176771142.8 B/s,  avg  1111969311.5 B/s, mdev   106744173.5 B/s (  9.6%), confid. +/-    95417120.0 B/s (  8.6%)
  9    1176206364.6 B/s,  avg  1119106761.8 B/s, mdev   102644185.2 B/s (  9.2%), confid. +/-    83685200.5 B/s (  7.5%)
  10   1175888409.4 B/s,  avg  1124784926.6 B/s, mdev    98855550.5 B/s (  8.8%), confid. +/-    74537085.1 B/s (  6.6%)
  11   1176541407.6 B/s,  avg  1129490061.2 B/s, mdev    95422224.8 B/s (  8.4%), confid. +/-    67230249.7 B/s (  6.0%)
  12    934185352.8 B/s,  avg  1113214668.9 B/s, mdev   106114984.5 B/s (  9.5%), confid. +/-    70420712.5 B/s (  6.3%)
  13   1176550558.1 B/s,  avg  1118086660.3 B/s, mdev   103339448.9 B/s (  9.2%), confid. +/-    65002902.4 B/s (  5.8%)
  14   1176521808.8 B/s,  avg  1122260599.5 B/s, mdev   100711151.3 B/s (  9.0%), confid. +/-    60333655.0 B/s (  5.4%)
  15   1176744840.8 B/s,  avg  1125892882.3 B/s, mdev    98240838.2 B/s (  8.7%), confid. +/-    56319052.3 B/s (  5.0%)
  16   1176593778.5 B/s,  avg  1129061688.3 B/s, mdev    95909740.8 B/s (  8.5%), confid. +/-    52771633.5 B/s (  4.7%)
  17   1176583967.4 B/s,  avg  1131857116.5 B/s, mdev    93715582.2 B/s (  8.3%), confid. +/-    49669258.6 B/s (  4.4%)
  18   1176853301.8 B/s,  avg  1134356904.5 B/s, mdev    91656530.2 B/s (  8.1%), confid. +/-    46905244.8 B/s (  4.1%)
  19   1176592845.7 B/s,  avg  1136579848.8 B/s, mdev    89709043.8 B/s (  7.9%), confid. +/-    44424855.9 B/s (  3.9%)
  20   1176608117.3 B/s,  avg  1138581262.2 B/s, mdev    87871692.6 B/s (  7.7%), confid. +/-    42193098.5 B/s (  3.7%)
  all                     avg  1138581262.2 B/s, mdev    87871692.6 B/s (  7.7%), confid. +/-    42193098.5 B/s (  3.7%)

Each line shows result of one 30 second long test and average, mean
deviation and 99% confidence interval half width through the iterations
so far. While 17 iteration results are essentially the wire speed minus
TCP overhead, iterations 1, 3 and 12 are more than 20% lower. As results
of the same test on unpatched 5.12-rc7 are much more consistent (the
lowest iteration result through the whole test was 1175939718.3 and the
mean deviation only 276889.1 B/s), it doesn't seeem to be just a random
fluctuation.

I'll try to find out what happens in these outstanding iterations.

Michal
Yunsheng Lin April 21, 2021, 1:52 a.m. UTC | #5
On 2021/4/21 4:34, Michal Kubecek wrote:
> On Tue, Apr 20, 2021 at 01:55:03AM +0200, Michal Kubecek wrote:
>> On Mon, Apr 19, 2021 at 05:29:46PM +0200, Michal Kubecek wrote:
>>>
>>> As pointed out in the discussion on v3, this patch may result in
>>> significantly higher CPU consumption with multiple threads competing on
>>> a saturated outgoing device. I missed this submission so that I haven't
>>> checked it yet but given the description of v3->v4 changes above, it's
>>> quite likely that it suffers from the same problem.
>>
>> And it indeed does. However, with the additional patch from the v3
>> discussion, the numbers are approximately the same as with an unpatched
>> mainline kernel.
>>
>> As with v4, I tried this patch on top of 5.12-rc7 with real devices.
>> I used two machines with 10Gb/s Intel ixgbe NICs, sender has 16 CPUs
>> (2 8-core CPUs with HT disabled) and 16 Rx/Tx queues, receiver has
>> 48 CPUs (2 12-core CPUs with HT enabled) and 48 Rx/Tx queues.
>>
>>   threads    5.12-rc7    5.12-rc7 + v4    5.12-rc7 + v4 + stop
>>      1        25.1%          38.1%            22.9%
>>      8        66.2%         277.0%            74.1%
>>     16        90.1%         150.7%            91.0%
>>     32       107.2%         272.6%           108.3%
>>     64       116.3%         487.5%           118.1%
>>    128       126.1%         946.7%           126.9%
>>
>> (The values are normalized to one core, i.e. 100% corresponds to one
>> fully used logical CPU.)
> 
> I repeated the tests few more times and with more iterations and it
> seems the problem rather was that the CPU utilization numbers are not
> very stable, in particular with number of connections/threads close to
> the number of CPUs and Tx queues. Refined results (and also other tests)
> show that full 3-patch series performs similar to unpatched 5.12-rc7
> (within the margin of statistical error).

Thanks for the clarifying.
I did a similar test yesterday(using 10Gb netdev and the iperf tool I has),
the initial result suggested the same.

> 
> However, I noticed something disturbing in the results of a simple
> 1-thread TCP_STREAM test (client sends data through a TCP connection to
> server using long writes, we measure the amount of data received by the
> server):
> 
>   server: 172.17.1.1, port 12543
>   iterations: 20, threads: 1, test length: 30
>   test: TCP_STREAM, message size: 1048576
>   
>   1     927403548.4 B/s,  avg   927403548.4 B/s, mdev           0.0 B/s (  0.0%)
>   2    1176317172.1 B/s,  avg  1051860360.2 B/s, mdev   124456811.8 B/s ( 11.8%), confid. +/-  1581348251.3 B/s (150.3%)
>   3     927335837.8 B/s,  avg  1010352186.1 B/s, mdev   117354970.3 B/s ( 11.6%), confid. +/-   357073677.2 B/s ( 35.3%)
>   4    1176728045.1 B/s,  avg  1051946150.8 B/s, mdev   124576544.7 B/s ( 11.8%), confid. +/-   228863127.8 B/s ( 21.8%)
>   5    1176788216.3 B/s,  avg  1076914563.9 B/s, mdev   122102985.3 B/s ( 11.3%), confid. +/-   169478943.5 B/s ( 15.7%)
>   6    1158167055.1 B/s,  avg  1090456645.8 B/s, mdev   115504209.5 B/s ( 10.6%), confid. +/-   132805140.8 B/s ( 12.2%)
>   7    1176243474.4 B/s,  avg  1102711907.0 B/s, mdev   111069717.1 B/s ( 10.1%), confid. +/-   110956822.2 B/s ( 10.1%)
>   8    1176771142.8 B/s,  avg  1111969311.5 B/s, mdev   106744173.5 B/s (  9.6%), confid. +/-    95417120.0 B/s (  8.6%)
>   9    1176206364.6 B/s,  avg  1119106761.8 B/s, mdev   102644185.2 B/s (  9.2%), confid. +/-    83685200.5 B/s (  7.5%)
>   10   1175888409.4 B/s,  avg  1124784926.6 B/s, mdev    98855550.5 B/s (  8.8%), confid. +/-    74537085.1 B/s (  6.6%)
>   11   1176541407.6 B/s,  avg  1129490061.2 B/s, mdev    95422224.8 B/s (  8.4%), confid. +/-    67230249.7 B/s (  6.0%)
>   12    934185352.8 B/s,  avg  1113214668.9 B/s, mdev   106114984.5 B/s (  9.5%), confid. +/-    70420712.5 B/s (  6.3%)
>   13   1176550558.1 B/s,  avg  1118086660.3 B/s, mdev   103339448.9 B/s (  9.2%), confid. +/-    65002902.4 B/s (  5.8%)
>   14   1176521808.8 B/s,  avg  1122260599.5 B/s, mdev   100711151.3 B/s (  9.0%), confid. +/-    60333655.0 B/s (  5.4%)
>   15   1176744840.8 B/s,  avg  1125892882.3 B/s, mdev    98240838.2 B/s (  8.7%), confid. +/-    56319052.3 B/s (  5.0%)
>   16   1176593778.5 B/s,  avg  1129061688.3 B/s, mdev    95909740.8 B/s (  8.5%), confid. +/-    52771633.5 B/s (  4.7%)
>   17   1176583967.4 B/s,  avg  1131857116.5 B/s, mdev    93715582.2 B/s (  8.3%), confid. +/-    49669258.6 B/s (  4.4%)
>   18   1176853301.8 B/s,  avg  1134356904.5 B/s, mdev    91656530.2 B/s (  8.1%), confid. +/-    46905244.8 B/s (  4.1%)
>   19   1176592845.7 B/s,  avg  1136579848.8 B/s, mdev    89709043.8 B/s (  7.9%), confid. +/-    44424855.9 B/s (  3.9%)
>   20   1176608117.3 B/s,  avg  1138581262.2 B/s, mdev    87871692.6 B/s (  7.7%), confid. +/-    42193098.5 B/s (  3.7%)
>   all                     avg  1138581262.2 B/s, mdev    87871692.6 B/s (  7.7%), confid. +/-    42193098.5 B/s (  3.7%)
> 
> Each line shows result of one 30 second long test and average, mean
> deviation and 99% confidence interval half width through the iterations
> so far. While 17 iteration results are essentially the wire speed minus
> TCP overhead, iterations 1, 3 and 12 are more than 20% lower. As results
> of the same test on unpatched 5.12-rc7 are much more consistent (the
> lowest iteration result through the whole test was 1175939718.3 and the
> mean deviation only 276889.1 B/s), it doesn't seeem to be just a random
> fluctuation.

I think I need to relearn the statistial math to understand the above
"99% confidence interval half width ":)

But the problem do not seems related too much with "99% confidence
interval half width ", but with "mean deviation"?

I tried using netperf, which seems only show throughput of 9415.06
(10^6bits/sec) using 10G netdev. which tool did you used to show the
above number?

> 
> I'll try to find out what happens in these outstanding iterations.

As my understanding, if there is only one thread, the first spin_trylock()
is likely to return true for the xmiting thread, unless the scheduled
net_tx_action() cause the xmiting thread's spin_trylock() returning false?

Thanks for comprehensive testing and analysing.

> 
> Michal
> 
> .
>
Michal Kubecek April 21, 2021, 5:31 a.m. UTC | #6
On Wed, Apr 21, 2021 at 09:52:40AM +0800, Yunsheng Lin wrote:
> On 2021/4/21 4:34, Michal Kubecek wrote:
> > However, I noticed something disturbing in the results of a simple
> > 1-thread TCP_STREAM test (client sends data through a TCP connection to
> > server using long writes, we measure the amount of data received by the
> > server):
> > 
> >   server: 172.17.1.1, port 12543
> >   iterations: 20, threads: 1, test length: 30
> >   test: TCP_STREAM, message size: 1048576
> >   
> >   1     927403548.4 B/s,  avg   927403548.4 B/s, mdev           0.0 B/s (  0.0%)
> >   2    1176317172.1 B/s,  avg  1051860360.2 B/s, mdev   124456811.8 B/s ( 11.8%), confid. +/-  1581348251.3 B/s (150.3%)
> >   3     927335837.8 B/s,  avg  1010352186.1 B/s, mdev   117354970.3 B/s ( 11.6%), confid. +/-   357073677.2 B/s ( 35.3%)
> >   4    1176728045.1 B/s,  avg  1051946150.8 B/s, mdev   124576544.7 B/s ( 11.8%), confid. +/-   228863127.8 B/s ( 21.8%)
> >   5    1176788216.3 B/s,  avg  1076914563.9 B/s, mdev   122102985.3 B/s ( 11.3%), confid. +/-   169478943.5 B/s ( 15.7%)
> >   6    1158167055.1 B/s,  avg  1090456645.8 B/s, mdev   115504209.5 B/s ( 10.6%), confid. +/-   132805140.8 B/s ( 12.2%)
> >   7    1176243474.4 B/s,  avg  1102711907.0 B/s, mdev   111069717.1 B/s ( 10.1%), confid. +/-   110956822.2 B/s ( 10.1%)
> >   8    1176771142.8 B/s,  avg  1111969311.5 B/s, mdev   106744173.5 B/s (  9.6%), confid. +/-    95417120.0 B/s (  8.6%)
> >   9    1176206364.6 B/s,  avg  1119106761.8 B/s, mdev   102644185.2 B/s (  9.2%), confid. +/-    83685200.5 B/s (  7.5%)
> >   10   1175888409.4 B/s,  avg  1124784926.6 B/s, mdev    98855550.5 B/s (  8.8%), confid. +/-    74537085.1 B/s (  6.6%)
> >   11   1176541407.6 B/s,  avg  1129490061.2 B/s, mdev    95422224.8 B/s (  8.4%), confid. +/-    67230249.7 B/s (  6.0%)
> >   12    934185352.8 B/s,  avg  1113214668.9 B/s, mdev   106114984.5 B/s (  9.5%), confid. +/-    70420712.5 B/s (  6.3%)
> >   13   1176550558.1 B/s,  avg  1118086660.3 B/s, mdev   103339448.9 B/s (  9.2%), confid. +/-    65002902.4 B/s (  5.8%)
> >   14   1176521808.8 B/s,  avg  1122260599.5 B/s, mdev   100711151.3 B/s (  9.0%), confid. +/-    60333655.0 B/s (  5.4%)
> >   15   1176744840.8 B/s,  avg  1125892882.3 B/s, mdev    98240838.2 B/s (  8.7%), confid. +/-    56319052.3 B/s (  5.0%)
> >   16   1176593778.5 B/s,  avg  1129061688.3 B/s, mdev    95909740.8 B/s (  8.5%), confid. +/-    52771633.5 B/s (  4.7%)
> >   17   1176583967.4 B/s,  avg  1131857116.5 B/s, mdev    93715582.2 B/s (  8.3%), confid. +/-    49669258.6 B/s (  4.4%)
> >   18   1176853301.8 B/s,  avg  1134356904.5 B/s, mdev    91656530.2 B/s (  8.1%), confid. +/-    46905244.8 B/s (  4.1%)
> >   19   1176592845.7 B/s,  avg  1136579848.8 B/s, mdev    89709043.8 B/s (  7.9%), confid. +/-    44424855.9 B/s (  3.9%)
> >   20   1176608117.3 B/s,  avg  1138581262.2 B/s, mdev    87871692.6 B/s (  7.7%), confid. +/-    42193098.5 B/s (  3.7%)
> >   all                     avg  1138581262.2 B/s, mdev    87871692.6 B/s (  7.7%), confid. +/-    42193098.5 B/s (  3.7%)
> > 
> > Each line shows result of one 30 second long test and average, mean
> > deviation and 99% confidence interval half width through the iterations
> > so far. While 17 iteration results are essentially the wire speed minus
> > TCP overhead, iterations 1, 3 and 12 are more than 20% lower. As results
> > of the same test on unpatched 5.12-rc7 are much more consistent (the
> > lowest iteration result through the whole test was 1175939718.3 and the
> > mean deviation only 276889.1 B/s), it doesn't seeem to be just a random
> > fluctuation.
> 
> I think I need to relearn the statistial math to understand the above
> "99% confidence interval half width ":)

An easy way to understand it is that if the last column shows 42 MB/s,
it means that with 99% confidence (probability), the measured average
is within 42 MB/s off the actual one.

> But the problem do not seems related too much with "99% confidence
> interval half width ", but with "mean deviation"?

Mean deviation is a symptom here. What worries me is that most results
show the same value (corresponding to fully saturated line) with very
little variation, in some iterations (1, 3 and 12 here) we can suddenly
see much lower value (by ~2.5 GB/s, i.e. 20-25%). And as each iteration
runs the connection for 30 seconds, it cannot be just some short glitch.

I managed to get tcpdump captures yesterday but they are huge even with
"-s 128" (client ~5.6 GB, server ~9.0 GB) so that working with them is
rather slow so I did not find anything interesting yet.

> I tried using netperf, which seems only show throughput of 9415.06
> (10^6bits/sec) using 10G netdev. which tool did you used to show the
> above number?

9415.06 * 10^6 b/s is 1176.9 * 10^6 B/s so it's about the same as the
numbers above (the good ones, that is). As this was part of a longer
test trying different thread counts from 1 to 128, I was using another
utility I started writing recently:

  https://github.com/mkubecek/nperf

It is still raw and a lot of features are missing but it can be already
used for multithreaded TCP_STREAM and TCP_RR tests. In particular, the
output above was with

  nperf -H 172.17.1.1 -l 30 -i 20 --exact -t TCP_STREAM -M 1

The results are with 1 thread so that they should be also reproducible
with netperf too. But it needs to be repeated enough times, when
I wanted to get the packet captures, I did 40 iterations and only two of
them showed lower result.

Michal
Yunsheng Lin April 21, 2021, 8:21 a.m. UTC | #7
On 2021/4/21 13:31, Michal Kubecek wrote:
> On Wed, Apr 21, 2021 at 09:52:40AM +0800, Yunsheng Lin wrote:
>> On 2021/4/21 4:34, Michal Kubecek wrote:
>>> However, I noticed something disturbing in the results of a simple
>>> 1-thread TCP_STREAM test (client sends data through a TCP connection to
>>> server using long writes, we measure the amount of data received by the
>>> server):
>>>
>>>   server: 172.17.1.1, port 12543
>>>   iterations: 20, threads: 1, test length: 30
>>>   test: TCP_STREAM, message size: 1048576
>>>   
>>>   1     927403548.4 B/s,  avg   927403548.4 B/s, mdev           0.0 B/s (  0.0%)
>>>   2    1176317172.1 B/s,  avg  1051860360.2 B/s, mdev   124456811.8 B/s ( 11.8%), confid. +/-  1581348251.3 B/s (150.3%)
>>>   3     927335837.8 B/s,  avg  1010352186.1 B/s, mdev   117354970.3 B/s ( 11.6%), confid. +/-   357073677.2 B/s ( 35.3%)
>>>   4    1176728045.1 B/s,  avg  1051946150.8 B/s, mdev   124576544.7 B/s ( 11.8%), confid. +/-   228863127.8 B/s ( 21.8%)
>>>   5    1176788216.3 B/s,  avg  1076914563.9 B/s, mdev   122102985.3 B/s ( 11.3%), confid. +/-   169478943.5 B/s ( 15.7%)
>>>   6    1158167055.1 B/s,  avg  1090456645.8 B/s, mdev   115504209.5 B/s ( 10.6%), confid. +/-   132805140.8 B/s ( 12.2%)
>>>   7    1176243474.4 B/s,  avg  1102711907.0 B/s, mdev   111069717.1 B/s ( 10.1%), confid. +/-   110956822.2 B/s ( 10.1%)
>>>   8    1176771142.8 B/s,  avg  1111969311.5 B/s, mdev   106744173.5 B/s (  9.6%), confid. +/-    95417120.0 B/s (  8.6%)
>>>   9    1176206364.6 B/s,  avg  1119106761.8 B/s, mdev   102644185.2 B/s (  9.2%), confid. +/-    83685200.5 B/s (  7.5%)
>>>   10   1175888409.4 B/s,  avg  1124784926.6 B/s, mdev    98855550.5 B/s (  8.8%), confid. +/-    74537085.1 B/s (  6.6%)
>>>   11   1176541407.6 B/s,  avg  1129490061.2 B/s, mdev    95422224.8 B/s (  8.4%), confid. +/-    67230249.7 B/s (  6.0%)
>>>   12    934185352.8 B/s,  avg  1113214668.9 B/s, mdev   106114984.5 B/s (  9.5%), confid. +/-    70420712.5 B/s (  6.3%)
>>>   13   1176550558.1 B/s,  avg  1118086660.3 B/s, mdev   103339448.9 B/s (  9.2%), confid. +/-    65002902.4 B/s (  5.8%)
>>>   14   1176521808.8 B/s,  avg  1122260599.5 B/s, mdev   100711151.3 B/s (  9.0%), confid. +/-    60333655.0 B/s (  5.4%)
>>>   15   1176744840.8 B/s,  avg  1125892882.3 B/s, mdev    98240838.2 B/s (  8.7%), confid. +/-    56319052.3 B/s (  5.0%)
>>>   16   1176593778.5 B/s,  avg  1129061688.3 B/s, mdev    95909740.8 B/s (  8.5%), confid. +/-    52771633.5 B/s (  4.7%)
>>>   17   1176583967.4 B/s,  avg  1131857116.5 B/s, mdev    93715582.2 B/s (  8.3%), confid. +/-    49669258.6 B/s (  4.4%)
>>>   18   1176853301.8 B/s,  avg  1134356904.5 B/s, mdev    91656530.2 B/s (  8.1%), confid. +/-    46905244.8 B/s (  4.1%)
>>>   19   1176592845.7 B/s,  avg  1136579848.8 B/s, mdev    89709043.8 B/s (  7.9%), confid. +/-    44424855.9 B/s (  3.9%)
>>>   20   1176608117.3 B/s,  avg  1138581262.2 B/s, mdev    87871692.6 B/s (  7.7%), confid. +/-    42193098.5 B/s (  3.7%)
>>>   all                     avg  1138581262.2 B/s, mdev    87871692.6 B/s (  7.7%), confid. +/-    42193098.5 B/s (  3.7%)
>>>
>>> Each line shows result of one 30 second long test and average, mean
>>> deviation and 99% confidence interval half width through the iterations
>>> so far. While 17 iteration results are essentially the wire speed minus
>>> TCP overhead, iterations 1, 3 and 12 are more than 20% lower. As results
>>> of the same test on unpatched 5.12-rc7 are much more consistent (the
>>> lowest iteration result through the whole test was 1175939718.3 and the
>>> mean deviation only 276889.1 B/s), it doesn't seeem to be just a random
>>> fluctuation.
>>
>> I think I need to relearn the statistial math to understand the above
>> "99% confidence interval half width ":)
> 
> An easy way to understand it is that if the last column shows 42 MB/s,
> it means that with 99% confidence (probability), the measured average
> is within 42 MB/s off the actual one.
> 
>> But the problem do not seems related too much with "99% confidence
>> interval half width ", but with "mean deviation"?
> 
> Mean deviation is a symptom here. What worries me is that most results
> show the same value (corresponding to fully saturated line) with very
> little variation, in some iterations (1, 3 and 12 here) we can suddenly
> see much lower value (by ~2.5 GB/s, i.e. 20-25%). And as each iteration
> runs the connection for 30 seconds, it cannot be just some short glitch.
> 
> I managed to get tcpdump captures yesterday but they are huge even with
> "-s 128" (client ~5.6 GB, server ~9.0 GB) so that working with them is
> rather slow so I did not find anything interesting yet.
> 
>> I tried using netperf, which seems only show throughput of 9415.06
>> (10^6bits/sec) using 10G netdev. which tool did you used to show the
>> above number?
> 
> 9415.06 * 10^6 b/s is 1176.9 * 10^6 B/s so it's about the same as the
> numbers above (the good ones, that is). As this was part of a longer
> test trying different thread counts from 1 to 128, I was using another
> utility I started writing recently:

I tried using below shell to simulate your testcase:

#!/bin/sh

for((i=0; i<20; i++))
do
        taskset -c 0-31 netperf -t TCP_STREAM -H 192.168.100.2 -l 30 -- -m 1048576
done

And I got a quite stable result: 9413~9416 (10^6bits/sec) for 10G netdev.

> 
>   https://github.com/mkubecek/nperf
> 
> It is still raw and a lot of features are missing but it can be already
> used for multithreaded TCP_STREAM and TCP_RR tests. In particular, the
> output above was with
> 
>   nperf -H 172.17.1.1 -l 30 -i 20 --exact -t TCP_STREAM -M 1

I tried your nperf too, unfortunately it does not seem to work on my
system(arm64), which exits very quickly and output the blow result:

root@(none):/home# nperf -H 192.168.100.2 -l 30 -i 20 --exact -t TCP_STREAM -M 1
server: 192.168.100.2, port 12543
iterations: 20, threads: 1, test length: 30
test: TCP_STREAM, message size: 1048576

1             4.0 B/s,  avg           4.0 B/s, mdev           0.0 B/s (  0.0%)
2             4.0 B/s,  avg           4.0 B/s, mdev           0.0 B/s (  0.0%), confid. +/-           0.0 B/s (  0.0%)
3             4.0 B/s,  avg           4.0 B/s, mdev           0.0 B/s (  0.0%), confid. +/-           nan B/s (  nan%)
4             4.0 B/s,  avg           4.0 B/s, mdev           0.0 B/s (  0.0%), confid. +/-           0.0 B/s (  0.0%)
5             4.0 B/s,  avg           4.0 B/s, mdev           0.0 B/s (  0.0%), confid. +/-           0.0 B/s (  0.0%)
6             4.0 B/s,  avg           4.0 B/s, mdev           0.0 B/s (  0.0%), confid. +/-           0.0 B/s (  0.0%)
7             4.0 B/s,  avg           4.0 B/s, mdev           0.0 B/s (  0.0%), confid. +/-           0.0 B/s (  0.0%)
8             4.0 B/s,  avg           4.0 B/s, mdev           0.0 B/s (  0.0%), confid. +/-           0.0 B/s (  0.0%)
9             4.0 B/s,  avg           4.0 B/s, mdev           0.0 B/s (  0.0%), confid. +/-           0.0 B/s (  0.0%)
10            4.0 B/s,  avg           4.0 B/s, mdev           0.0 B/s (  0.0%), confid. +/-           0.0 B/s (  0.0%)
11            4.0 B/s,  avg           4.0 B/s, mdev           0.0 B/s (  0.0%), confid. +/-           0.0 B/s (  0.0%)
12            4.0 B/s,  avg           4.0 B/s, mdev           0.0 B/s (  0.0%), confid. +/-           0.0 B/s (  0.0%)
13            4.0 B/s,  avg           4.0 B/s, mdev           0.0 B/s (  0.0%), confid. +/-           0.0 B/s (  0.0%)
14            4.0 B/s,  avg           4.0 B/s, mdev           0.0 B/s (  0.0%), confid. +/-           0.0 B/s (  0.0%)
15            4.0 B/s,  avg           4.0 B/s, mdev           0.0 B/s (  0.0%), confid. +/-           0.0 B/s (  0.0%)
16            4.0 B/s,  avg           4.0 B/s, mdev           0.0 B/s (  0.0%), confid. +/-           0.0 B/s (  0.0%)
17            4.0 B/s,  avg           4.0 B/s, mdev           0.0 B/s (  0.0%), confid. +/-           0.0 B/s (  0.0%)
18            4.0 B/s,  avg           4.0 B/s, mdev           0.0 B/s (  0.0%), confid. +/-           0.0 B/s (  0.0%)
19            4.0 B/s,  avg           4.0 B/s, mdev           nan B/s (  nan%), confid. +/-           nan B/s (  nan%)
20            4.0 B/s,  avg           4.0 B/s, mdev           nan B/s (  nan%), confid. +/-           nan B/s (  nan%)
all                     avg           4.0 B/s, mdev           nan B/s (  nan%), confid. +/-           nan B/s (  nan%)

> 
> The results are with 1 thread so that they should be also reproducible
> with netperf too. But it needs to be repeated enough times, when
> I wanted to get the packet captures, I did 40 iterations and only two of
> them showed lower result.
> 
> Michal
> 
> .
>
Michal Kubecek April 21, 2021, 8:44 a.m. UTC | #8
On Wed, Apr 21, 2021 at 04:21:54PM +0800, Yunsheng Lin wrote:
> 
> I tried using below shell to simulate your testcase:
> 
> #!/bin/sh
> 
> for((i=0; i<20; i++))
> do
>         taskset -c 0-31 netperf -t TCP_STREAM -H 192.168.100.2 -l 30 -- -m 1048576
> done
> 
> And I got a quite stable result: 9413~9416 (10^6bits/sec) for 10G netdev.

Perhaps try it without the taskset, in my test, there was only one
connection.

> > 
> >   https://github.com/mkubecek/nperf
> > 
> > It is still raw and a lot of features are missing but it can be already
> > used for multithreaded TCP_STREAM and TCP_RR tests. In particular, the
> > output above was with
> > 
> >   nperf -H 172.17.1.1 -l 30 -i 20 --exact -t TCP_STREAM -M 1
> 
> I tried your nperf too, unfortunately it does not seem to work on my
> system(arm64), which exits very quickly and output the blow result:
> 
> root@(none):/home# nperf -H 192.168.100.2 -l 30 -i 20 --exact -t TCP_STREAM -M 1
> server: 192.168.100.2, port 12543
> iterations: 20, threads: 1, test length: 30
> test: TCP_STREAM, message size: 1048576
> 
> 1             4.0 B/s,  avg           4.0 B/s, mdev           0.0 B/s (  0.0%)
[...]

Did you start nperfd on the other side? (It plays a role similar to
netserver for netperf.) Few days ago I noticed that there is something
wrong with error handling in case of failed connection but didn't get to
fixing it yet.

I'll try running some tests also on other architectures, including arm64
and s390x (to catch potential endinanity issues).

Michal
Yunsheng Lin April 21, 2021, 9:25 a.m. UTC | #9
On 2021/4/21 16:44, Michal Kubecek wrote:
> On Wed, Apr 21, 2021 at 04:21:54PM +0800, Yunsheng Lin wrote:
>>
>> I tried using below shell to simulate your testcase:
>>
>> #!/bin/sh
>>
>> for((i=0; i<20; i++))
>> do
>>         taskset -c 0-31 netperf -t TCP_STREAM -H 192.168.100.2 -l 30 -- -m 1048576
>> done
>>
>> And I got a quite stable result: 9413~9416 (10^6bits/sec) for 10G netdev.
> 
> Perhaps try it without the taskset, in my test, there was only one
> connection.

Just tried, and got the similar result as above.

> 
>>>
>>>   https://github.com/mkubecek/nperf
>>>
>>> It is still raw and a lot of features are missing but it can be already
>>> used for multithreaded TCP_STREAM and TCP_RR tests. In particular, the
>>> output above was with
>>>
>>>   nperf -H 172.17.1.1 -l 30 -i 20 --exact -t TCP_STREAM -M 1
>>
>> I tried your nperf too, unfortunately it does not seem to work on my
>> system(arm64), which exits very quickly and output the blow result:
>>
>> root@(none):/home# nperf -H 192.168.100.2 -l 30 -i 20 --exact -t TCP_STREAM -M 1
>> server: 192.168.100.2, port 12543
>> iterations: 20, threads: 1, test length: 30
>> test: TCP_STREAM, message size: 1048576
>>
>> 1             4.0 B/s,  avg           4.0 B/s, mdev           0.0 B/s (  0.0%)
> [...]
> 
> Did you start nperfd on the other side? (It plays a role similar to
> netserver for netperf.) Few days ago I noticed that there is something
> wrong with error handling in case of failed connection but didn't get to
> fixing it yet.

Yes, I did. If I did not start nperfd, I got "connect: Connection refused"
as below:

nperf -H 192.168.100.2 -l 30 -i 20 --exact -t TCP_STREAM -M 1
server: 192.168.100.2, port 12543
iterations: 20, threads: 1, test length: 30
test: TCP_STREAM, message size: 1048576

connect: Connection refused
failed to connect to '192.168.100.2'
1             4.0 B/s,  avg           4.0 B/s, mdev           0.0 B/s (  0.0%)
connect: Connection refused
failed to connect to '192.168.100.2'
2             4.0 B/s,  avg           4.0 B/s, mdev           0.0 B/s (  0.0%), confid. +/-           0.0 B/s (  0.0%)
connect: Connection refused
failed to connect to '192.168.100.2'



> 
> I'll try running some tests also on other architectures, including arm64
> and s390x (to catch potential endinanity issues).
> 
> Michal
> 
> .
>
Yunsheng Lin April 23, 2021, 9:42 a.m. UTC | #10
On 2021/4/21 17:25, Yunsheng Lin wrote:
> On 2021/4/21 16:44, Michal Kubecek wrote:
> 
>>
>> I'll try running some tests also on other architectures, including arm64
>> and s390x (to catch potential endinanity issues).

I tried debugging nperf in arm64, with the below patch:
diff --git a/client/main.c b/client/main.c
index 429634d..de1a3ef 100644
--- a/client/main.c
+++ b/client/main.c
@@ -63,7 +63,10 @@ static int client_init(void)
        ret = client_set_usr1_handler();
        if (ret < 0)
                return ret;
-       return ignore_signal(SIGPIPE);
+       //return ignore_signal(SIGPIPE);
+       signal(SIGPIPE, SIG_IGN);
+
+       return 0;
 }

 static int ctrl_send_start(struct client_config *config)
diff --git a/client/worker.c b/client/worker.c
index ac026893..d269311 100644
--- a/client/worker.c
+++ b/client/worker.c
@@ -7,7 +7,7 @@
 #include "worker.h"
 #include "main.h"

-#define WORKER_STACK_SIZE 16384
+#define WORKER_STACK_SIZE 131072

 struct client_worker_data *workers_data;
 union sockaddr_any test_addr;

It has below error output:

../nperf/nperf -H 127.0.0.1 -l 3 -i 1 --exact -t TCP_STREAM -M 1
server: 127.0.0.1, port 12543
iterations: 1, threads: 1, test length: 3
test: TCP_STREAM, message size: 1048576

run test begin
send begin
send done: -32
failed to receive server stats
*** Iteration 1 failed, quitting. ***


Tcpdump has below output:
09:55:12.253341 IP localhost.53080 > localhost.12543: Flags [S], seq 3954442980, win 65495, options [mss 65495,sackOK,TS val 3268837738 ecr 0,nop,wscale 7], length 0
09:55:12.253363 IP localhost.12543 > localhost.53080: Flags [S.], seq 4240541653, ack 3954442981, win 65483, options [mss 65495,sackOK,TS val 3268837738 ecr 3268837738,nop,wscale 7], length 0
09:55:12.253379 IP localhost.53080 > localhost.12543: Flags [.], ack 1, win 512, options [nop,nop,TS val 3268837738 ecr 3268837738], length 0
09:55:12.253412 IP localhost.53080 > localhost.12543: Flags [P.], seq 1:29, ack 1, win 512, options [nop,nop,TS val 3268837738 ecr 3268837738], length 28
09:55:12.253863 IP localhost.12543 > localhost.53080: Flags [P.], seq 1:17, ack 29, win 512, options [nop,nop,TS val 3268837739 ecr 3268837738], length 16
09:55:12.253891 IP localhost.53080 > localhost.12543: Flags [.], ack 17, win 512, options [nop,nop,TS val 3268837739 ecr 3268837739], length 0
09:55:12.254265 IP localhost.12543 > localhost.53080: Flags [F.], seq 17, ack 29, win 512, options [nop,nop,TS val 3268837739 ecr 3268837739], length 0
09:55:12.301992 IP localhost.53080 > localhost.12543: Flags [.], ack 18, win 512, options [nop,nop,TS val 3268837787 ecr 3268837739], length 0
09:55:15.254389 IP localhost.53080 > localhost.12543: Flags [F.], seq 29, ack 18, win 512, options [nop,nop,TS val 3268840739 ecr 3268837739], length 0
09:55:15.254426 IP localhost.12543 > localhost.53080: Flags [.], ack 30, win 512, options [nop,nop,TS val 3268840739 ecr 3268840739], length 0


Any idea what went wrong here?

Also, Would you mind running netperf to see if there is similar issue
in your system?

>>
>> Michal
>>
>> .
>>
> 
> 
> .
>
Yunsheng Lin April 30, 2021, 3:11 a.m. UTC | #11
On 2021/4/23 17:42, Yunsheng Lin wrote:
> On 2021/4/21 17:25, Yunsheng Lin wrote:
>> On 2021/4/21 16:44, Michal Kubecek wrote:
>>
>>>
>>> I'll try running some tests also on other architectures, including arm64
>>> and s390x (to catch potential endinanity issues).
> 
> I tried debugging nperf in arm64, with the below patch:
>
> Any idea what went wrong here?
> 
> Also, Would you mind running netperf to see if there is similar issue
> in your system?

Hi, Michal
    I was able to reproduce the fluctuation for one thread TCP_STREAM test,
So I am assuming it may more related to test environment or nperf issue.

   I plan to send v5 with netdev queue stopped handling after the golden
holiday in china. If there is any issue with patchset, please let me know,
thanks.

> 
>>>
>>> Michal
>>>
>>> .
>>>
>>
>>
>> .
>>
> 
> 
> .
>
Yunsheng Lin April 30, 2021, 3:15 a.m. UTC | #12
On 2021/4/30 11:11, Yunsheng Lin wrote:
> On 2021/4/23 17:42, Yunsheng Lin wrote:
>> On 2021/4/21 17:25, Yunsheng Lin wrote:
>>> On 2021/4/21 16:44, Michal Kubecek wrote:
>>>
>>>>
>>>> I'll try running some tests also on other architectures, including arm64
>>>> and s390x (to catch potential endinanity issues).
>>
>> I tried debugging nperf in arm64, with the below patch:
>>
>> Any idea what went wrong here?
>>
>> Also, Would you mind running netperf to see if there is similar issue
>> in your system?
> 
> Hi, Michal
>     I was able to reproduce the fluctuation for one thread TCP_STREAM test,

I was *not* able
Sorry for the typo.

> So I am assuming it may more related to test environment or nperf issue.
> 
>    I plan to send v5 with netdev queue stopped handling after the golden
> holiday in china. If there is any issue with patchset, please let me know,
> thanks.
> 
>>
>>>>
>>>> Michal
>>>>
>>>> .
>>>>
>>>
>>>
>>> .
>>>
>>
>>
>> .
>>
> 
> 
> .
>
Michal Kubecek April 30, 2021, 6:28 a.m. UTC | #13
On Fri, Apr 30, 2021 at 11:15:01AM +0800, Yunsheng Lin wrote:
> On 2021/4/30 11:11, Yunsheng Lin wrote:
> > On 2021/4/23 17:42, Yunsheng Lin wrote:
> >> On 2021/4/21 17:25, Yunsheng Lin wrote:
> >>> On 2021/4/21 16:44, Michal Kubecek wrote:
> >>>
> >>>>
> >>>> I'll try running some tests also on other architectures, including arm64
> >>>> and s390x (to catch potential endinanity issues).
> >>
> >> I tried debugging nperf in arm64, with the below patch:
> >>
> >> Any idea what went wrong here?
> >>
> >> Also, Would you mind running netperf to see if there is similar issue
> >> in your system?
> > 
> > Hi, Michal
> >     I was able to reproduce the fluctuation for one thread TCP_STREAM test,
> 
> I was *not* able
> Sorry for the typo.

I was able to reproduce the same problem with netperf:

marfak:~ # for i in {1..60}; do netperf -H 172.17.1.1 -l 30 -t TCP_STREAM -- -m 1048576; done
131072  16384 1048576    30.00    9413.36   
131072  16384 1048576    30.01    7473.68   <---
131072  16384 1048576    30.00    9413.97   
131072  16384 1048576    30.00    9413.76   
131072  16384 1048576    30.01    9024.25   
131072  16384 1048576    30.01    8364.78   
131072  16384 1048576    30.00    9413.22   
131072  16384 1048576    30.00    9414.29   
131072  16384 1048576    30.00    9414.32   
131072  16384 1048576    30.00    9412.58   
131072  16384 1048576    30.00    9412.79   
131072  16384 1048576    30.00    9413.18   
131072  16384 1048576    30.01    8771.57   <---
131072  16384 1048576    30.00    9414.01   
131072  16384 1048576    30.00    9413.93   
131072  16384 1048576    30.00    9413.97   
131072  16384 1048576    30.00    9414.05   
131072  16384 1048576    30.00    9412.92   
131072  16384 1048576    30.00    9413.40   
131072  16384 1048576    30.00    9414.41   
131072  16384 1048576    30.00    9413.25   
131072  16384 1048576    30.00    9413.38   
131072  16384 1048576    30.00    9412.28   
131072  16384 1048576    30.00    9413.50   
131072  16384 1048576    30.00    9414.12   
131072  16384 1048576    30.00    9414.27   
131072  16384 1048576    30.00    9412.96   
131072  16384 1048576    30.00    9413.71   
131072  16384 1048576    30.01    9205.98   
131072  16384 1048576    30.00    9413.69   
131072  16384 1048576    30.00    9413.60   
131072  16384 1048576    30.01    8297.03   <---
131072  16384 1048576    30.00    9414.09   
131072  16384 1048576    30.00    9414.38   
131072  16384 1048576    30.00    9413.62   
131072  16384 1048576    30.00    9411.09   
131072  16384 1048576    30.00    9414.37   
131072  16384 1048576    30.00    9414.37   
131072  16384 1048576    30.00    9412.52   
131072  16384 1048576    30.00    9414.06   
131072  16384 1048576    30.00    9413.66   
131072  16384 1048576    30.00    9411.63   
131072  16384 1048576    30.00    9414.17   
131072  16384 1048576    30.00    9414.07   
131072  16384 1048576    30.00    9414.09   
131072  16384 1048576    30.00    9414.37   
131072  16384 1048576    30.00    9390.00   
131072  16384 1048576    30.00    9413.72   
131072  16384 1048576    30.01    9260.97   
131072  16384 1048576    30.01    9334.91   
131072  16384 1048576    30.00    9413.57   
131072  16384 1048576    30.00    9412.01   
131072  16384 1048576    30.00    9414.36   
131072  16384 1048576    30.00    9412.47   
131072  16384 1048576    30.00    9413.73   
131072  16384 1048576    30.00    9413.48   
131072  16384 1048576    30.00    9413.36   
131072  16384 1048576    30.01    9327.42   
131072  16384 1048576    30.01    9240.33   
131072  16384 1048576    30.00    9413.97   

(filtered only the interesting lines)

But after some more testing, I was also able to see similar results with
unpatched mainline kernel:

131072  16384 1048576    30.00    9413.28   
131072  16384 1048576    30.01    9007.17   
131072  16384 1048576    30.01    9153.22   
131072  16384 1048576    30.00    9414.28   
131072  16384 1048576    30.01    9244.68   
131072  16384 1048576    30.01    9230.49   
131072  16384 1048576    30.00    8723.24   <---
131072  16384 1048576    30.01    8289.21   <---
131072  16384 1048576    30.01    9258.33   
131072  16384 1048576    30.00    9251.47   
131072  16384 1048576    30.00    9414.23   
131072  16384 1048576    30.01    9276.87   
131072  16384 1048576    30.01    9255.61   
131072  16384 1048576    30.00    9072.78   
131072  16384 1048576    30.00    9412.09   
131072  16384 1048576    30.01    9393.00   
131072  16384 1048576    30.00    9413.39   
131072  16384 1048576    30.01    9404.01   
131072  16384 1048576    30.01    8412.83   <---
131072  16384 1048576    30.01    9368.23   
131072  16384 1048576    30.01    9259.11   
131072  16384 1048576    30.01    9121.65   
131072  16384 1048576    30.01    9169.87   
131072  16384 1048576    30.01    9154.03   
131072  16384 1048576    30.01    9336.34   
131072  16384 1048576    30.00    9187.73   
131072  16384 1048576    30.00    9412.54   
131072  16384 1048576    30.01    6836.37   <---
131072  16384 1048576    30.01    9388.09   
131072  16384 1048576    30.01    8755.78   <---
131072  16384 1048576    30.01    9167.63   
131072  16384 1048576    30.00    9410.80   
131072  16384 1048576    30.01    9392.71   
131072  16384 1048576    30.01    9238.50   
131072  16384 1048576    30.01    9382.78   
131072  16384 1048576    30.01    9328.23   
131072  16384 1048576    30.01    9396.04   
131072  16384 1048576    30.01    9286.10   
131072  16384 1048576    30.00    9412.44   
131072  16384 1048576    30.01    7952.34   <---
131072  16384 1048576    30.01    9309.95   
131072  16384 1048576    30.00    9133.38   
131072  16384 1048576    30.01    8672.75   
131072  16384 1048576    30.00    9414.28   
131072  16384 1048576    30.00    9411.34   
131072  16384 1048576    30.00    9414.27   
131072  16384 1048576    30.01    9313.60   
131072  16384 1048576    30.01    9315.10   
131072  16384 1048576    30.00    9413.23   
131072  16384 1048576    30.01    9285.77   
131072  16384 1048576    30.00    9414.28   
131072  16384 1048576    30.00    9406.39   
131072  16384 1048576    30.01    9343.74   
131072  16384 1048576    30.01    9179.17   
131072  16384 1048576    30.01    9081.18   
131072  16384 1048576    30.00    9412.85   
131072  16384 1048576    30.00    9413.66   
131072  16384 1048576    30.01    9346.16   
131072  16384 1048576    30.00    9410.01   
131072  16384 1048576    30.00    9411.22   

It's not clear why I haven't seen these before but the problem is
unlikely to by related to your patch set.

Michal
diff mbox series

Patch

diff --git a/include/net/sch_generic.h b/include/net/sch_generic.h
index f7a6e14..b85b8ea 100644
--- a/include/net/sch_generic.h
+++ b/include/net/sch_generic.h
@@ -36,6 +36,7 @@  struct qdisc_rate_table {
 enum qdisc_state_t {
 	__QDISC_STATE_SCHED,
 	__QDISC_STATE_DEACTIVATED,
+	__QDISC_STATE_MISSED,
 };
 
 struct qdisc_size_table {
@@ -159,8 +160,37 @@  static inline bool qdisc_is_empty(const struct Qdisc *qdisc)
 static inline bool qdisc_run_begin(struct Qdisc *qdisc)
 {
 	if (qdisc->flags & TCQ_F_NOLOCK) {
+		bool dont_retry = test_bit(__QDISC_STATE_MISSED,
+					   &qdisc->state);
+
+		if (spin_trylock(&qdisc->seqlock))
+			goto nolock_empty;
+
+		/* If the flag is set before doing the spin_trylock() and
+		 * the above spin_trylock() return false, it means other cpu
+		 * holding the lock will do dequeuing for us, or it wil see
+		 * the flag set after releasing lock and reschedule the
+		 * net_tx_action() to do the dequeuing.
+		 */
+		if (dont_retry)
+			return false;
+
+		/* We could do set_bit() before the first spin_trylock(),
+		 * and avoid doing second spin_trylock() completely, then
+		 * we could have multi cpus doing the set_bit(). Here use
+		 * dont_retry to avoid doing the set_bit() and the second
+		 * spin_trylock(), which has 5% performance improvement than
+		 * doing the set_bit() before the first spin_trylock().
+		 */
+		set_bit(__QDISC_STATE_MISSED, &qdisc->state);
+
+		/* Retry again in case other CPU may not see the new flag
+		 * after it releases the lock at the end of qdisc_run_end().
+		 */
 		if (!spin_trylock(&qdisc->seqlock))
 			return false;
+
+nolock_empty:
 		WRITE_ONCE(qdisc->empty, false);
 	} else if (qdisc_is_running(qdisc)) {
 		return false;
@@ -176,8 +206,13 @@  static inline bool qdisc_run_begin(struct Qdisc *qdisc)
 static inline void qdisc_run_end(struct Qdisc *qdisc)
 {
 	write_seqcount_end(&qdisc->running);
-	if (qdisc->flags & TCQ_F_NOLOCK)
+	if (qdisc->flags & TCQ_F_NOLOCK) {
 		spin_unlock(&qdisc->seqlock);
+
+		if (unlikely(test_bit(__QDISC_STATE_MISSED,
+				      &qdisc->state)))
+			__netif_schedule(qdisc);
+	}
 }
 
 static inline bool qdisc_may_bulk(const struct Qdisc *qdisc)
diff --git a/net/sched/sch_generic.c b/net/sched/sch_generic.c
index 44991ea..9bc73ea 100644
--- a/net/sched/sch_generic.c
+++ b/net/sched/sch_generic.c
@@ -640,8 +640,10 @@  static struct sk_buff *pfifo_fast_dequeue(struct Qdisc *qdisc)
 {
 	struct pfifo_fast_priv *priv = qdisc_priv(qdisc);
 	struct sk_buff *skb = NULL;
+	bool need_retry = true;
 	int band;
 
+retry:
 	for (band = 0; band < PFIFO_FAST_BANDS && !skb; band++) {
 		struct skb_array *q = band2list(priv, band);
 
@@ -652,6 +654,16 @@  static struct sk_buff *pfifo_fast_dequeue(struct Qdisc *qdisc)
 	}
 	if (likely(skb)) {
 		qdisc_update_stats_at_dequeue(qdisc, skb);
+	} else if (need_retry &&
+		   test_and_clear_bit(__QDISC_STATE_MISSED,
+				      &qdisc->state)) {
+		/* do another dequeuing after clearing the flag to
+		 * avoid calling __netif_schedule().
+		 */
+		smp_mb__after_atomic();
+		need_retry = false;
+
+		goto retry;
 	} else {
 		WRITE_ONCE(qdisc->empty, true);
 	}