
[BUG,REGRESSION?] 3.11.6+,3.12: GbE iface rate drops to few KB/s

Message ID 1384710098.8604.58.camel@edumazet-glaptop2.roam.corp.google.com (mailing list archive)
State New, archived

Commit Message

Eric Dumazet Nov. 17, 2013, 5:41 p.m. UTC
On Sun, 2013-11-17 at 15:19 +0100, Willy Tarreau wrote:

> 
> So it is fairly possible that in your case you can't fill the link if you
> consume too many descriptors. For example, if your server uses TCP_NODELAY
> and sends incomplete segments (which is quite common), it's very easy to
> run out of descriptors before the link is full.

BTW I have a very simple patch for TCP stack that could help this exact
situation...

Idea is to use TCP Small Queues so that we don't fill the qdisc/TX ring
with very small frames, and let tcp_sendmsg() have more chances to fill
complete packets.

Again, for this to work very well, you need the NIC to perform TX
completion in a reasonable amount of time...
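
Concretely, with the patch below, as soon as more than 2048 bytes are
already queued (sk_wmem_alloc), tcp_push() sets TSQ_THROTTLED and corks
the flow, so the push is resumed from the TX completion path once queued
data has actually left the stack.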

Comments

Arnaud Ebalard Nov. 19, 2013, 6:44 a.m. UTC | #1
Hi,

Eric Dumazet <eric.dumazet@gmail.com> writes:

> On Sun, 2013-11-17 at 15:19 +0100, Willy Tarreau wrote:
>
>> 
>> So it is fairly possible that in your case you can't fill the link if you
>> consume too many descriptors. For example, if your server uses TCP_NODELAY
>> and sends incomplete segments (which is quite common), it's very easy to
>> run out of descriptors before the link is full.
>
> BTW I have a very simple patch for TCP stack that could help this exact
> situation...
>
> Idea is to use TCP Small Queues so that we don't fill the qdisc/TX ring
> with very small frames, and let tcp_sendmsg() have more chances to fill
> complete packets.
>
> Again, for this to work very well, you need the NIC to perform TX
> completion in a reasonable amount of time...
>
> diff --git a/net/ipv4/tcp.c b/net/ipv4/tcp.c
> index 3dc0c6c..10456cf 100644
> --- a/net/ipv4/tcp.c
> +++ b/net/ipv4/tcp.c
> @@ -624,13 +624,19 @@ static inline void tcp_push(struct sock *sk, int flags, int mss_now,
>  {
>  	if (tcp_send_head(sk)) {
>  		struct tcp_sock *tp = tcp_sk(sk);
> +		struct sk_buff *skb = tcp_write_queue_tail(sk);
>  
>  		if (!(flags & MSG_MORE) || forced_push(tp))
> -			tcp_mark_push(tp, tcp_write_queue_tail(sk));
> +			tcp_mark_push(tp, skb);
>  
>  		tcp_mark_urg(tp, flags);
> -		__tcp_push_pending_frames(sk, mss_now,
> -					  (flags & MSG_MORE) ? TCP_NAGLE_CORK : nonagle);
> +		if (flags & MSG_MORE)
> +			nonagle = TCP_NAGLE_CORK;
> +		if (atomic_read(&sk->sk_wmem_alloc) > 2048) {
> +			set_bit(TSQ_THROTTLED, &tp->tsq_flags);
> +			nonagle = TCP_NAGLE_CORK;
> +		}
> +		__tcp_push_pending_frames(sk, mss_now, nonagle);
>  	}
>  }

I ran some tests regarding mvneta perf on the current Linus tree (commit
2d3c627502f2a9b0, w/ c9eeec26e32e "tcp: TSQ can use a dynamic limit"
reverted). It has Simon's tclk patch for mvebu (1022c75f5abd, "clk:
armada-370: fix tclk frequencies"). The kernel has some debug options
enabled and the patch above is not applied. I will spend some time on
these two directions this evening. The idea was to get some numbers on
the impact of TCP send window size and tcp_limit_output_bytes for
mvneta.

The test is done with a laptop (Debian, 3.11.0, e1000e) directly
connected to a RN102 (Marvell Armada 370 @1.2GHz, mvneta). The RN102
is running Debian armhf with an Apache2 serving a 1GB file from ext4
over lvm over RAID1 from 2 WD30EFRX. The client is nothing fancy, i.e.
a simple wget w/ -O /dev/null option.

With the exact same setup on a ReadyNAS Duo v2 (Kirkwood 88f6282
@1.6GHz, mv643xx_eth), I managed to get a throughput of 108MB/s
(cannot remember the kernel version, but something between 3.8 and 3.10).

So with that setup:

w/ TCP send window set to   4MB: 17.4 MB/s
w/ TCP send window set to   2MB: 16.2 MB/s
w/ TCP send window set to   1MB: 15.6 MB/s
w/ TCP send window set to 512KB: 25.6 MB/s
w/ TCP send window set to 256KB: 57.7 MB/s
w/ TCP send window set to 128KB: 54.0 MB/s
w/ TCP send window set to  64KB: 46.2 MB/s
w/ TCP send window set to  32KB: 42.8 MB/s

Then, I started playing w/ tcp_limit_output_bytes (default is 131072),
w/ TCP send window set to 256KB:

tcp_limit_output_bytes set to 512KB: 59.3 MB/s
tcp_limit_output_bytes set to 256KB: 58.5 MB/s
tcp_limit_output_bytes set to 128KB: 56.2 MB/s
tcp_limit_output_bytes set to  64KB: 32.1 MB/s
tcp_limit_output_bytes set to  32KB: 4.76 MB/s

As a side note, during the test I sometimes get peaks at 90MB/s for some
seconds at the beginning, which tends to confirm what Willy wrote,
i.e. that the hardware can do more.

Cheers,

a+
Eric Dumazet Nov. 19, 2013, 1:53 p.m. UTC | #2
On Tue, 2013-11-19 at 07:44 +0100, Arnaud Ebalard wrote:

> I ran some tests regarding mvneta perf on the current Linus tree (commit
> 2d3c627502f2a9b0, w/ c9eeec26e32e "tcp: TSQ can use a dynamic limit"
> reverted). It has Simon's tclk patch for mvebu (1022c75f5abd, "clk:
> armada-370: fix tclk frequencies"). The kernel has some debug options
> enabled and the patch above is not applied. I will spend some time on
> these two directions this evening. The idea was to get some numbers on
> the impact of TCP send window size and tcp_limit_output_bytes for
> mvneta.

Note the last patch I sent was not relevant to your problem, do not
bother trying it. Its useful for applications doing lot of consecutive
short writes, like interactive ssh launching line buffered commands.

>  
> 
> The test is done with a laptop (Debian, 3.11.0, e1000e) directly
> connected to a RN102 (Marvell Armada 370 @1.2GHz, mvneta). The RN102
> is running Debian armhf with an Apache2 serving a 1GB file from ext4
> over lvm over RAID1 from 2 WD30EFRX. The client is nothing fancy, i.e.
> a simple wget w/ -O /dev/null option.
> 
> With the exact same setup on a ReadyNAS Duo v2 (Kirkwood 88f6282
> @1.6GHz, mv643xx_eth), I managed to get a throughput of 108MB/s
> (cannot remember the kernel version, but something between 3.8 and 3.10).
> 
> So with that setup:
> 
> w/ TCP send window set to   4MB: 17.4 MB/s
> w/ TCP send window set to   2MB: 16.2 MB/s
> w/ TCP send window set to   1MB: 15.6 MB/s
> w/ TCP send window set to 512KB: 25.6 MB/s
> w/ TCP send window set to 256KB: 57.7 MB/s
> w/ TCP send window set to 128KB: 54.0 MB/s
> w/ TCP send window set to  64KB: 46.2 MB/s
> w/ TCP send window set to  32KB: 42.8 MB/s

One of the problems is that tcp_sendmsg() holds the socket lock for the
whole duration of the system call if it does not have to sleep. This
model doesn't allow incoming ACKs to be processed (they are put in the
socket backlog and will be processed at socket release time), nor TX
completion to queue the next chunk.

These strange results you have tend to show that if you have a big TCP
send window, the web server pushes a lot of bytes per system call and
might stall the ACK clocking or TX refills.
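
A rough sketch of the locking pattern (illustrative, not the literal
kernel code):

        lock_sock(sk);          /* tcp_sendmsg() takes socket ownership */

        /* ... long loop copying user data into the write queue ... */
        /* ACKs arriving meanwhile are queued via sk_add_backlog()
         * instead of being processed directly by the softirq handler */

        release_sock(sk);       /* backlogged ACKs only get processed here */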

> 
> Then, I started playing w/ tcp_limit_output_bytes (default is 131072),
> w/ TCP send window set to 256KB:
> 
> tcp_limit_output_bytes set to 512KB: 59.3 MB/s
> tcp_limit_output_bytes set to 256KB: 58.5 MB/s
> tcp_limit_output_bytes set to 128KB: 56.2 MB/s
> tcp_limit_output_bytes set to  64KB: 32.1 MB/s
> tcp_limit_output_bytes set to  32KB: 4.76 MB/s
> 
> As a side note, during the test I sometimes get peaks at 90MB/s for some
> seconds at the beginning, which tends to confirm what Willy wrote,
> i.e. that the hardware can do more.

I would also check the receiver. I suspect packet drops because of a
bad driver overshooting skb->truesize.

nstat >/dev/null ; wget .... ; nstat
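
(If the receiver is at fault, counters such as TcpExtPruneCalled,
TcpExtRcvPruned or TcpExtTCPRcvCollapsed in the nstat output would point
at receive buffer pressure.)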
Willy Tarreau Nov. 19, 2013, 5:43 p.m. UTC | #3
Hi Eric,

On Tue, Nov 19, 2013 at 05:53:14AM -0800, Eric Dumazet wrote:
> These strange results you have tend to show that if you have a big TCP
> send window, the web server pushes a lot of bytes per system call and
> might stall the ACK clocking or TX refills.

It's the TX refills which are not done in this case, from what I think I
understood of the driver. IIRC, the refill is done once at the beginning
of xmit and in the tx timer callback. So if too large a window fills the
descriptors over a few tx calls during which no descriptor was released,
you could end up having to wait for the timer since you're not allowed
to send anymore.
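
Schematically, what I mean is something like this (names other than
MVNETA_TX_DONE_TIMER_PERIOD are made up; this is not the exact mvneta
code):

        static void tx_done_timer(unsigned long data)
        {
                struct port *pp = (struct port *)data;

                /* reclaim completed TX descriptors, waking the queue
                 * if it was stopped for lack of free descriptors */
                reclaim_tx_descs(pp);

                if (tx_descs_in_flight(pp))
                        mod_timer(&pp->tx_done_timer,
                                  jiffies + msecs_to_jiffies(MVNETA_TX_DONE_TIMER_PERIOD));
        }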

> > Then, I started playing w/ tcp_limit_output_bytes (default is 131072),
> > w/ TCP send window set to 256KB:
> > 
> > tcp_limit_output_bytes set to 512KB: 59.3 MB/s
> > tcp_limit_output_bytes set to 256KB: 58.5 MB/s
> > tcp_limit_output_bytes set to 128KB: 56.2 MB/s
> > tcp_limit_output_bytes set to  64KB: 32.1 MB/s
> > tcp_limit_output_bytes set to  32KB: 4.76 MB/s
> > 
> > As a side note, during the test I sometimes get peaks at 90MB/s for some
> > seconds at the beginning, which tends to confirm what Willy wrote,
> > i.e. that the hardware can do more.
> 
> I would also check the receiver. I suspect packet drops because of a
> bad driver overshooting skb->truesize.

When I first observed the issue, I suspected my laptop's driver, so I
tried with a dockstar instead and the issue disappeared... until I
increased the tcp_rmem on it to match my laptop :-)

Arnaud, you might be interested in checking whether the following change
does something for you in mvneta.c:

- #define MVNETA_TX_DONE_TIMER_PERIOD 10
+ #define MVNETA_TX_DONE_TIMER_PERIOD (1000/HZ)

This can only have any effect if you run at 250 or 1000 Hz, but not at 100
of course. It should reduce the time to first IRQ.
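
Assuming the period is fed through msecs_to_jiffies(), the arithmetic
goes like this: at HZ=100, both 10 and 1000/HZ (=10) give 10ms; at
HZ=250, 10ms rounds up to 3 jiffies (12ms) while 1000/HZ=4ms fits in a
single jiffy; at HZ=1000 the delay drops from 10ms to 1ms.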

Willy
Eric Dumazet Nov. 19, 2013, 6:31 p.m. UTC | #4
On Tue, 2013-11-19 at 18:43 +0100, Willy Tarreau wrote:

> - #define MVNETA_TX_DONE_TIMER_PERIOD 10
> + #define MVNETA_TX_DONE_TIMER_PERIOD (1000/HZ)
> 

I suggested this in a prior mail :

#define MVNETA_TX_DONE_TIMER_PERIOD 1

But apparently it was triggering strange crashes...

> This can only have any effect if you run at 250 or 1000 Hz, but not at 100
> of course. It should reduce the time to first IRQ.
Willy Tarreau Nov. 19, 2013, 6:41 p.m. UTC | #5
On Tue, Nov 19, 2013 at 10:31:50AM -0800, Eric Dumazet wrote:
> On Tue, 2013-11-19 at 18:43 +0100, Willy Tarreau wrote:
> 
> > - #define MVNETA_TX_DONE_TIMER_PERIOD 10
> > + #define MVNETA_TX_DONE_TIMER_PERIOD (1000/HZ)
> > 
> 
> I suggested this in a prior mail :
> 
> #define MVNETA_TX_DONE_TIMER_PERIOD 1

Ah sorry, I remember now.

> But apparently it was triggering strange crashes...

Ah, when a bug hides another one, it's the situation I prefer, because
by working on one, you end up fixing two :-)

Cheers,
Willy
Arnaud Ebalard Nov. 19, 2013, 11:53 p.m. UTC | #6
Hi,

Willy Tarreau <w@1wt.eu> writes:

> On Tue, Nov 19, 2013 at 10:31:50AM -0800, Eric Dumazet wrote:
>> On Tue, 2013-11-19 at 18:43 +0100, Willy Tarreau wrote:
>> 
>> > - #define MVNETA_TX_DONE_TIMER_PERIOD 10
>> > + #define MVNETA_TX_DONE_TIMER_PERIOD (1000/HZ)
>> > 
>> 
>> I suggested this in a prior mail :
>> 
>> #define MVNETA_TX_DONE_TIMER_PERIOD 1
>
> Ah sorry, I remember now.
>
>> But apparently it was triggering strange crashes...
>
> Ah, when a bug hides another one, it's the situation I prefer, because
> by working on one, you end up fixing two :-)

Follow me for just one sec: today, I got a USB 3.0 Gigabit Ethernet
adapter, more specifically an AX88179-based one (Logitec LAN-GTJU3H3),
about which there is currently a thread on the netdev and linux-usb
lists. Anyway, I decided to give it a try on my RN102 just to check what
performance I could achieve. So I basically did the same experiment as
yesterday (wget on the client against a 1GB file located on the
filesystem served by an apache on the NAS) except that this time the
AX88179-based adapter was used instead of the mvneta-based interface.
Well, the download started at a high rate (90MB/s) but then dropped, and
I got some SATA errors on the NAS (similar to the errors I already got
during the 3.12-rc series [1] and finally *erroneously* considered an
artefact).

So I decided to remove the SATA controllers and disks from the equation:
I switched to my ReadyNAS 2120, whose GbE interfaces are also based on
the mvneta driver and which comes w/ 2GB of RAM. The main additional
difference is that the device is a dual core armada @1.2GHz, whereas the
RN102 is a single core armada @1.2GHz. I created a dummy 1GB file *in
RAM* (/run/shm) to have it served by the apache2 instead of the file
previously stored on the disks.

I started w/ today's Linus tree (dec8e46178b) with Eric's revert patch
for c9eeec26e32e (tcp: TSQ can use a dynamic limit) and also the change
to the mvneta driver to have:

-#define MVNETA_TX_DONE_TIMER_PERIOD    10
+#define MVNETA_TX_DONE_TIMER_PERIOD    1

Here are the average speeds given by wget for the following TCP send
windows:

   4 MB:  19 MB/s
   2 MB:  21 MB/s
   1 MB:  21 MB/s
  512KB:  23 MB/s
  384KB: 105 MB/s
  256KB: 112 MB/s
  128KB: 111 MB/s
   64KB:  93 MB/s

Then, I decided to redo the exact same test w/o the change on
MVNETA_TX_DONE_TIMER_PERIOD (i.e. w/ the initial value of 10). I got the
exact same results as with MVNETA_TX_DONE_TIMER_PERIOD set to 1, i.e.:

   4 MB:  20 MB/s
   2 MB:  21 MB/s
   1 MB:  21 MB/s
  512KB:  22 MB/s
  384KB: 105 MB/s
  256KB: 112 MB/s
  128KB: 111 MB/s
   64KB:  93 MB/s

And then, I also dropped Eric's revert patch for c9eeec26e32e (tcp: TSQ
can use a dynamic limit), just to verify we were back where the thread
started, but I got a surprise:

   4 MB:  10 MB/s
   2 MB:  11 MB/s
   1 MB:  10 MB/s
  512KB:  12 MB/s
  384KB: 104 MB/s
  256KB: 112 MB/s
  128KB: 112 MB/s
   64KB:  93 MB/s

Instead of the 256KB/s I had observed before, the low value was now
10MB/s. I thought it was due to the switch from the RN102 to the RN2120,
so I went back to the RN102 w/o any specific patch for mvneta or your
revert patch for c9eeec26e32e, i.e. only Linus' tree as it is today
(dec8e46178b). The file is served from the disk:

   4 MB:   5 MB/s
   2 MB:   5 MB/s
   1 MB:   5 MB/s
  512KB:   5 MB/s
  384KB:  90 MB/s for 4s, then 3 MB/s
  256KB:  80 MB/s for 3s, then 2 MB/s
  128KB:  90 MB/s for 3s, then 3 MB/s
   64KB:  80 MB/s for 3s, then 3 MB/S

Then, I allocated a dummy 400MB file in RAM (/run/shm) and redid the
test on the RN102:

   4 MB:   8 MB/s
   2 MB:   8 MB/s
   1 MB:  92 MB/s
  512KB:  90 MB/s
  384KB:  90 MB/s
  256KB:  90 MB/s
  128KB:  90 MB/s
   64KB:  60 MB/s

In the end, here are the conclusions *I* draw from this test session,
do not hesitate to correct me:

 - Eric, it seems something changed in Linus' tree between the beginning
   of the thread and now, which somehow reduces the effect of the
   regression we were seeing: I never got back the 256KB/s.
 - Your revert patch still improves the perf a lot.
 - It seems reducing MVNETA_TX_DONE_TIMER_PERIOD does not help.
 - w/ your revert patch, I can confirm that the mvneta driver is capable
   of doing line rate w/ a proper tweak of the TCP send window (256KB
   instead of 4MB).
 - It seems I will have to spend some time on the SATA issues I
   previously thought were an artefact of not cleaning my tree during a
   debug session [1], i.e. there is IMHO an issue.

What I do not get is what can cause the perf to drop from 90MB/s to
3MB/s (w/ a 256KB send window) when streaming from the disk instead of
RAM. I have no issue having dd read from the fs @ 150MB/s or mvneta
streaming from RAM @ 90MB/s, but both together get me 3MB/s after a few
seconds.

Anyway, I think if the thread keeps going on improving mvneta, I'll do
all additional tests from RAM and will stop polluting netdev w/ possible
sata/disk/fs issues.

Cheers,

a+

[1]: http://thread.gmane.org/gmane.linux.ports.arm.kernel/271508
Eric Dumazet Nov. 20, 2013, 12:08 a.m. UTC | #7
On Wed, 2013-11-20 at 00:53 +0100, Arnaud Ebalard wrote:

> Anyway, I think if the thread keeps going on improving mvneta, I'll do
> all additional tests from RAM and will stop polluting netdev w/ possible
> sata/disk/fs issues.

;)

Alternative would be to use netperf or iperf to not use disk at all
and focus on TCP/network issues only.

Thanks !
Willy Tarreau Nov. 20, 2013, 12:35 a.m. UTC | #8
On Tue, Nov 19, 2013 at 04:08:49PM -0800, Eric Dumazet wrote:
> On Wed, 2013-11-20 at 00:53 +0100, Arnaud Ebalard wrote:
> 
> > Anyway, I think if the thread keeps going on improving mvneta, I'll do
> > all additional tests from RAM and will stop polluting netdev w/ possible
> > sata/disk/fs issues.
> 
> ;)
> 
> Alternative would be to use netperf or iperf to not use disk at all
> and focus on TCP/network issues only.

Yes, that's for the same reason that I continue to use inject/httpterm
for such purposes :
  - httpterm uses tee()+splice() to send pre-built pages without copying ;
  - inject uses recv(MSG_TRUNC) to ack everything without copying.

Both of them are really interesting to test the hardware's capabilities
and to push components in the middle to their limits without causing too
much burden to the end points.
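
The receiver side of that trick is basically a one-liner; a sketch of it:

        /* MSG_TRUNC on a TCP socket makes the kernel discard the data
         * instead of copying it out to user space (Linux-specific) */
        ssize_t n = recv(fd, NULL, 1024 * 1024, MSG_TRUNC);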

I don't know if either netperf or iperf can make use of this now, and
I've been used to my tools, but I should take a look again.

Cheers,
Willy
Eric Dumazet Nov. 20, 2013, 12:43 a.m. UTC | #9
On Wed, 2013-11-20 at 01:35 +0100, Willy Tarreau wrote:
> On Tue, Nov 19, 2013 at 04:08:49PM -0800, Eric Dumazet wrote:
> > On Wed, 2013-11-20 at 00:53 +0100, Arnaud Ebalard wrote:
> > 
> > > Anyway, I think if the thread keeps going on improving mvneta, I'll do
> > > all additional tests from RAM and will stop polluting netdev w/ possible
> > > sata/disk/fs issues.
> > 
> > ;)
> > 
> > Alternative would be to use netperf or iperf to not use disk at all
> > and focus on TCP/network issues only.
> 
> Yes, that's for the same reason that I continue to use inject/httpterm
> for such purposes :
>   - httpterm uses tee()+splice() to send pre-built pages without copying ;
>   - inject uses recv(MSG_TRUNC) to ack everything without copying.
> 
> Both of them are really interesting to test the hardware's capabilities
> and to push components in the middle to their limits without causing too
> much burden to the end points.
> 
> I don't know if either netperf or iperf can make use of this now, and
> I've been used to my tools, but I should take a look again.

netperf -t TCP_SENDFILE does the zero copy at the sender.

And more generally, the -V option does copy avoidance

(Use splice(sockfd -> nullfd) at the receiver if I remember well)
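
For example, something like this (host and duration are illustrative):

        netperf -H 192.168.0.10 -t TCP_SENDFILE -l 30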


Anyway, we should do the normal copy, because it might demonstrate
scheduling problems.

If you want to test raw speed, you could use pktgen ;)
Willy Tarreau Nov. 20, 2013, 12:52 a.m. UTC | #10
On Tue, Nov 19, 2013 at 04:43:48PM -0800, Eric Dumazet wrote:
> On Wed, 2013-11-20 at 01:35 +0100, Willy Tarreau wrote:
> > On Tue, Nov 19, 2013 at 04:08:49PM -0800, Eric Dumazet wrote:
> > > On Wed, 2013-11-20 at 00:53 +0100, Arnaud Ebalard wrote:
> > > 
> > > > Anyway, I think if the thread keeps going on improving mvneta, I'll do
> > > > all additional tests from RAM and will stop polluting netdev w/ possible
> > > > sata/disk/fs issues.
> > > 
> > > ;)
> > > 
> > > Alternative would be to use netperf or iperf to not use disk at all
> > > and focus on TCP/network issues only.
> > 
> > Yes, that's for the same reason that I continue to use inject/httpterm
> > for such purposes :
> >   - httpterm uses tee()+splice() to send pre-built pages without copying ;
> >   - inject uses recv(MSG_TRUNC) to ack everything without copying.
> > 
> > Both of them are really interesting to test the hardware's capabilities
> > and to push components in the middle to their limits without causing too
> > much burden to the end points.
> > 
> > I don't know if either netperf or iperf can make use of this now, and
> > I've been used to my tools, but I should take a look again.
> 
> netperf -t TCP_SENDFILE  does the zero copy at sender.
> 
> And more generally -V option does copy avoidance
> 
> (Use splice(sockfd -> nullfd) at receiver if I remember well)

OK thanks for the info.

> Anyway, we should do the normal copy, because it might demonstrate
> scheduling problems.

Yes, especially in this case, though I got the issue with GSO only,
so it might vary as well.

> If you want to test raw speed, you could use pktgen ;)

Except I'm mostly focused on HTTP as you know. And for generating higher
packet rates than pktgen, I have an absolutely ugly mvneta patch that I'm
a bit ashamed of: it multiplies the number of emitted descriptors for a
given skb by skb->sk->sk_mark (which I set using setsockopt). This allows
me to generate up to 1.488 Mpps on a USB-powered system, not that bad in
my opinion :-)

Willy
Thomas Petazzoni Nov. 20, 2013, 8:50 a.m. UTC | #11
Arnaud,

On Wed, 20 Nov 2013 00:53:43 +0100, Arnaud Ebalard wrote:

>  - It seems I will have to spend some time on the SATA issues I
>    previously thought were an artefact of not cleaning my tree during a
>    debug session [1], i.e. there is IMHO an issue.

I don't remember in detail what your SATA problem was, but just to let
you know that we are currently debugging a problem that occurs on
Armada XP (more than one core is needed for the problem to occur), with
SATA (the symptom is that after some time of SATA usage, SATA traffic is
stalled, and no SATA interrupts are generated anymore). We're still
working on this one, and trying to figure out where the problem is.

Best regards,

Thomas
Willy Tarreau Nov. 20, 2013, 5:12 p.m. UTC | #12
Hi guys,

On Sun, Nov 17, 2013 at 09:41:38AM -0800, Eric Dumazet wrote:
> On Sun, 2013-11-17 at 15:19 +0100, Willy Tarreau wrote:
> 
> > 
> > So it is fairly possible that in your case you can't fill the link if you
> > consume too many descriptors. For example, if your server uses TCP_NODELAY
> > and sends incomplete segments (which is quite common), it's very easy to
> > run out of descriptors before the link is full.
> 
> BTW I have a very simple patch for TCP stack that could help this exact
> situation...
> 
> Idea is to use TCP Small Queue so that we dont fill qdisc/TX ring with
> very small frames, and let tcp_sendmsg() have more chance to fill
> complete packets.
> 
> Again, for this to work very well, you need that NIC performs TX
> completion in reasonable amount of time...

Eric, first I would like to confirm that I could reproduce Arnaud's issue
using 3.10.19 (160 kB/s in the worst case).

Second, I confirm that your patch partially fixes it and my performance
can be brought back to what I had with 3.10-rc7, but with a lot of
concurrent streams. In fact, in 3.10-rc7, I managed to constantly saturate
the wire when transferring 7 concurrent streams (118.6 MB/s). With the patch
applied, performance is still only 27 MB/s at 7 concurrent streams, and I
need at least 35 concurrent streams to fill the pipe. Strangely, after
2 GB of cumulated data transferred, the bandwidth dropped 11-fold to
10 MB/s again.

If I revert both "0ae5f47eff tcp: TSQ can use a dynamic limit" and
your latest patch, the performance is back to original.

Now I understand there's a major issue with the driver. But since the
patch emphasizes the situations where drivers take a lot of time to
wake the queue up, don't you think there could be an issue with low
bandwidth links (eg: PPPoE over xDSL, 10 Mbps ethernet, etc...) ?
I'm a bit worried about what we might discover in this area I must
confess (despite generally being mostly focused on 10+ Gbps).

Best regards,
Willy
Eric Dumazet Nov. 20, 2013, 5:30 p.m. UTC | #13
On Wed, 2013-11-20 at 18:12 +0100, Willy Tarreau wrote:
> Hi guys,

> Eric, first I would like to confirm that I could reproduce Arnaud's issue
> using 3.10.19 (160 kB/s in the worst case).
> 
> Second, I confirm that your patch partially fixes it and my performance
> can be brought back to what I had with 3.10-rc7, but with a lot of
> concurrent streams. In fact, in 3.10-rc7, I managed to constantly saturate
> the wire when transferring 7 concurrent streams (118.6 MB/s). With the patch
> applied, performance is still only 27 MB/s at 7 concurrent streams, and I
> need at least 35 concurrent streams to fill the pipe. Strangely, after
> 2 GB of cumulated data transferred, the bandwidth dropped 11-fold to
> 10 MB/s again.
> 
> If I revert both "0ae5f47eff tcp: TSQ can use a dynamic limit" and
> your latest patch, the performance is back to original.
> 
> Now I understand there's a major issue with the driver. But since the
> patch emphasizes the situations where drivers take a lot of time to
> wake the queue up, don't you think there could be an issue with low
> bandwidth links (eg: PPPoE over xDSL, 10 Mbps ethernet, etc...) ?
> I'm a bit worried about what we might discover in this area I must
> confess (despite generally being mostly focused on 10+ Gbps).

Well, all TCP performance results are highly dependent on the workload,
and on both receiver and sender behavior.

We made many improvements like TSO auto-sizing and DRS (Dynamic Right
Sizing), and if the application used some specific settings (like
SO_SNDBUF / SO_RCVBUF or other tweaks), we cannot guarantee that the same
exact performance is reached from kernel version X to kernel version Y.

We try to make forward progress; there is little gain in reverting all
this great work. Linux had this tendency to favor throughput by using
overly large skbs. It's time to do better.

As explained, some drivers are buggy, and need fixes.

If nobody wants to fix them, this really means no one is interested in
getting them fixed.

I am willing to help if you provide details, because otherwise I need
a crystal ball ;)

One known problem of TCP is the fact that an incoming ACK making room in
the socket write queue immediately wakes up a blocked thread (POLLOUT),
even if only one MSS was acked and the write queue has 2MB of
outstanding bytes.

All these scheduling problems should be identified and fixed, and yes,
this will require a dozen more patches.

max(128KB, 1-2ms) of buffering per flow should be enough to reach
line rate, even for a single flow, but this means the sk_sndbuf value
for the socket must take into account the pipe size _plus_ 1ms of
buffering.
Willy Tarreau Nov. 20, 2013, 5:34 p.m. UTC | #14
On Wed, Nov 20, 2013 at 06:12:27PM +0100, Willy Tarreau wrote:
> Hi guys,
> 
> On Sun, Nov 17, 2013 at 09:41:38AM -0800, Eric Dumazet wrote:
> > On Sun, 2013-11-17 at 15:19 +0100, Willy Tarreau wrote:
> > 
> > > 
> > > So it is fairly possible that in your case you can't fill the link if you
> > > consume too many descriptors. For example, if your server uses TCP_NODELAY
> > > and sends incomplete segments (which is quite common), it's very easy to
> > > run out of descriptors before the link is full.
> > 
> > BTW I have a very simple patch for TCP stack that could help this exact
> > situation...
> > 
> > Idea is to use TCP Small Queue so that we dont fill qdisc/TX ring with
> > very small frames, and let tcp_sendmsg() have more chance to fill
> > complete packets.
> > 
> > Again, for this to work very well, you need that NIC performs TX
> > completion in reasonable amount of time...
> 
> Eric, first I would like to confirm that I could reproduce Arnaud's issue
> using 3.10.19 (160 kB/s in the worst case).
> 
> Second, I confirm that your patch partially fixes it and my performance
> can be brought back to what I had with 3.10-rc7, but with a lot of
> concurrent streams. In fact, in 3.10-rc7, I managed to constantly saturate
> the wire when transferring 7 concurrent streams (118.6 MB/s). With the patch
> applied, performance is still only 27 MB/s at 7 concurrent streams, and I
> need at least 35 concurrent streams to fill the pipe. Strangely, after
> 2 GB of cumulated data transferred, the bandwidth dropped 11-fold to
> 10 MB/s again.
> 
> If I revert both "0ae5f47eff tcp: TSQ can use a dynamic limit" and
> your latest patch, the performance is back to original.
> 
> Now I understand there's a major issue with the driver. But since the
> patch emphasizes the situations where drivers take a lot of time to
> wake the queue up, don't you think there could be an issue with low
> bandwidth links (eg: PPPoE over xDSL, 10 Mbps ethernet, etc...) ?
> I'm a bit worried about what we might discover in this area I must
> confess (despite generally being mostly focused on 10+ Gbps).

One important point: I was looking for the other patch you pointed to
in this long thread and finally found it:

> So
> http://git.kernel.org/cgit/linux/kernel/git/davem/net.git/commit/?id=98e09386c0ef4dfd48af7ba60ff908f0d525cdee
> 
> restored this minimal amount of buffering, and let the bigger amount for
> 40Gb NICs ;)

This one definitely restores original performance, so it's a much better
bet in my opinion :-)

Best regards,
Willy
Willy Tarreau Nov. 20, 2013, 5:38 p.m. UTC | #15
On Wed, Nov 20, 2013 at 09:30:07AM -0800, Eric Dumazet wrote:
> Well, all TCP performance results are highly dependent on the workload,
> and on both receiver and sender behavior.
> 
> We made many improvements like TSO auto-sizing and DRS (Dynamic Right
> Sizing), and if the application used some specific settings (like
> SO_SNDBUF / SO_RCVBUF or other tweaks), we cannot guarantee that the same
> exact performance is reached from kernel version X to kernel version Y.

Of course, which is why I only care when there's a significant
difference. If I need 6 streams in a version and 8 in another one to
fill the wire, I call them identical. It's only when we dig into the
details that we analyse the differences.

> We try to make forward progress; there is little gain in reverting all
> this great work. Linux had this tendency to favor throughput by using
> overly large skbs. It's time to do better.

I agree. Unfortunately our mails have crossed each other, so just to
keep this thread mostly linear, your next patch here:

   http://git.kernel.org/cgit/linux/kernel/git/davem/net.git/commit/?id=98e09386c0ef4dfd48af7ba60ff908f0d525cdee

Fixes that regression, and the performance is back to normal, which is
good.

> As explained, some drivers are buggy, and need fixes.

Agreed!

> If nobody wants to fix them, this really means no one is interested in
> getting them fixed.

I was reading the code exactly when I found your patch above, which I
was looking for :-)

> I am willing to help if you provide details, because otherwise I need
> a crystal ball ;)
> 
> One known problem of TCP is the fact that an incoming ACK making room in
> the socket write queue immediately wakes up a blocked thread (POLLOUT),
> even if only one MSS was acked and the write queue has 2MB of
> outstanding bytes.

Indeed.

> All these scheduling problems should be identified and fixed, and yes,
> this will require a dozen more patches.
> 
> max(128KB, 1-2ms) of buffering per flow should be enough to reach
> line rate, even for a single flow, but this means the sk_sndbuf value
> for the socket must take into account the pipe size _plus_ 1ms of
> buffering.

Which is the purpose of your patch above and I confirm it fixes the
problem.

Now looking at how to workaround this lack of Tx IRQ.

Thanks!
Willy
Eric Dumazet Nov. 20, 2013, 5:40 p.m. UTC | #16
On Wed, 2013-11-20 at 18:34 +0100, Willy Tarreau wrote:

> One important point: I was looking for the other patch you pointed to
> in this long thread and finally found it:
> 
> > So
> > http://git.kernel.org/cgit/linux/kernel/git/davem/net.git/commit/?id=98e09386c0ef4dfd48af7ba60ff908f0d525cdee
> > 
> > restored this minimal amount of buffering, and let the bigger amount for
> > 40Gb NICs ;)
> 
> This one definitely restores original performance, so it's a much better
> bet in my opinion :-)

I don't understand. I thought you were using this patch.

I guess we are spending time on an already solved problem.
Willy Tarreau Nov. 20, 2013, 6:15 p.m. UTC | #17
On Wed, Nov 20, 2013 at 09:40:22AM -0800, Eric Dumazet wrote:
> On Wed, 2013-11-20 at 18:34 +0100, Willy Tarreau wrote:
> 
> > One important point: I was looking for the other patch you pointed to
> > in this long thread and finally found it:
> > 
> > > So
> > > http://git.kernel.org/cgit/linux/kernel/git/davem/net.git/commit/?id=98e09386c0ef4dfd48af7ba60ff908f0d525cdee
> > > 
> > > restored this minimal amount of buffering, and let the bigger amount for
> > > 40Gb NICs ;)
> > 
> > This one definitely restores original performance, so it's a much better
> > bet in my opinion :-)
> 
> I don't understand. I thought you were using this patch.

No, I was on latest stable (3.10.19) which exhibits the regression but does
not yet have your fix above. Then I tested the patch you proposed in this
thread, then this latest one. Since the patch is not yet even in Linus'
tree, I'm not sure Arnaud has tried it yet.

> I guess we are spending time on an already solved problem.

That's possible indeed. Sorry if I was not clear enough, I tried.

Regards,
Willy
Eric Dumazet Nov. 20, 2013, 6:21 p.m. UTC | #18
On Wed, 2013-11-20 at 19:15 +0100, Willy Tarreau wrote:

> No, I was on latest stable (3.10.19) which exhibits the regression but does
> not yet have your fix above. Then I tested the patch you proposed in this
> thread, then this latest one. Since the patch is not yet even in Linus'
> tree, I'm not sure Arnaud has tried it yet.

Oh right ;)

BTW Linus' tree has the fix, I just checked this now.

(Linus got David's changes 19 hours ago)
Willy Tarreau Nov. 20, 2013, 6:29 p.m. UTC | #19
On Wed, Nov 20, 2013 at 10:21:24AM -0800, Eric Dumazet wrote:
> On Wed, 2013-11-20 at 19:15 +0100, Willy Tarreau wrote:
> 
> > No, I was on latest stable (3.10.19) which exhibits the regression but does
> > not yet have your fix above. Then I tested the patch your proposed in this
> > thread, then this latest one. Since the patch is not yet even in Linus'
> > tree, I'm not sure Arnaud has tried it yet.
> 
> Oh right ;)
> 
> BTW Linus' tree has the fix, I just checked this now.
> 
> (Linus got David's changes 19 hours ago)

Ah yes I didn't look far enough back.

Thanks,
Willy
David Miller Nov. 20, 2013, 6:52 p.m. UTC | #20
From: Eric Dumazet <eric.dumazet@gmail.com>
Date: Wed, 20 Nov 2013 09:30:07 -0800

> max(128KB, 1-2ms) of buffering per flow should be enough to reach
> line rate, even for a single flow, but this means the sk_sndbuf value
> for the socket must take into account the pipe size _plus_ 1ms of
> buffering.

And we can implement this using the estimated pacing rate.
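
A minimal sketch of that idea (hypothetical helper, sizing details
illustrative; sk_pacing_rate is the estimate from the recent pacing
work):

        static u32 tcp_sndbuf_target(const struct sock *sk)
        {
                const struct tcp_sock *tp = tcp_sk(sk);
                u32 rate = sk->sk_pacing_rate;           /* estimated bytes/sec */
                u32 pipe = tp->snd_cwnd * tp->mss_cache; /* rough pipe size */

                /* pipe size plus ~1ms of buffering, with a 128KB floor */
                return max_t(u32, 128 * 1024, pipe + rate / 1000);
        }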
Arnaud Ebalard Nov. 20, 2013, 7:21 p.m. UTC | #21
Hi Thomas,

I removed netdev from that reply

arno@natisbad.org (Arnaud Ebalard) writes:

> Thomas Petazzoni <thomas.petazzoni@free-electrons.com> writes:
>
>> Arnaud,
>>
>> On Wed, 20 Nov 2013 00:53:43 +0100, Arnaud Ebalard wrote:
>>
>>>  - It seems I will have to spend some time on the SATA issues I
>>>    previously thought were an artefact of not cleaning my tree during a
>>>    debug session [1], i.e. there is IMHO an issue.
>>
>> I don't remember in detail what your SATA problem was, but just to let
>> you know that we are currently debugging a problem that occurs on
>> Armada XP (more than one core is needed for the problem to occur), with
>> SATA (the symptom is that after some time of SATA usage, SATA traffic is
>> stalled, and no SATA interrupts are generated anymore). We're still
>> working on this one, and trying to figure out where the problem is.

The problem I had is in the first email of:

  http://thread.gmane.org/gmane.linux.ports.arm.kernel/271508

Then, yesterday, when testing with the USB 3.0 to ethernet dongle
connected to my RN102 (Armada 370) as primary interface, I got the
following. It is easily reproducible:

[  317.412873] ata1.00: exception Emask 0x0 SAct 0x3 SErr 0x0 action 0x6 frozen
[  317.419947] ata1.00: failed command: READ FPDMA QUEUED
[  317.425118] ata1.00: cmd 60/00:00:00:07:2a/01:00:00:00:00/40 tag 0 ncq 131072 in
[  317.425118]          res 40/00:00:00:00:00/00:00:00:00:00/00 Emask 0x4 (timeout)
[  317.439926] ata1.00: status: { DRDY }
[  317.443600] ata1.00: failed command: READ FPDMA QUEUED
[  317.448756] ata1.00: cmd 60/00:08:00:08:2a/01:00:00:00:00/40 tag 1 ncq 131072 in
[  317.448756]          res 40/00:00:00:00:00/00:00:00:00:00/00 Emask 0x4 (timeout)
[  317.463565] ata1.00: status: { DRDY }
[  317.467244] ata1: hard resetting link
[  318.012913] ata1: SATA link up 6.0 Gbps (SStatus 133 SControl 300)
[  318.020220] ata1.00: configured for UDMA/133
[  318.024559] ata1.00: device reported invalid CHS sector 0
[  318.030001] ata1.00: device reported invalid CHS sector 0
[  318.035425] ata1: EH complete

And then again:

[  381.342873] ata1.00: exception Emask 0x0 SAct 0x3 SErr 0x0 action 0x6 frozen
[  381.349947] ata1.00: failed command: READ FPDMA QUEUED
[  381.355119] ata1.00: cmd 60/00:00:00:03:30/01:00:00:00:00/40 tag 0 ncq 131072 in
[  381.355119]          res 40/00:00:00:00:00/00:00:00:00:00/00 Emask 0x4 (timeout)
[  381.369927] ata1.00: status: { DRDY }
[  381.373599] ata1.00: failed command: READ FPDMA QUEUED
[  381.378756] ata1.00: cmd 60/00:08:00:04:30/01:00:00:00:00/40 tag 1 ncq 131072 in
[  381.378756]          res 40/00:00:00:00:00/00:00:00:00:00/00 Emask 0x4 (timeout)
[  381.393563] ata1.00: status: { DRDY }
[  381.397242] ata1: hard resetting link
[  381.942848] ata1: SATA link up 6.0 Gbps (SStatus 133 SControl 300)
[  381.950167] ata1.00: configured for UDMA/133
[  381.954496] ata1.00: device reported invalid CHS sector 0
[  381.959958] ata1.00: device reported invalid CHS sector 0
[  381.965383] ata1: EH complete

But, as the problem seems to happen when the dongle is connected and in
use (and considering current threads on the topic on USB and netdev ML),
I think I will wait for things to calm down and test again with a 3.12
and then a 3.13-rcX.

Anyway, if you find something on your bug, I can give patches a try on
my RN2120.

Cheers,

a+
Arnaud Ebalard Nov. 20, 2013, 7:22 p.m. UTC | #22
Hi,

Willy Tarreau <w@1wt.eu> writes:

> On Wed, Nov 20, 2013 at 09:40:22AM -0800, Eric Dumazet wrote:
>> On Wed, 2013-11-20 at 18:34 +0100, Willy Tarreau wrote:
>> 
>> > One important point: I was looking for the other patch you pointed to
>> > in this long thread and finally found it:
>> > 
>> > > So
>> > > http://git.kernel.org/cgit/linux/kernel/git/davem/net.git/commit/?id=98e09386c0ef4dfd48af7ba60ff908f0d525cdee
>> > > 
>> > > restored this minimal amount of buffering, and let the bigger amount for
>> > > 40Gb NICs ;)
>> > 
>> > This one definitely restores original performance, so it's a much better
>> > bet in my opinion :-)
>> 
>> I don't understand. I thought you were using this patch.
>
> No, I was on latest stable (3.10.19) which exhibits the regression but does
> not yet have your fix above. Then I tested the patch your proposed in this
> thread, then this latest one. Since the patch is not yet even in Linus'
> tree, I'm not sure Arnaud has tried it yet.

This is the one I have been using since Eric published it a week ago:

 http://www.spinics.net/lists/netdev/msg257396.html

Cheers,

a+

Patch

diff --git a/net/ipv4/tcp.c b/net/ipv4/tcp.c
index 3dc0c6c..10456cf 100644
--- a/net/ipv4/tcp.c
+++ b/net/ipv4/tcp.c
@@ -624,13 +624,19 @@  static inline void tcp_push(struct sock *sk, int flags, int mss_now,
 {
 	if (tcp_send_head(sk)) {
 		struct tcp_sock *tp = tcp_sk(sk);
+		struct sk_buff *skb = tcp_write_queue_tail(sk);
 
 		if (!(flags & MSG_MORE) || forced_push(tp))
-			tcp_mark_push(tp, tcp_write_queue_tail(sk));
+			tcp_mark_push(tp, skb);
 
 		tcp_mark_urg(tp, flags);
-		__tcp_push_pending_frames(sk, mss_now,
-					  (flags & MSG_MORE) ? TCP_NAGLE_CORK : nonagle);
+		if (flags & MSG_MORE)
+			nonagle = TCP_NAGLE_CORK;
+		if (atomic_read(&sk->sk_wmem_alloc) > 2048) {
+			set_bit(TSQ_THROTTLED, &tp->tsq_flags);
+			nonagle = TCP_NAGLE_CORK;
+		}
+		__tcp_push_pending_frames(sk, mss_now, nonagle);
 	}
 }