Message ID: 1384710098.8604.58.camel@edumazet-glaptop2.roam.corp.google.com (mailing list archive)
Hi,

Eric Dumazet <eric.dumazet@gmail.com> writes:

> On Sun, 2013-11-17 at 15:19 +0100, Willy Tarreau wrote:
>
>> So it is fairly possible that in your case you can't fill the link if you
>> consume too many descriptors. For example, if your server uses TCP_NODELAY
>> and sends incomplete segments (which is quite common), it's very easy to
>> run out of descriptors before the link is full.
>
> BTW I have a very simple patch for TCP stack that could help this exact
> situation...
>
> Idea is to use TCP Small Queue so that we don't fill qdisc/TX ring with
> very small frames, and let tcp_sendmsg() have more chance to fill
> complete packets.
>
> Again, for this to work very well, you need that NIC performs TX
> completion in reasonable amount of time...
>
> diff --git a/net/ipv4/tcp.c b/net/ipv4/tcp.c
> index 3dc0c6c..10456cf 100644
> --- a/net/ipv4/tcp.c
> +++ b/net/ipv4/tcp.c
> @@ -624,13 +624,19 @@ static inline void tcp_push(struct sock *sk, int flags, int mss_now,
>  {
>  	if (tcp_send_head(sk)) {
>  		struct tcp_sock *tp = tcp_sk(sk);
> +		struct sk_buff *skb = tcp_write_queue_tail(sk);
>
>  		if (!(flags & MSG_MORE) || forced_push(tp))
> -			tcp_mark_push(tp, tcp_write_queue_tail(sk));
> +			tcp_mark_push(tp, skb);
>
>  		tcp_mark_urg(tp, flags);
> -		__tcp_push_pending_frames(sk, mss_now,
> -					  (flags & MSG_MORE) ? TCP_NAGLE_CORK : nonagle);
> +		if (flags & MSG_MORE)
> +			nonagle = TCP_NAGLE_CORK;
> +		if (atomic_read(&sk->sk_wmem_alloc) > 2048) {
> +			set_bit(TSQ_THROTTLED, &tp->tsq_flags);
> +			nonagle = TCP_NAGLE_CORK;
> +		}
> +		__tcp_push_pending_frames(sk, mss_now, nonagle);
>  	}
>  }

I did some tests regarding mvneta perf on the current Linus tree (commit
2d3c627502f2a9b0, w/ c9eeec26e32e "tcp: TSQ can use a dynamic limit"
reverted). It has Simon's tclk patch for mvebu (1022c75f5abd, "clk:
armada-370: fix tclk frequencies"). The kernel has some debug options
enabled and the patch above is not applied. I will spend some time on
these two directions this evening. The idea was to get some numbers on
the impact of TCP send window size and tcp_limit_output_bytes for
mvneta.

The test is done with a laptop (Debian, 3.11.0, e1000e) directly
connected to a RN102 (Marvell Armada 370 @1.2GHz, mvneta). The RN102
is running Debian armhf with an Apache2 serving a 1GB file from ext4
over lvm over RAID1 from 2 WD30EFRX. The client is nothing fancy, i.e.
a simple wget w/ -O /dev/null option.

With the exact same setup on a ReadyNAS Duo v2 (Kirkwood 88f6282
@1.6GHz, mv643xx_eth), I managed to get a throughput of 108MB/s
(cannot remember the kernel version but sth between 3.8 and 3.10).

So with that setup:

 w/ TCP send window set to 4MB:   17.4 MB/s
 w/ TCP send window set to 2MB:   16.2 MB/s
 w/ TCP send window set to 1MB:   15.6 MB/s
 w/ TCP send window set to 512KB: 25.6 MB/s
 w/ TCP send window set to 256KB: 57.7 MB/s
 w/ TCP send window set to 128KB: 54.0 MB/s
 w/ TCP send window set to 64KB:  46.2 MB/s
 w/ TCP send window set to 32KB:  42.8 MB/s

Then, I started playing w/ tcp_limit_output_bytes (default is 131072),
w/ TCP send window set to 256KB:

 tcp_limit_output_bytes set to 512KB: 59.3 MB/s
 tcp_limit_output_bytes set to 256KB: 58.5 MB/s
 tcp_limit_output_bytes set to 128KB: 56.2 MB/s
 tcp_limit_output_bytes set to 64KB:  32.1 MB/s
 tcp_limit_output_bytes set to 32KB:  4.76 MB/s

As a side note, during the test, I sometimes get peaks at 90MB/s for a
few seconds at the beginning, which tends to confirm what Willy wrote,
i.e. that the hardware can do more.

Cheers,

a+
On Tue, 2013-11-19 at 07:44 +0100, Arnaud Ebalard wrote:

> I did some tests regarding mvneta perf on the current Linus tree (commit
> 2d3c627502f2a9b0, w/ c9eeec26e32e "tcp: TSQ can use a dynamic limit"
> reverted). It has Simon's tclk patch for mvebu (1022c75f5abd, "clk:
> armada-370: fix tclk frequencies"). The kernel has some debug options
> enabled and the patch above is not applied. I will spend some time on
> these two directions this evening. The idea was to get some numbers on
> the impact of TCP send window size and tcp_limit_output_bytes for
> mvneta.

Note the last patch I sent was not relevant to your problem, do not
bother trying it. It's useful for applications doing lots of consecutive
short writes, like interactive ssh launching line-buffered commands.

> The test is done with a laptop (Debian, 3.11.0, e1000e) directly
> connected to a RN102 (Marvell Armada 370 @1.2GHz, mvneta). The RN102
> is running Debian armhf with an Apache2 serving a 1GB file from ext4
> over lvm over RAID1 from 2 WD30EFRX. The client is nothing fancy, i.e.
> a simple wget w/ -O /dev/null option.
>
> With the exact same setup on a ReadyNAS Duo v2 (Kirkwood 88f6282
> @1.6GHz, mv643xx_eth), I managed to get a throughput of 108MB/s
> (cannot remember the kernel version but sth between 3.8 and 3.10).
>
> So with that setup:
>
>  w/ TCP send window set to 4MB:   17.4 MB/s
>  w/ TCP send window set to 2MB:   16.2 MB/s
>  w/ TCP send window set to 1MB:   15.6 MB/s
>  w/ TCP send window set to 512KB: 25.6 MB/s
>  w/ TCP send window set to 256KB: 57.7 MB/s
>  w/ TCP send window set to 128KB: 54.0 MB/s
>  w/ TCP send window set to 64KB:  46.2 MB/s
>  w/ TCP send window set to 32KB:  42.8 MB/s

One of the problems is that tcp_sendmsg() holds the socket lock for the
whole duration of the system call if it does not have to sleep.

This model doesn't allow incoming ACKs to be processed (they are put in
the socket backlog and will be processed at socket release time), nor
TX completion to queue the next chunk.

These strange results you have tend to show that if you have a big TCP
send window, the web server pushes a lot of bytes per system call and
might stall the ACK clocking or TX refills.

> Then, I started playing w/ tcp_limit_output_bytes (default is 131072),
> w/ TCP send window set to 256KB:
>
>  tcp_limit_output_bytes set to 512KB: 59.3 MB/s
>  tcp_limit_output_bytes set to 256KB: 58.5 MB/s
>  tcp_limit_output_bytes set to 128KB: 56.2 MB/s
>  tcp_limit_output_bytes set to 64KB:  32.1 MB/s
>  tcp_limit_output_bytes set to 32KB:  4.76 MB/s
>
> As a side note, during the test, I sometimes get peaks at 90MB/s for a
> few seconds at the beginning, which tends to confirm what Willy wrote,
> i.e. that the hardware can do more.

I would also check the receiver. I suspect packet drops because of a
bad driver doing skb->truesize overshooting.

nstat >/dev/null ; wget .... ; nstat
Hi Eric,

On Tue, Nov 19, 2013 at 05:53:14AM -0800, Eric Dumazet wrote:
> These strange results you have tend to show that if you have a big TCP
> send window, the web server pushes a lot of bytes per system call and
> might stall the ACK clocking or TX refills.

It's the TX refills which are not done in this case, from what I think I
understood in the driver. IIRC, the refill is done once at the beginning
of xmit and in the tx timer callback. So if you have too large a window
that fills the descriptors during a few tx calls during which no desc
was released, you can end up having to wait for the timer since you're
not allowed to send anymore.

> > Then, I started playing w/ tcp_limit_output_bytes (default is 131072),
> > w/ TCP send window set to 256KB:
> >
> >  tcp_limit_output_bytes set to 512KB: 59.3 MB/s
> >  tcp_limit_output_bytes set to 256KB: 58.5 MB/s
> >  tcp_limit_output_bytes set to 128KB: 56.2 MB/s
> >  tcp_limit_output_bytes set to 64KB:  32.1 MB/s
> >  tcp_limit_output_bytes set to 32KB:  4.76 MB/s
> >
> > As a side note, during the test, I sometimes get peaks at 90MB/s for a
> > few seconds at the beginning, which tends to confirm what Willy wrote,
> > i.e. that the hardware can do more.
>
> I would also check the receiver. I suspect packet drops because of a
> bad driver doing skb->truesize overshooting.

When I first observed the issue, I suspected my laptop's driver, so I
tried with a dockstar instead and the issue disappeared... until I
increased the tcp_rmem on it to match my laptop :-)

Arnaud, you might be interested in checking whether the following change
in mvneta.c does something for you:

-  #define MVNETA_TX_DONE_TIMER_PERIOD   10
+  #define MVNETA_TX_DONE_TIMER_PERIOD   (1000/HZ)

This can only have any effect if you run at 250 or 1000 Hz, but not at
100 of course. It should reduce the time to first IRQ.

Willy
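To make the HZ remark above concrete, here is a small standalone sketch
(my own approximation of the kernel's msecs_to_jiffies() round-up, not
the mvneta driver code) showing how the existing 10 ms period and the
proposed (1000/HZ) ms period translate into timer ticks:

#include <stdio.h>

/* Rough illustration only: how a timer period in milliseconds maps to
 * jiffies.  ms_to_jiffies() mimics the round-up behaviour of the
 * kernel's msecs_to_jiffies() when 1000 is a multiple of HZ. */
static unsigned long ms_to_jiffies(unsigned int ms, unsigned int hz)
{
	return ((unsigned long)ms * hz + 999) / 1000;	/* round up to a full tick */
}

int main(void)
{
	unsigned int hz_values[] = { 100, 250, 1000 };

	for (int i = 0; i < 3; i++) {
		unsigned int hz = hz_values[i];
		unsigned long old = ms_to_jiffies(10, hz);	   /* current 10 ms period */
		unsigned long new = ms_to_jiffies(1000 / hz, hz);  /* proposed (1000/HZ) ms */

		printf("HZ=%4u: 10ms -> %lu jiffies (%lu ms), 1000/HZ -> %lu jiffies (%lu ms)\n",
		       hz, old, old * 1000 / hz, new, new * 1000 / hz);
	}
	return 0;
}

At HZ=100 both values end up as one 10 ms tick, which is why the change
can only matter at 250 or 1000 Hz, where the worst-case wait for the
TX-done timer drops from 12 ms or 10 ms down to a single tick.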
On Tue, 2013-11-19 at 18:43 +0100, Willy Tarreau wrote:

> -  #define MVNETA_TX_DONE_TIMER_PERIOD   10
> +  #define MVNETA_TX_DONE_TIMER_PERIOD   (1000/HZ)

I suggested this in a prior mail:

#define MVNETA_TX_DONE_TIMER_PERIOD 1

But apparently it was triggering strange crashes...

> This can only have any effect if you run at 250 or 1000 Hz, but not at
> 100 of course. It should reduce the time to first IRQ.
On Tue, Nov 19, 2013 at 10:31:50AM -0800, Eric Dumazet wrote:
> On Tue, 2013-11-19 at 18:43 +0100, Willy Tarreau wrote:
>
> > -  #define MVNETA_TX_DONE_TIMER_PERIOD   10
> > +  #define MVNETA_TX_DONE_TIMER_PERIOD   (1000/HZ)
>
> I suggested this in a prior mail:
>
> #define MVNETA_TX_DONE_TIMER_PERIOD 1

Ah sorry, I remember now.

> But apparently it was triggering strange crashes...

Ah, when a bug hides another one, it's the situation I prefer, because
by working on one, you end up fixing two :-)

Cheers,
Willy
Hi,

Willy Tarreau <w@1wt.eu> writes:

> On Tue, Nov 19, 2013 at 10:31:50AM -0800, Eric Dumazet wrote:
>> On Tue, 2013-11-19 at 18:43 +0100, Willy Tarreau wrote:
>>
>> > -  #define MVNETA_TX_DONE_TIMER_PERIOD   10
>> > +  #define MVNETA_TX_DONE_TIMER_PERIOD   (1000/HZ)
>>
>> I suggested this in a prior mail:
>>
>> #define MVNETA_TX_DONE_TIMER_PERIOD 1
>
> Ah sorry, I remember now.
>
>> But apparently it was triggering strange crashes...
>
> Ah, when a bug hides another one, it's the situation I prefer, because
> by working on one, you end up fixing two :-)

Follow me just for one sec: today, I got a USB 3.0 Gigabit Ethernet
adapter, more specifically an AX88179-based one (Logitec LAN-GTJU3H3),
about which there is currently a thread on the netdev and linux-usb
lists. Anyway, I decided to give it a try on my RN102 just to check what
performance I could achieve. So I basically did the same experiment as
yesterday (wget on the client against a 1GB file located on the
filesystem served by an apache on the NAS), except that this time the
AX88179-based adapter was used instead of the mvneta-based interface.

Well, the download started at a high rate (90MB/s) but then dropped, and
I got some SATA errors on the NAS (similar to the errors I already got
during the 12.0-rc series [1] and finally *erroneously* considered an
artefact).

So I decided to remove the SATA controllers and disks from the equation:
I switched to my ReadyNAS 2120, whose GbE interfaces are also based on
the mvneta driver and which comes w/ 2GB of RAM. The main additional
difference is that this device is a dual core armada @1.2GHz, where the
RN102 is a single core armada @1.2GHz. I created a dummy 1GB file *in
RAM* (/run/shm) to have it served by the apache2 instead of the file
previously stored on the disks.

I started w/ today's Linus tree (dec8e46178b) with Eric's revert patch
for c9eeec26e32e (tcp: TSQ can use a dynamic limit) and also the change
to the mvneta driver to have:

-#define MVNETA_TX_DONE_TIMER_PERIOD 10
+#define MVNETA_TX_DONE_TIMER_PERIOD 1

Here are the average speeds given by wget for the following TCP send
window sizes:

   4 MB:  19 MB/s
   2 MB:  21 MB/s
   1 MB:  21 MB/s
  512KB:  23 MB/s
  384KB: 105 MB/s
  256KB: 112 MB/s
  128KB: 111 MB/s
   64KB:  93 MB/s

Then, I decided to redo the exact same test w/o the change on
MVNETA_TX_DONE_TIMER_PERIOD (i.e. w/ the initial value of 10). I get the
exact same results as with MVNETA_TX_DONE_TIMER_PERIOD set to 1, i.e.:

   4 MB:  20 MB/s
   2 MB:  21 MB/s
   1 MB:  21 MB/s
  512KB:  22 MB/s
  384KB: 105 MB/s
  256KB: 112 MB/s
  128KB: 111 MB/s
   64KB:  93 MB/s

And then, I also dropped Eric's revert patch for c9eeec26e32e (tcp: TSQ
can use a dynamic limit), just to verify we came back to where the
thread started, but I got a surprise:

   4 MB:  10 MB/s
   2 MB:  11 MB/s
   1 MB:  10 MB/s
  512KB:  12 MB/s
  384KB: 104 MB/s
  256KB: 112 MB/s
  128KB: 112 MB/s
   64KB:  93 MB/s

Instead of the 256KB/s I had observed before, the low value was now
10MB/s. I thought it was due to the switch from the RN102 to the RN2120,
so I came back to the RN102 w/o any specific patch for mvneta nor your
revert patch for c9eeec26e32e, i.e. only Linus' tree as it is today
(dec8e46178b).
The file is served from the disk:

   4 MB:   5 MB/s
   2 MB:   5 MB/s
   1 MB:   5 MB/s
  512KB:   5 MB/s
  384KB:  90 MB/s for 4s, then 3 MB/s
  256KB:  80 MB/s for 3s, then 2 MB/s
  128KB:  90 MB/s for 3s, then 3 MB/s
   64KB:  80 MB/s for 3s, then 3 MB/s

Then, I allocated a dummy 400MB file in RAM (/run/shm) and redid the
test on the RN102:

   4 MB:   8 MB/s
   2 MB:   8 MB/s
   1 MB:  92 MB/s
  512KB:  90 MB/s
  384KB:  90 MB/s
  256KB:  90 MB/s
  128KB:  90 MB/s
   64KB:  60 MB/s

In the end, here are the conclusions *I* draw from this test session, do
not hesitate to correct me:

 - Eric, it seems something changed in Linus' tree between the beginning
   of the thread and now, which somehow reduces the effect of the
   regression we were seeing: I never got back the 256KB/s.
 - Your revert patch still improves the perf a lot.
 - It seems reducing MVNETA_TX_DONE_TIMER_PERIOD does not help.
 - w/ your revert patch, I can confirm that the mvneta driver is capable
   of doing line rate w/ a proper tweak of the TCP send window (256KB
   instead of 4MB).
 - It seems I will have to spend some time on the SATA issues I
   previously thought were an artefact of not cleaning my tree during a
   debug session [1], i.e. there is IMHO an issue.

What I do not get is what can cause the perf to drop from 90MB/s to
3MB/s (w/ a 256KB send window) when streaming from the disk instead of
the RAM. I have no issue having dd read from the fs @ 150MB/s and mvneta
streaming from RAM @ 90MB/s, but both together get me 3MB/s after a few
seconds.

Anyway, I think if the thread keeps going on improving mvneta, I'll do
all additional tests from RAM and will stop polluting netdev w/ possible
sata/disk/fs issues.

Cheers,

a+

[1]: http://thread.gmane.org/gmane.linux.ports.arm.kernel/271508
On Wed, 2013-11-20 at 00:53 +0100, Arnaud Ebalard wrote:

> Anyway, I think if the thread keeps going on improving mvneta, I'll do
> all additional tests from RAM and will stop polluting netdev w/ possible
> sata/disk/fs issues.

;)

An alternative would be to use netperf or iperf to not use the disk at
all and focus on TCP/network issues only.

Thanks!
On Tue, Nov 19, 2013 at 04:08:49PM -0800, Eric Dumazet wrote:
> On Wed, 2013-11-20 at 00:53 +0100, Arnaud Ebalard wrote:
>
> > Anyway, I think if the thread keeps going on improving mvneta, I'll do
> > all additional tests from RAM and will stop polluting netdev w/ possible
> > sata/disk/fs issues.
>
> ;)
>
> An alternative would be to use netperf or iperf to not use the disk at
> all and focus on TCP/network issues only.

Yes, that's for the same reason that I continue to use inject/httpterm
for such purposes:
  - httpterm uses tee()+splice() to send pre-built pages without copying;
  - inject uses recv(MSG_TRUNC) to ack everything without copying.

Both of them are really interesting to test the hardware's capabilities
and to push components in the middle to their limits without causing too
much burden to the end points.

I don't know if either netperf or iperf can make use of this now, and
I've been used to my tools, but I should take a look again.

Cheers,
Willy
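As an illustration of the second point, here is a minimal sketch of such
a receiver (assumptions: a hypothetical server at 192.168.1.1:8000, and
my own reading of the MSG_TRUNC behaviour for TCP documented in tcp(7);
this is not the actual inject code):

#include <arpa/inet.h>
#include <netinet/in.h>
#include <stdio.h>
#include <sys/socket.h>
#include <unistd.h>

/* Drain a TCP stream without copying it to user space: with MSG_TRUNC on
 * a TCP socket, Linux discards the received bytes instead of filling the
 * buffer, so ACKs flow back to the sender at almost no CPU cost. */
int main(void)
{
	int fd = socket(AF_INET, SOCK_STREAM, 0);
	struct sockaddr_in srv = { .sin_family = AF_INET, .sin_port = htons(8000) };
	char dummy[65536];
	long long total = 0;
	ssize_t n;

	inet_pton(AF_INET, "192.168.1.1", &srv.sin_addr);	/* hypothetical server */
	if (connect(fd, (struct sockaddr *)&srv, sizeof(srv)) < 0) {
		perror("connect");
		return 1;
	}
	/* The buffer only bounds the per-call size; the kernel does not
	 * copy into it when MSG_TRUNC is set. */
	while ((n = recv(fd, dummy, sizeof(dummy), MSG_TRUNC)) > 0)
		total += n;
	printf("received (and discarded) %lld bytes\n", total);
	close(fd);
	return 0;
}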
On Wed, 2013-11-20 at 01:35 +0100, Willy Tarreau wrote:
> On Tue, Nov 19, 2013 at 04:08:49PM -0800, Eric Dumazet wrote:
> > On Wed, 2013-11-20 at 00:53 +0100, Arnaud Ebalard wrote:
> >
> > > Anyway, I think if the thread keeps going on improving mvneta, I'll do
> > > all additional tests from RAM and will stop polluting netdev w/ possible
> > > sata/disk/fs issues.
> >
> > ;)
> >
> > An alternative would be to use netperf or iperf to not use the disk at
> > all and focus on TCP/network issues only.
>
> Yes, that's for the same reason that I continue to use inject/httpterm
> for such purposes:
>   - httpterm uses tee()+splice() to send pre-built pages without copying;
>   - inject uses recv(MSG_TRUNC) to ack everything without copying.
>
> Both of them are really interesting to test the hardware's capabilities
> and to push components in the middle to their limits without causing too
> much burden to the end points.
>
> I don't know if either netperf or iperf can make use of this now, and
> I've been used to my tools, but I should take a look again.

netperf -t TCP_SENDFILE does the zero copy at the sender.

And more generally, the -V option does copy avoidance (it uses
splice(sockfd -> nullfd) at the receiver, if I remember well).

Anyway, we should do the normal copy, because it might demonstrate
scheduling problems.

If you want to test raw speed, you could use pktgen ;)
On Tue, Nov 19, 2013 at 04:43:48PM -0800, Eric Dumazet wrote:
> On Wed, 2013-11-20 at 01:35 +0100, Willy Tarreau wrote:
> > On Tue, Nov 19, 2013 at 04:08:49PM -0800, Eric Dumazet wrote:
> > > On Wed, 2013-11-20 at 00:53 +0100, Arnaud Ebalard wrote:
> > >
> > > > Anyway, I think if the thread keeps going on improving mvneta, I'll do
> > > > all additional tests from RAM and will stop polluting netdev w/ possible
> > > > sata/disk/fs issues.
> > >
> > > ;)
> > >
> > > An alternative would be to use netperf or iperf to not use the disk at
> > > all and focus on TCP/network issues only.
> >
> > Yes, that's for the same reason that I continue to use inject/httpterm
> > for such purposes:
> >   - httpterm uses tee()+splice() to send pre-built pages without copying;
> >   - inject uses recv(MSG_TRUNC) to ack everything without copying.
> >
> > Both of them are really interesting to test the hardware's capabilities
> > and to push components in the middle to their limits without causing too
> > much burden to the end points.
> >
> > I don't know if either netperf or iperf can make use of this now, and
> > I've been used to my tools, but I should take a look again.
>
> netperf -t TCP_SENDFILE does the zero copy at the sender.
>
> And more generally, the -V option does copy avoidance (it uses
> splice(sockfd -> nullfd) at the receiver, if I remember well).

OK, thanks for the info.

> Anyway, we should do the normal copy, because it might demonstrate
> scheduling problems.

Yes, especially in this case, though I got the issue with GSO only, so
it might vary as well.

> If you want to test raw speed, you could use pktgen ;)

Except I'm mostly focused on HTTP as you know. And for generating higher
packet rates than pktgen, I have an absolutely ugly patch for mvneta
that I'm a bit ashamed of, which multiplies the number of emitted
descriptors for a given skb by skb->sk->sk_mark (which I set using
setsockopt). This allows me to generate up to 1.488 Mpps on a
USB-powered system, not that bad in my opinion :-)

Willy
Arnaud,

On Wed, 20 Nov 2013 00:53:43 +0100, Arnaud Ebalard wrote:

> - It seems I will have to spend some time on the SATA issues I
>   previously thought were an artefact of not cleaning my tree during a
>   debug session [1], i.e. there is IMHO an issue.

I don't remember in detail what your SATA problem was, but just to let
you know that we are currently debugging a problem that occurs on
Armada XP (more than one core is needed for the problem to occur) with
SATA (the symptom is that after some time of SATA usage, SATA traffic is
stalled, and no SATA interrupts are generated anymore). We're still
working on this one, and trying to figure out where the problem is.

Best regards,

Thomas
Hi guys,

On Sun, Nov 17, 2013 at 09:41:38AM -0800, Eric Dumazet wrote:
> On Sun, 2013-11-17 at 15:19 +0100, Willy Tarreau wrote:
>
> > So it is fairly possible that in your case you can't fill the link if you
> > consume too many descriptors. For example, if your server uses TCP_NODELAY
> > and sends incomplete segments (which is quite common), it's very easy to
> > run out of descriptors before the link is full.
>
> BTW I have a very simple patch for TCP stack that could help this exact
> situation...
>
> Idea is to use TCP Small Queue so that we don't fill qdisc/TX ring with
> very small frames, and let tcp_sendmsg() have more chance to fill
> complete packets.
>
> Again, for this to work very well, you need that NIC performs TX
> completion in reasonable amount of time...

Eric, first I would like to confirm that I could reproduce Arnaud's issue
using 3.10.19 (160 kB/s in the worst case).

Second, I confirm that your patch partially fixes it and my performance
can be brought back to what I had with 3.10-rc7, but with a lot of
concurrent streams. In fact, in 3.10-rc7, I managed to constantly
saturate the wire when transferring 7 concurrent streams (118.6 MB/s).
With the patch applied, performance is still only 27 MB/s at 7
concurrent streams, and I need at least 35 concurrent streams to fill
the pipe. Strangely, after 2 GB of cumulated data transferred, the
bandwidth dropped 11-fold and fell to 10 MB/s again.

If I revert both "0ae5f47eff tcp: TSQ can use a dynamic limit" and your
latest patch, the performance is back to the original.

Now I understand there's a major issue with the driver. But since the
patch emphasizes the situations where drivers take a lot of time to wake
the queue up, don't you think there could be an issue with low bandwidth
links (eg: PPPoE over xDSL, 10 Mbps ethernet, etc...)? I'm a bit worried
about what we might discover in this area, I must confess (despite
generally being mostly focused on 10+ Gbps).

Best regards,

Willy
On Wed, 2013-11-20 at 18:12 +0100, Willy Tarreau wrote:
> Hi guys,
>
> Eric, first I would like to confirm that I could reproduce Arnaud's issue
> using 3.10.19 (160 kB/s in the worst case).
>
> Second, I confirm that your patch partially fixes it and my performance
> can be brought back to what I had with 3.10-rc7, but with a lot of
> concurrent streams. In fact, in 3.10-rc7, I managed to constantly
> saturate the wire when transferring 7 concurrent streams (118.6 MB/s).
> With the patch applied, performance is still only 27 MB/s at 7
> concurrent streams, and I need at least 35 concurrent streams to fill
> the pipe. Strangely, after 2 GB of cumulated data transferred, the
> bandwidth dropped 11-fold and fell to 10 MB/s again.
>
> If I revert both "0ae5f47eff tcp: TSQ can use a dynamic limit" and your
> latest patch, the performance is back to the original.
>
> Now I understand there's a major issue with the driver. But since the
> patch emphasizes the situations where drivers take a lot of time to wake
> the queue up, don't you think there could be an issue with low bandwidth
> links (eg: PPPoE over xDSL, 10 Mbps ethernet, etc...)? I'm a bit worried
> about what we might discover in this area, I must confess (despite
> generally being mostly focused on 10+ Gbps).

Well, all TCP performance results are highly dependent on the workload,
and both receiver and sender behavior.

We made many improvements like TSO auto sizing, DRS (Dynamic Right
Sizing), and if the application uses some specific settings (like
SO_SNDBUF / SO_RCVBUF or other tweaks), we cannot guarantee that the
same exact performance is reached from kernel version X to kernel
version Y.

We try to make forward progress; there is little gain in reverting all
these great works. Linux had this tendency to favor throughput by using
overly large skbs. It's time to do better.

As explained, some drivers are buggy and need fixes.

If nobody wants to fix them, this really means no one is interested in
getting them fixed.

I am willing to help if you provide details, because otherwise I need a
crystal ball ;)

One known problem of TCP is the fact that an incoming ACK making room in
the socket write queue immediately wakes up a blocked thread (POLLOUT),
even if only one MSS was acked and the write queue has 2MB of
outstanding bytes.

All these scheduling problems should be identified and fixed, and yes,
this will require a dozen more patches.

max(128KB, 1-2 ms) of buffering per flow should be enough to reach line
rate, even for a single flow, but this means the sk_sndbuf value for the
socket must take into account the pipe size _plus_ 1ms of buffering.
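As a rough back-of-the-envelope check of that last rule (the numbers
below are my own assumptions for a typical GigE LAN, not figures from
the thread), the pipe size plus ~1 ms of buffering at line rate lands
in the same ballpark as the 256KB send window that worked best in
Arnaud's tests:

#include <stdio.h>

/* Sketch only: sndbuf ~= pipe + ~1 ms of extra buffering, assuming a
 * 1 Gb/s link and a 0.5 ms LAN round-trip time. */
int main(void)
{
	double line_rate_bps = 1e9;	/* assumed 1 Gb/s link */
	double rtt_s = 0.0005;		/* assumed 0.5 ms round-trip time */
	double extra_s = 0.001;		/* ~1 ms of extra buffering */

	double pipe_kb  = line_rate_bps / 8.0 * rtt_s / 1024.0;
	double extra_kb = line_rate_bps / 8.0 * extra_s / 1024.0;

	printf("pipe: %.0f KB, +1ms: %.0f KB, suggested sndbuf: ~%.0f KB\n",
	       pipe_kb, extra_kb, pipe_kb + extra_kb);
	return 0;
}

With these assumed numbers the result is roughly 60 KB of pipe plus
120 KB of buffering, i.e. around 180-190 KB total.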
On Wed, Nov 20, 2013 at 06:12:27PM +0100, Willy Tarreau wrote:
> Hi guys,
>
> On Sun, Nov 17, 2013 at 09:41:38AM -0800, Eric Dumazet wrote:
> > On Sun, 2013-11-17 at 15:19 +0100, Willy Tarreau wrote:
> >
> > > So it is fairly possible that in your case you can't fill the link if you
> > > consume too many descriptors. For example, if your server uses TCP_NODELAY
> > > and sends incomplete segments (which is quite common), it's very easy to
> > > run out of descriptors before the link is full.
> >
> > BTW I have a very simple patch for TCP stack that could help this exact
> > situation...
> >
> > Idea is to use TCP Small Queue so that we don't fill qdisc/TX ring with
> > very small frames, and let tcp_sendmsg() have more chance to fill
> > complete packets.
> >
> > Again, for this to work very well, you need that NIC performs TX
> > completion in reasonable amount of time...
>
> Eric, first I would like to confirm that I could reproduce Arnaud's issue
> using 3.10.19 (160 kB/s in the worst case).
>
> Second, I confirm that your patch partially fixes it and my performance
> can be brought back to what I had with 3.10-rc7, but with a lot of
> concurrent streams. In fact, in 3.10-rc7, I managed to constantly
> saturate the wire when transferring 7 concurrent streams (118.6 MB/s).
> With the patch applied, performance is still only 27 MB/s at 7
> concurrent streams, and I need at least 35 concurrent streams to fill
> the pipe. Strangely, after 2 GB of cumulated data transferred, the
> bandwidth dropped 11-fold and fell to 10 MB/s again.
>
> If I revert both "0ae5f47eff tcp: TSQ can use a dynamic limit" and your
> latest patch, the performance is back to the original.
>
> Now I understand there's a major issue with the driver. But since the
> patch emphasizes the situations where drivers take a lot of time to wake
> the queue up, don't you think there could be an issue with low bandwidth
> links (eg: PPPoE over xDSL, 10 Mbps ethernet, etc...)? I'm a bit worried
> about what we might discover in this area, I must confess (despite
> generally being mostly focused on 10+ Gbps).

One important point, I was looking for the other patch you pointed to in
this long thread and finally found it:

> So
> http://git.kernel.org/cgit/linux/kernel/git/davem/net.git/commit/?id=98e09386c0ef4dfd48af7ba60ff908f0d525cdee
>
> restored this minimal amount of buffering, and let the bigger amount for
> 40Gb NICs ;)

This one definitely restores the original performance, so it's a much
better bet in my opinion :-)

Best regards,

Willy
On Wed, Nov 20, 2013 at 09:30:07AM -0800, Eric Dumazet wrote:
> Well, all TCP performance results are highly dependent on the workload,
> and both receiver and sender behavior.
>
> We made many improvements like TSO auto sizing, DRS (Dynamic Right
> Sizing), and if the application uses some specific settings (like
> SO_SNDBUF / SO_RCVBUF or other tweaks), we cannot guarantee that the
> same exact performance is reached from kernel version X to kernel
> version Y.

Of course, which is why I only care when there's a significant
difference. If I need 6 streams in a version and 8 in another one to
fill the wire, I call them identical. It's only when we dig into the
details that we analyse the differences.

> We try to make forward progress; there is little gain in reverting all
> these great works. Linux had this tendency to favor throughput by using
> overly large skbs. It's time to do better.

I agree. Unfortunately our mails have crossed each other, so just to
keep this thread mostly linear, your next patch here:

  http://git.kernel.org/cgit/linux/kernel/git/davem/net.git/commit/?id=98e09386c0ef4dfd48af7ba60ff908f0d525cdee

fixes that regression and the performance is back to normal, which is
good.

> As explained, some drivers are buggy and need fixes.

Agreed!

> If nobody wants to fix them, this really means no one is interested in
> getting them fixed.

I was exactly reading the code when I found a window with your patch
above that I was looking for :-)

> I am willing to help if you provide details, because otherwise I need a
> crystal ball ;)
>
> One known problem of TCP is the fact that an incoming ACK making room in
> the socket write queue immediately wakes up a blocked thread (POLLOUT),
> even if only one MSS was acked and the write queue has 2MB of
> outstanding bytes.

Indeed.

> All these scheduling problems should be identified and fixed, and yes,
> this will require a dozen more patches.
>
> max(128KB, 1-2 ms) of buffering per flow should be enough to reach line
> rate, even for a single flow, but this means the sk_sndbuf value for the
> socket must take into account the pipe size _plus_ 1ms of buffering.

Which is the purpose of your patch above, and I confirm it fixes the
problem. Now looking at how to work around this lack of Tx IRQ.

Thanks!

Willy
On Wed, 2013-11-20 at 18:34 +0100, Willy Tarreau wrote:

> One important point, I was looking for the other patch you pointed to in
> this long thread and finally found it:
>
> > So
> > http://git.kernel.org/cgit/linux/kernel/git/davem/net.git/commit/?id=98e09386c0ef4dfd48af7ba60ff908f0d525cdee
> >
> > restored this minimal amount of buffering, and let the bigger amount for
> > 40Gb NICs ;)
>
> This one definitely restores the original performance, so it's a much
> better bet in my opinion :-)

I don't understand. I thought you were using this patch.

I guess we are spending time on an already solved problem.
On Wed, Nov 20, 2013 at 09:40:22AM -0800, Eric Dumazet wrote:
> On Wed, 2013-11-20 at 18:34 +0100, Willy Tarreau wrote:
>
> > One important point, I was looking for the other patch you pointed to in
> > this long thread and finally found it:
> >
> > > So
> > > http://git.kernel.org/cgit/linux/kernel/git/davem/net.git/commit/?id=98e09386c0ef4dfd48af7ba60ff908f0d525cdee
> > >
> > > restored this minimal amount of buffering, and let the bigger amount for
> > > 40Gb NICs ;)
> >
> > This one definitely restores the original performance, so it's a much
> > better bet in my opinion :-)
>
> I don't understand. I thought you were using this patch.

No, I was on the latest stable (3.10.19), which exhibits the regression
but does not yet have your fix above. Then I tested the patch you
proposed in this thread, then this latest one. Since the patch is not
yet even in Linus' tree, I'm not sure Arnaud has tried it yet.

> I guess we are spending time on an already solved problem.

That's possible indeed. Sorry if I was not clear enough, I tried.

Regards,
Willy
On Wed, 2013-11-20 at 19:15 +0100, Willy Tarreau wrote:

> No, I was on the latest stable (3.10.19), which exhibits the regression
> but does not yet have your fix above. Then I tested the patch you
> proposed in this thread, then this latest one. Since the patch is not
> yet even in Linus' tree, I'm not sure Arnaud has tried it yet.

Oh right ;)

BTW Linus' tree has the fix, I just checked this now.

(Linus got David's changes 19 hours ago)
On Wed, Nov 20, 2013 at 10:21:24AM -0800, Eric Dumazet wrote:
> On Wed, 2013-11-20 at 19:15 +0100, Willy Tarreau wrote:
>
> > No, I was on the latest stable (3.10.19), which exhibits the regression
> > but does not yet have your fix above. Then I tested the patch you
> > proposed in this thread, then this latest one. Since the patch is not
> > yet even in Linus' tree, I'm not sure Arnaud has tried it yet.
>
> Oh right ;)
>
> BTW Linus' tree has the fix, I just checked this now.
>
> (Linus got David's changes 19 hours ago)

Ah yes, I didn't look far enough back.

Thanks,
Willy
From: Eric Dumazet <eric.dumazet@gmail.com>
Date: Wed, 20 Nov 2013 09:30:07 -0800

> max(128KB, 1-2 ms) of buffering per flow should be enough to reach line
> rate, even for a single flow, but this means the sk_sndbuf value for the
> socket must take into account the pipe size _plus_ 1ms of buffering.

And we can implement this using the estimated pacing rate.
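A minimal sketch of that idea (hypothetical names and thresholds chosen
to mirror the max(128KB, ~1-2 ms) rule above, not the code that was
eventually merged): derive the per-flow queueing allowance from the
socket's estimated pacing rate, expressed in bytes per second like
sk_pacing_rate.

#include <stdio.h>

/* Sketch only: limit = max(128KB, ~1 ms worth of data at the pacing rate). */
static unsigned long queue_limit_from_pacing(unsigned long pacing_rate)
{
	unsigned long one_ms = pacing_rate / 1000;	/* ~1 ms of data */
	unsigned long floor  = 128 * 1024;		/* keep slow flows buffered enough */

	return one_ms > floor ? one_ms : floor;
}

int main(void)
{
	/* A 10 Mb/s, a 1 Gb/s and a 10 Gb/s flow, in bytes per second. */
	unsigned long rates[] = { 1250000UL, 125000000UL, 1250000000UL };

	for (int i = 0; i < 3; i++)
		printf("pacing %lu B/s -> per-flow limit %lu KB\n",
		       rates[i], queue_limit_from_pacing(rates[i]) / 1024);
	return 0;
}

With these numbers, the 10 Mb/s and 1 Gb/s flows stay at the 128 KB floor
while the 10 Gb/s flow is allowed about 1.2 MB, i.e. small flows no longer
pay for the buffering that only very fast NICs need.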
Hi Thomas,

I removed netdev from that reply.

arno@natisbad.org (Arnaud Ebalard) writes:

> Thomas Petazzoni <thomas.petazzoni@free-electrons.com> writes:
>
>> Arnaud,
>>
>> On Wed, 20 Nov 2013 00:53:43 +0100, Arnaud Ebalard wrote:
>>
>>> - It seems I will have to spend some time on the SATA issues I
>>>   previously thought were an artefact of not cleaning my tree during a
>>>   debug session [1], i.e. there is IMHO an issue.
>>
>> I don't remember in detail what your SATA problem was, but just to let
>> you know that we are currently debugging a problem that occurs on
>> Armada XP (more than one core is needed for the problem to occur) with
>> SATA (the symptom is that after some time of SATA usage, SATA traffic is
>> stalled, and no SATA interrupts are generated anymore). We're still
>> working on this one, and trying to figure out where the problem is.

The problem I had is in the first email of:

  http://thread.gmane.org/gmane.linux.ports.arm.kernel/271508

Then, yesterday, when testing with the USB 3.0 to ethernet dongle
connected to my RN102 (Armada 370) as primary interface, I got the
following. It is easily reproducible:

[  317.412873] ata1.00: exception Emask 0x0 SAct 0x3 SErr 0x0 action 0x6 frozen
[  317.419947] ata1.00: failed command: READ FPDMA QUEUED
[  317.425118] ata1.00: cmd 60/00:00:00:07:2a/01:00:00:00:00/40 tag 0 ncq 131072 in
[  317.425118]          res 40/00:00:00:00:00/00:00:00:00:00/00 Emask 0x4 (timeout)
[  317.439926] ata1.00: status: { DRDY }
[  317.443600] ata1.00: failed command: READ FPDMA QUEUED
[  317.448756] ata1.00: cmd 60/00:08:00:08:2a/01:00:00:00:00/40 tag 1 ncq 131072 in
[  317.448756]          res 40/00:00:00:00:00/00:00:00:00:00/00 Emask 0x4 (timeout)
[  317.463565] ata1.00: status: { DRDY }
[  317.467244] ata1: hard resetting link
[  318.012913] ata1: SATA link up 6.0 Gbps (SStatus 133 SControl 300)
[  318.020220] ata1.00: configured for UDMA/133
[  318.024559] ata1.00: device reported invalid CHS sector 0
[  318.030001] ata1.00: device reported invalid CHS sector 0
[  318.035425] ata1: EH complete

And then again:

[  381.342873] ata1.00: exception Emask 0x0 SAct 0x3 SErr 0x0 action 0x6 frozen
[  381.349947] ata1.00: failed command: READ FPDMA QUEUED
[  381.355119] ata1.00: cmd 60/00:00:00:03:30/01:00:00:00:00/40 tag 0 ncq 131072 in
[  381.355119]          res 40/00:00:00:00:00/00:00:00:00:00/00 Emask 0x4 (timeout)
[  381.369927] ata1.00: status: { DRDY }
[  381.373599] ata1.00: failed command: READ FPDMA QUEUED
[  381.378756] ata1.00: cmd 60/00:08:00:04:30/01:00:00:00:00/40 tag 1 ncq 131072 in
[  381.378756]          res 40/00:00:00:00:00/00:00:00:00:00/00 Emask 0x4 (timeout)
[  381.393563] ata1.00: status: { DRDY }
[  381.397242] ata1: hard resetting link
[  381.942848] ata1: SATA link up 6.0 Gbps (SStatus 133 SControl 300)
[  381.950167] ata1.00: configured for UDMA/133
[  381.954496] ata1.00: device reported invalid CHS sector 0
[  381.959958] ata1.00: device reported invalid CHS sector 0
[  381.965383] ata1: EH complete

But, as the problem seems to happen when the dongle is connected and in
use (and considering current threads on the topic on the USB and netdev
MLs), I think I will wait for things to calm down and test again with a
3.12 and then a 3.13-rcX.

Anyway, if you find something on your bug, I can give patches a try on
my RN2120.

Cheers,

a+
Hi,

Willy Tarreau <w@1wt.eu> writes:

> On Wed, Nov 20, 2013 at 09:40:22AM -0800, Eric Dumazet wrote:
>> On Wed, 2013-11-20 at 18:34 +0100, Willy Tarreau wrote:
>>
>> > One important point, I was looking for the other patch you pointed to in
>> > this long thread and finally found it:
>> >
>> > > So
>> > > http://git.kernel.org/cgit/linux/kernel/git/davem/net.git/commit/?id=98e09386c0ef4dfd48af7ba60ff908f0d525cdee
>> > >
>> > > restored this minimal amount of buffering, and let the bigger amount for
>> > > 40Gb NICs ;)
>> >
>> > This one definitely restores the original performance, so it's a much
>> > better bet in my opinion :-)
>>
>> I don't understand. I thought you were using this patch.
>
> No, I was on the latest stable (3.10.19), which exhibits the regression
> but does not yet have your fix above. Then I tested the patch you
> proposed in this thread, then this latest one. Since the patch is not
> yet even in Linus' tree, I'm not sure Arnaud has tried it yet.

This is the one I have used for a week now, since its publication by
Eric a week ago:

  http://www.spinics.net/lists/netdev/msg257396.html

Cheers,

a+
diff --git a/net/ipv4/tcp.c b/net/ipv4/tcp.c
index 3dc0c6c..10456cf 100644
--- a/net/ipv4/tcp.c
+++ b/net/ipv4/tcp.c
@@ -624,13 +624,19 @@ static inline void tcp_push(struct sock *sk, int flags, int mss_now,
 {
 	if (tcp_send_head(sk)) {
 		struct tcp_sock *tp = tcp_sk(sk);
+		struct sk_buff *skb = tcp_write_queue_tail(sk);

 		if (!(flags & MSG_MORE) || forced_push(tp))
-			tcp_mark_push(tp, tcp_write_queue_tail(sk));
+			tcp_mark_push(tp, skb);

 		tcp_mark_urg(tp, flags);
-		__tcp_push_pending_frames(sk, mss_now,
-					  (flags & MSG_MORE) ? TCP_NAGLE_CORK : nonagle);
+		if (flags & MSG_MORE)
+			nonagle = TCP_NAGLE_CORK;
+		if (atomic_read(&sk->sk_wmem_alloc) > 2048) {
+			set_bit(TSQ_THROTTLED, &tp->tsq_flags);
+			nonagle = TCP_NAGLE_CORK;
+		}
+		__tcp_push_pending_frames(sk, mss_now, nonagle);
 	}
 }