Message ID | 20131120191145.GP8581@1wt.eu (mailing list archive)
---|---
State | New, archived
Hi,

Willy Tarreau <w@1wt.eu> writes:

> first, thanks for all these tests.
>
> On Wed, Nov 20, 2013 at 12:53:43AM +0100, Arnaud Ebalard wrote:
> (...)
>> In the end, here are the conclusions *I* draw from this test session,
>> do not hesitate to correct me:
>>
>> - Eric, it seems something changed in Linus' tree between the
>>   beginning of the thread and now, which somehow reduces the effect
>>   of the regression we were seeing: I never got back the 256KB/s.
>> - Your revert patch still improves the perf a lot.
>> - It seems reducing MVNETA_TX_DONE_TIMER_PERIOD does not help.
>> - w/ your revert patch, I can confirm that the mvneta driver is
>>   capable of doing line rate w/ a proper tweak of the TCP send window
>>   (256KB instead of 4M).
>> - It seems I will have to spend some time on the SATA issues I
>>   previously thought were an artefact of not cleaning my tree during
>>   a debug session [1], i.e. there is IMHO an issue.
>
> Could you please try Eric's patch that was just merged into Linus'
> tree, if it was not yet in the kernel you tried:
>
>   98e09386c0e tcp: tsq: restore minimal amount of queueing

I have it in my quilt set.

> For me it restored the original performance (I saturate the Gbps with
> about 7 concurrent streams).
>
> Further, I wrote the small patch below for mvneta. I'm not sure it's
> SMP-safe but it's a PoC. In mvneta_poll(), which currently is only
> called upon Rx interrupt, it tries to flush all possible remaining Tx
> descriptors if any. That significantly improved my transfer rate: I now
> easily achieve 1 Gbps using a single TCP stream on the Mirabox. Not
> tried on the AX3 yet.
>
> It also increased the overall connection rate by 10% on empty HTTP
> responses (small packets), very likely by reducing the dead time
> between some segments.
>
> You'll probably want to give it a try, so here it comes.

hehe, I was falling short of patches to test tonight ;-) I will give it
a try now.

Cheers,

a+
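For context on what Eric's TSQ patch changes: as best I recall, after 98e09386c0e the per-socket queue limit in tcp_write_xmit() allows at least two full skbs in the qdisc/device queues, capped by the tcp_limit_output_bytes sysctl. The sketch below is a rough userspace model of that computation, reconstructed from memory, not kernel code; check net/ipv4/tcp_output.c for the authoritative version.

```python
# Rough model of the TSQ per-socket queue limit after commit 98e09386c0e
# ("tcp: tsq: restore minimal amount of queueing"). Formula reconstructed
# from memory -- verify against tcp_write_xmit() in net/ipv4/tcp_output.c.

TCP_LIMIT_OUTPUT_BYTES = 131072  # default net.ipv4.tcp_limit_output_bytes

def tsq_limit(skb_truesize: int, pacing_rate_bytes_per_s: int) -> int:
    """Bytes one socket may keep queued below the TCP stack.

    Guaranteeing at least two full skbs keeps slow or bursty drivers
    (like mvneta with its Tx-done timer) from being starved down to a
    single in-flight buffer between completions.
    """
    # pacing_rate >> 10 approximates ~1 ms worth of data at that rate
    limit = max(2 * skb_truesize, pacing_rate_bytes_per_s >> 10)
    return min(limit, TCP_LIMIT_OUTPUT_BYTES)

# A 64KB TSO skb still gets two buffers queued even at modest rates:
print(tsq_limit(skb_truesize=65536, pacing_rate_bytes_per_s=10_000_000))
```

The key property for this thread is the `2 * skb_truesize` floor: before the patch, a driver that completed Tx descriptors only on a 10 ms timer could sit idle with nothing queued, which matches the jerky throughput Arnaud measured.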
Hi,

Willy Tarreau <w@1wt.eu> writes:

> From d1a00e593841223c7d871007b1e1fc528afe8e4d Mon Sep 17 00:00:00 2001
> From: Willy Tarreau <w@1wt.eu>
> Date: Wed, 20 Nov 2013 19:47:11 +0100
> Subject: EXP: net: mvneta: try to flush Tx descriptor queue upon Rx
>  interrupts
>
> Right now the mvneta driver doesn't handle Tx IRQs, and solely relies
> on a timer to flush Tx descriptors. This causes jerky output traffic
> with bursts and pauses, making it difficult to reach line rate with
> very few streams. This patch tries to improve the situation, which is
> complicated by the lack of a public datasheet from Marvell. The
> workaround consists in trying to flush pending buffers during Rx
> polling. The idea is that for symmetric TCP traffic, the ACKs received
> in response to the packets sent will trigger the Rx interrupt and
> anticipate the flushing of the descriptors.
>
> The results are quite good: a single TCP stream is now capable of
> saturating a gigabit link.
>
> This is only a workaround; it doesn't address asymmetric traffic nor
> datagram-based traffic.
>
> Signed-off-by: Willy Tarreau <w@1wt.eu>
> ---
>  drivers/net/ethernet/marvell/mvneta.c | 20 ++++++++++++++++++++
>  1 file changed, 20 insertions(+)
>
> diff --git a/drivers/net/ethernet/marvell/mvneta.c b/drivers/net/ethernet/marvell/mvneta.c
> index 5aed8ed..59e1c86 100644
> --- a/drivers/net/ethernet/marvell/mvneta.c
> +++ b/drivers/net/ethernet/marvell/mvneta.c
> @@ -2013,6 +2013,26 @@ static int mvneta_poll(struct napi_struct *napi, int budget)
>  	}
>
>  	pp->cause_rx_tx = cause_rx_tx;
> +
> +	/* Try to flush pending Tx buffers if any */
> +	if (test_bit(MVNETA_F_TX_DONE_TIMER_BIT, &pp->flags)) {
> +		int tx_todo = 0;
> +
> +		mvneta_tx_done_gbe(pp,
> +				   (((1 << txq_number) - 1) &
> +				    MVNETA_CAUSE_TXQ_SENT_DESC_ALL_MASK),
> +				   &tx_todo);
> +
> +		if (tx_todo > 0) {
> +			mod_timer(&pp->tx_done_timer,
> +				  jiffies + msecs_to_jiffies(MVNETA_TX_DONE_TIMER_PERIOD));
> +		}
> +		else {
> +			clear_bit(MVNETA_F_TX_DONE_TIMER_BIT, &pp->flags);
> +			del_timer(&pp->tx_done_timer);
> +		}
> +	}
> +
>  	return rx_done;
>  }

With the current Linus tree (head being b4789b8e: aacraid: prevent
invalid pointer dereference), as a baseline here is what I get:

w/ tcp_wmem left w/ default values (4096 16384 4071360):

  via netperf (TCP_MAERTS/TCP_STREAM): 151.13 / 935.50 Mbits/s
  via wget against apache: 15.4 MB/s
  via wget against nginx: 104 MB/s

w/ tcp_wmem set to 4096 16384 262144:

  via netperf (TCP_MAERTS/TCP_STREAM): 919.89 / 935.50 Mbits/s
  via wget against apache: 63.3 MB/s
  via wget against nginx: 104 MB/s

With your patch on top of it (and tcp_wmem kept at its default value):

  via netperf: 939.16 / 935.44 Mbits/s
  via wget against apache: 65.9 MB/s (top reports 69.5 sy, 30.1 si
    and 72% CPU for apache2)
  via wget against nginx: 106 MB/s

With your patch and MVNETA_TX_DONE_TIMER_PERIOD set to 1 instead of 10
(still w/ tcp_wmem kept at its default value):

  via netperf: 939.12 / 935.84 Mbits/s
  via wget against apache: 63.7 MB/s
  via wget against nginx: 108 MB/s

So:

- First, Eric's patch sitting in Linus' tree does fix the regression I
  had on 3.11.7 and early 3.12 (15.4 MB/s vs 256KB/s).

- As can be seen in the results of the first test, Eric's patch still
  requires some additional tweaking of tcp_wmem to get netperf and
  apache somewhat happy w/ perfectible drivers (63.3 MB/s instead of
  15.4 MB/s by setting the max TCP send buffer space to 256KB for
  apache).

- For unknown reasons, nginx manages to provide a 104 MB/s download
  rate even with tcp_wmem set to its default and no specific patch of
  mvneta.

- Now, Willy's patch seems to make netperf happy (link saturated from
  server to client), w/o tweaking tcp_wmem.

- Again, with Willy's patch I guess the "limitations" of the platform
  (1.2GHz CPU w/ 512MB of RAM) somehow prevent Apache from saturating
  the link. All I can say is that the same test some months ago on a
  1.6GHz ARMv5TE (Kirkwood 88f6282) w/ 256MB of RAM gave me 108 MB/s.
  I do not know if it is some apache regression, some mvneta vs
  mv643xx_eth difference or some CPU frequency issue, but having
  netperf and nginx happy makes me wonder about Apache.

- Willy, setting MVNETA_TX_DONE_TIMER_PERIOD to 1 instead of 10 w/ your
  patch does not improve the already good value I get w/ your patch.

In the end, if you iterate on your work to push a version of your patch
upstream, I'll be happy to test it. And thanks for the time you already
spent!

Cheers,

a+
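As a sanity check on why a 256KB tcp_wmem cap is enough for line rate here: the send buffer only needs to cover the bandwidth-delay product of the path. The RTT figures below are illustrative assumptions for a LAN path, not measurements from this thread:

```python
# Bandwidth-delay product: the amount of unacknowledged data needed in
# flight to keep a link busy. RTT values are assumed, not measured.

def bdp_bytes(rate_bits_per_s: float, rtt_s: float) -> float:
    """Return the bandwidth-delay product in bytes."""
    return rate_bits_per_s * rtt_s / 8

GIGABIT = 1_000_000_000
print(bdp_bytes(GIGABIT, 0.001))  # 1 ms RTT -> 125000.0 bytes
print(bdp_bytes(GIGABIT, 0.002))  # 2 ms RTT -> 250000.0 bytes, ~256KB
```

So on a local gigabit path a 262144-byte maximum already covers a couple of milliseconds of RTT; the 4MB default only matters on much longer paths, and on a driver with delayed Tx completion it mostly grows the queue without adding throughput.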
Hi Arnaud,

On Wed, Nov 20, 2013 at 10:28:50PM +0100, Arnaud Ebalard wrote:
> With the current Linus tree (head being b4789b8e: aacraid: prevent
> invalid pointer dereference), as a baseline here is what I get:
>
> w/ tcp_wmem left w/ default values (4096 16384 4071360):
>
>   via netperf (TCP_MAERTS/TCP_STREAM): 151.13 / 935.50 Mbits/s
>   via wget against apache: 15.4 MB/s
>   via wget against nginx: 104 MB/s
>
> w/ tcp_wmem set to 4096 16384 262144:
>
>   via netperf (TCP_MAERTS/TCP_STREAM): 919.89 / 935.50 Mbits/s
>   via wget against apache: 63.3 MB/s
>   via wget against nginx: 104 MB/s
>
> With your patch on top of it (and tcp_wmem kept at its default value):
>
>   via netperf: 939.16 / 935.44 Mbits/s
>   via wget against apache: 65.9 MB/s (top reports 69.5 sy, 30.1 si
>     and 72% CPU for apache2)
>   via wget against nginx: 106 MB/s
>
> With your patch and MVNETA_TX_DONE_TIMER_PERIOD set to 1 instead of 10
> (still w/ tcp_wmem kept at its default value):
>
>   via netperf: 939.12 / 935.84 Mbits/s
>   via wget against apache: 63.7 MB/s
>   via wget against nginx: 108 MB/s
>
> So:
>
> - First, Eric's patch sitting in Linus' tree does fix the regression
>   I had on 3.11.7 and early 3.12 (15.4 MB/s vs 256KB/s).
>
> - As can be seen in the results of the first test, Eric's patch still
>   requires some additional tweaking of tcp_wmem to get netperf and
>   apache somewhat happy w/ perfectible drivers (63.3 MB/s instead of
>   15.4 MB/s by setting the max TCP send buffer space to 256KB for
>   apache).
>
> - For unknown reasons, nginx manages to provide a 104 MB/s download
>   rate even with tcp_wmem set to its default and no specific patch of
>   mvneta.
>
> - Now, Willy's patch seems to make netperf happy (link saturated from
>   server to client), w/o tweaking tcp_wmem.
>
> - Again, with Willy's patch I guess the "limitations" of the platform
>   (1.2GHz CPU w/ 512MB of RAM) somehow prevent Apache from saturating
>   the link. All I can say is that the same test some months ago on a
>   1.6GHz ARMv5TE (Kirkwood 88f6282) w/ 256MB of RAM gave me 108 MB/s.
>   I do not know if it is some apache regression, some mvneta vs
>   mv643xx_eth difference or some CPU frequency issue, but having
>   netperf and nginx happy makes me wonder about Apache.
>
> - Willy, setting MVNETA_TX_DONE_TIMER_PERIOD to 1 instead of 10 w/
>   your patch does not improve the already good value I get w/ your
>   patch.

Great, thanks for your detailed tests!

Concerning Apache, it's common to see it consume more CPU than others,
which makes it more sensitive on small devices like these ones (which
BTW have a very small cache and only a 16-bit RAM bus). Please also note
that there could be a number of other differences, such as Apache always
using TCP_NODELAY, resulting in sending incomplete segments at the end
of each buffer, which consumes slightly more descriptors.

> In the end, if you iterate on your work to push a version of your
> patch upstream, I'll be happy to test it. And thanks for the time you
> already spent!

I'm currently trying to implement Tx IRQ handling. I found the register
descriptions in the neta driver that is provided in Marvell's LSP
kernel, which is shipped with some devices using their CPUs. That code
is utterly broken (eg: splice fails with -EBADF) but I think the
register descriptions can be trusted. I'd rather have real IRQ handling
than just relying on mvneta_poll(), so that we can use it for
asymmetric traffic/routing/whatever.

Regards,
Willy
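Willy's point about Apache and TCP_NODELAY is easy to reproduce from userspace: disabling Nagle is a one-line setsockopt, after which sub-MSS writes go out immediately instead of being coalesced, so the tail of each buffer can leave the NIC as an incomplete segment. A generic sketch (not Apache's actual code):

```python
import socket

# Minimal illustration of what a server does when it enables TCP_NODELAY:
# Nagle's algorithm is disabled on the socket, so small writes are sent
# right away rather than waiting to be merged into full-size segments.
s = socket.socket(socket.AF_INET, socket.SOCK_STREAM)
s.setsockopt(socket.IPPROTO_TCP, socket.TCP_NODELAY, 1)

# getsockopt returns a non-zero value once the option is set
print(s.getsockopt(socket.IPPROTO_TCP, socket.TCP_NODELAY))
s.close()
```

Each such partial segment consumes its own Tx descriptor, which is why a TCP_NODELAY-heavy workload is slightly harder on a driver whose Tx completion is already lazy.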
diff --git a/drivers/net/ethernet/marvell/mvneta.c b/drivers/net/ethernet/marvell/mvneta.c
index 5aed8ed..59e1c86 100644
--- a/drivers/net/ethernet/marvell/mvneta.c
+++ b/drivers/net/ethernet/marvell/mvneta.c
@@ -2013,6 +2013,26 @@ static int mvneta_poll(struct napi_struct *napi, int budget)
 	}
 
 	pp->cause_rx_tx = cause_rx_tx;
+
+	/* Try to flush pending Tx buffers if any */
+	if (test_bit(MVNETA_F_TX_DONE_TIMER_BIT, &pp->flags)) {
+		int tx_todo = 0;
+
+		mvneta_tx_done_gbe(pp,
+				   (((1 << txq_number) - 1) &
+				    MVNETA_CAUSE_TXQ_SENT_DESC_ALL_MASK),
+				   &tx_todo);
+
+		if (tx_todo > 0) {
+			mod_timer(&pp->tx_done_timer,
+				  jiffies + msecs_to_jiffies(MVNETA_TX_DONE_TIMER_PERIOD));
+		}
+		else {
+			clear_bit(MVNETA_F_TX_DONE_TIMER_BIT, &pp->flags);
+			del_timer(&pp->tx_done_timer);
+		}
+	}
+
 	return rx_done;
 }