Message ID | 20131121004430.GX8581@1wt.eu (mailing list archive)
-----------|-----------------------------------------------------
State      | New, archived
Dear Willy Tarreau,

On Thu, 21 Nov 2013 19:38:34 +0100, Willy Tarreau wrote:

> While we were diagnosing a network performance regression that we finally
> found and fixed, it appeared during a test that Linus' tree shows a much
> higher performance on Armada 370 (armv7) than its predecessors. I can
> saturate the two Gig links of my Mirabox each with a single TCP flow and
> keep up to 25% of idle CPU in the optimal case. In 3.12.1 or 3.10.20, I
> can achieve around 1.3 Gbps when the two ports are used in parallel.

Interesting finding and analysis, once again!

> I'm not at ease with these things so I'd like to ask your opinion here, is
> this supposed to be an improvement or a fix ? Is this something we should
> backport into stable versions, or is there something to fix in the armada
> platform so that it works just as if the patch was applied ?

I guess the driver should have been setting its dma_mask to 0xffffffff,
since the platform is capable of doing DMA on the first 32 bits of the
physical address space, probably something like calling
pci_set_dma_mask(pdev, DMA_BIT_MASK(32)) or something like that.

I know Russell has recently added some helpers to prevent stupid people
(like me) from doing mistakes when setting the DMA masks. Certainly worth
having a look.

Best regards,

Thomas
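For readers following along, below is a minimal sketch of the kind of fix Thomas is describing, expressed with the generic DMA API since mvneta is a platform device rather than a PCI one. The probe function is hypothetical, and dma_set_mask_and_coherent() is one of the recently added helpers he refers to; this is not the actual mvneta change (Willy's real change appears later in the thread).

    #include <linux/dma-mapping.h>
    #include <linux/platform_device.h>

    /* Hypothetical probe, only to illustrate the suggestion above: declare
     * that the device can DMA anywhere in the first 4 GB of physical memory.
     * dma_set_mask_and_coherent() sets both dev->dma_mask and
     * dev->coherent_dma_mask, and fails with -EIO if the platform code never
     * pointed dev->dma_mask anywhere, which is the situation discussed in
     * this thread.
     */
    static int example_probe(struct platform_device *pdev)
    {
            int ret;

            ret = dma_set_mask_and_coherent(&pdev->dev, DMA_BIT_MASK(32));
            if (ret)
                    return ret;

            /* ... rest of the probe ... */
            return 0;
    }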
Hi,

Willy Tarreau <w@1wt.eu> writes:

> OK it paid off. And very well :-)
>
> I did it at once and it worked immediately. I generally don't like this
> because I always fear that some bug was left there hidden in the code. I have
> only tested it on the Mirabox, so I'll have to try on the OpenBlocks AX3-4 and
> on the XP-GP board for some SMP stress tests.
>
> I upgraded my Mirabox to latest Linus' git (commit 5527d151) and compared
> with and without the patch.
>
> without :
> - need at least 12 streams to reach gigabit.
> - 60% of idle CPU remains at 1 Gbps
> - HTTP connection rate on empty objects is 9950 connections/s
> - cumulated outgoing traffic on two ports reaches 1.3 Gbps
>
> with the patch :
> - a single stream easily saturates the gigabit
> - 87% of idle CPU at 1 Gbps (12 streams, 90% idle at 1 stream)
> - HTTP connection rate on empty objects is 10250 connections/s
> - I saturate the two gig ports at 99% CPU, so 2 Gbps sustained output.
>
> BTW I must say I was impressed to see that big an improvement in CPU
> usage between 3.10 and 3.13, I suspect some of the Tx queue improvements
> that Eric has done in between account for this.
>
> I cut the patch in 3 parts :
> - one which reintroduces the hidden bits of the driver
> - one which replaces the timer with the IRQ
> - one which changes the default Tx coalesce from 16 to 4 packets
>   (larger was preferred with the timer, but less is better now).
>
> I'm attaching them, please test them on your device.

Well, on the RN102 (Armada 370), I get the same results as with your
previous patch, i.e. netperf and nginx saturate the link. Apache still
lagging behind though.

> Note that this is *not* for inclusion at the moment as it has not been
> tested on the SMP CPUs.

I tested it on my RN2120 (2-core armada XP): I got no problem and the
link saturated w/ apache, nginx and netperf. Good work!

Cheers,

a+
Hi Thomas,

On Thu, Nov 21, 2013 at 08:04:33PM +0100, Thomas Petazzoni wrote:
> > I'm not at ease with these things so I'd like to ask your opinion here, is
> > this supposed to be an improvement or a fix ? Is this something we should
> > backport into stable versions, or is there something to fix in the armada
> > platform so that it works just as if the patch was applied ?
>
> I guess the driver should have been setting its dma_mask to 0xffffffff,
> since the platform is capable of doing DMA on the first 32 bits of the
> physical address space, probably something like calling
> pci_set_dma_mask(pdev, DMA_BIT_MASK(32)) or something like that.

Almost, yes. Thanks for the tip! There are so few drivers which do this
that I was convinced something was missing (nobody initializes dma_mask
on this platform), so calls to dma_set_mask() from drivers return -EIO
and are ignored. I ended up with this at the end of mvneta_init():

	/* setting DMA mask significantly improves transfer rates */
	pp->dev->dev.parent->coherent_dma_mask = DMA_BIT_MASK(32);
	pp->dev->dev.parent->dma_mask = &pp->dev->dev.parent->coherent_dma_mask;

This changed in 3.13 with Russell's commit fa6a8d6 (DMA-API: provide a
helper to setup DMA masks), which does it in a cleaner and safer way
using dma_coerce_mask_and_coherent(). Then Rob's commit 0589342 (of: set
dma_mask to point to coherent_dma_mask), also merged for 3.13,
pre-initialized dma_mask to point to &coherent_dma_mask for all devices
by default.

> I know
> Russell has recently added some helpers to prevent stupid people (like
> me) from doing mistakes when setting the DMA masks. Certainly worth
> having a look.

My change now lets me proxy HTTP traffic at 1 Gbps between the two ports
of the Mirabox, whereas it is limited to 650 Mbps without the change. But
it's not needed in mainline anymore. However, it might be worth having in
older kernels (I don't know if it's suitable for stable since I don't
know whether this counts as a bug), or at least in your own kernels if
you have to maintain an older branch for some customers.

That said, I tend to believe that applying Rob's patch will be better
than just the change above, since it will cover all drivers, not only
mvneta. I'll have to test on the AX3 and the XP-GP to see the performance
gain in SMP and when using PCIe.

Best regards,
Willy
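For completeness, here is the same effect expressed with the helper mentioned above, dma_coerce_mask_and_coherent(), which forces dma_mask to point at coherent_dma_mask and then sets both masks. The wrapper function below is only a sketch; its name, and the idea of calling it from mvneta_init(), are assumptions rather than a posted patch.

    #include <linux/dma-mapping.h>

    /* Sketch only: equivalent of the two manual assignments above, using
     * the helper introduced by commit fa6a8d6. 'pp' is the driver's
     * struct mvneta_port, as in the snippet above; the wrapper name is
     * made up for illustration.
     */
    static int mvneta_set_dma_mask_sketch(struct mvneta_port *pp)
    {
            return dma_coerce_mask_and_coherent(pp->dev->dev.parent,
                                                DMA_BIT_MASK(32));
    }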
On Thu, Nov 21, 2013 at 10:51:09PM +0100, Arnaud Ebalard wrote:
> Hi,
>
> Willy Tarreau <w@1wt.eu> writes:
>
> > OK it paid off. And very well :-)
> >
> > I did it at once and it worked immediately. I generally don't like this
> > because I always fear that some bug was left there hidden in the code. I have
> > only tested it on the Mirabox, so I'll have to try on the OpenBlocks AX3-4 and
> > on the XP-GP board for some SMP stress tests.
> >
> > I upgraded my Mirabox to latest Linus' git (commit 5527d151) and compared
> > with and without the patch.
> >
> > without :
> > - need at least 12 streams to reach gigabit.
> > - 60% of idle CPU remains at 1 Gbps
> > - HTTP connection rate on empty objects is 9950 connections/s
> > - cumulated outgoing traffic on two ports reaches 1.3 Gbps
> >
> > with the patch :
> > - a single stream easily saturates the gigabit
> > - 87% of idle CPU at 1 Gbps (12 streams, 90% idle at 1 stream)
> > - HTTP connection rate on empty objects is 10250 connections/s
> > - I saturate the two gig ports at 99% CPU, so 2 Gbps sustained output.
> >
> > BTW I must say I was impressed to see that big an improvement in CPU
> > usage between 3.10 and 3.13, I suspect some of the Tx queue improvements
> > that Eric has done in between account for this.
> >
> > I cut the patch in 3 parts :
> > - one which reintroduces the hidden bits of the driver
> > - one which replaces the timer with the IRQ
> > - one which changes the default Tx coalesce from 16 to 4 packets
> >   (larger was preferred with the timer, but less is better now).
> >
> > I'm attaching them, please test them on your device.
>
> Well, on the RN102 (Armada 370), I get the same results as with your
> previous patch, i.e. netperf and nginx saturate the link. Apache still
> lagging behind though.
>
> > Note that this is *not* for inclusion at the moment as it has not been
> > tested on the SMP CPUs.
>
> I tested it on my RN2120 (2-core armada XP): I got no problem and the
> link saturated w/ apache, nginx and netperf. Good work!

Great, thanks for your tests Arnaud. I forgot to mention that all my
tests this evening involved this patch as well.

Cheers,
Willy
On Thu, 2013-11-21 at 22:52 +0100, Willy Tarreau wrote:
> On Thu, Nov 21, 2013 at 10:51:09PM +0100, Arnaud Ebalard wrote:
> > Hi,
> >
> > Willy Tarreau <w@1wt.eu> writes:
> >
> > > OK it paid off. And very well :-)
> > >
> > > I did it at once and it worked immediately. I generally don't like this
> > > because I always fear that some bug was left there hidden in the code. I have
> > > only tested it on the Mirabox, so I'll have to try on the OpenBlocks AX3-4 and
> > > on the XP-GP board for some SMP stress tests.
> > >
> > > I upgraded my Mirabox to latest Linus' git (commit 5527d151) and compared
> > > with and without the patch.
> > >
> > > without :
> > > - need at least 12 streams to reach gigabit.
> > > - 60% of idle CPU remains at 1 Gbps
> > > - HTTP connection rate on empty objects is 9950 connections/s
> > > - cumulated outgoing traffic on two ports reaches 1.3 Gbps
> > >
> > > with the patch :
> > > - a single stream easily saturates the gigabit
> > > - 87% of idle CPU at 1 Gbps (12 streams, 90% idle at 1 stream)
> > > - HTTP connection rate on empty objects is 10250 connections/s
> > > - I saturate the two gig ports at 99% CPU, so 2 Gbps sustained output.
> > >
> > > BTW I must say I was impressed to see that big an improvement in CPU
> > > usage between 3.10 and 3.13, I suspect some of the Tx queue improvements
> > > that Eric has done in between account for this.
> > >
> > > I cut the patch in 3 parts :
> > > - one which reintroduces the hidden bits of the driver
> > > - one which replaces the timer with the IRQ
> > > - one which changes the default Tx coalesce from 16 to 4 packets
> > >   (larger was preferred with the timer, but less is better now).
> > >
> > > I'm attaching them, please test them on your device.
> >
> > Well, on the RN102 (Armada 370), I get the same results as with your
> > previous patch, i.e. netperf and nginx saturate the link. Apache still
> > lagging behind though.
> >
> > > Note that this is *not* for inclusion at the moment as it has not been
> > > tested on the SMP CPUs.
> >
> > I tested it on my RN2120 (2-core armada XP): I got no problem and the
> > link saturated w/ apache, nginx and netperf. Good work!
>
> Great, thanks for your tests Arnaud. I forgot to mention that all my
> tests this evening involved this patch as well.

Now you might try to set a lower value for
/proc/sys/net/ipv4/tcp_limit_output_bytes

Ideally, a value of 8192 (instead of 131072) allows queueing less data
per TCP flow, and reacting faster to losses, as retransmits don't have to
wait until the previous packets in the qdisc have left the host.

131072 bytes for an 80 Mbit flow means more than 11 ms of queueing :(
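As a quick sanity check on that last figure, here is a standalone back-of-the-envelope computation; the 131072-byte default and the 80 Mbit/s flow rate are simply the values from the message above.

    #include <stdio.h>

    /* Rough queueing-delay estimate: tcp_limit_output_bytes worth of data
     * queued for a single flow, drained at the flow's own rate.
     */
    int main(void)
    {
            const double limit_bytes = 131072.0;  /* default tcp_limit_output_bytes */
            const double rate_bps    = 80e6;      /* 80 Mbit/s flow from the example */
            double delay_ms = limit_bytes * 8.0 / rate_bps * 1e3;

            printf("%.1f ms of queueing\n", delay_ms);  /* ~13 ms, i.e. "more than 11 ms" */
            return 0;
    }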
On 11/21/2013 12:38 PM, Willy Tarreau wrote:
> Hi Rob,
>
> While we were diagnosing a network performance regression that we finally
> found and fixed, it appeared during a test that Linus' tree shows a much
> higher performance on Armada 370 (armv7) than its predecessors. I can
> saturate the two Gig links of my Mirabox each with a single TCP flow and
> keep up to 25% of idle CPU in the optimal case. In 3.12.1 or 3.10.20, I
> can achieve around 1.3 Gbps when the two ports are used in parallel.
>
> Today I bisected these kernels to find what was causing this difference.
> I found it was your patch below which I can copy entirely here :
>
> commit 0589342c27944e50ebd7a54f5215002b6598b748
> Author: Rob Herring <rob.herring@calxeda.com>
> Date:   Tue Oct 29 23:36:46 2013 -0500
>
>     of: set dma_mask to point to coherent_dma_mask
>
>     Platform devices created by DT code don't initialize dma_mask pointer to
>     anything. Set it to coherent_dma_mask by default if the architecture
>     code has not set it.
>
>     Signed-off-by: Rob Herring <rob.herring@calxeda.com>
>
> diff --git a/drivers/of/platform.c b/drivers/of/platform.c
> index 9b439ac..c005495 100644
> --- a/drivers/of/platform.c
> +++ b/drivers/of/platform.c
> @@ -216,6 +216,8 @@ static struct platform_device *of_platform_device_create_pdata(
>  	dev->archdata.dma_mask = 0xffffffffUL;
>  #endif
>  	dev->dev.coherent_dma_mask = DMA_BIT_MASK(32);
> +	if (!dev->dev.dma_mask)
> +		dev->dev.dma_mask = &dev->dev.coherent_dma_mask;
>  	dev->dev.bus = &platform_bus_type;
>  	dev->dev.platform_data = platform_data;
>
> And I can confirm that applying this patch on 3.10.20 + the fixes we found
> yesterday substantially boosted my network performance (and reduced the CPU
> usage when running on a single link).
>
> I'm not at ease with these things so I'd like to ask your opinion here, is
> this supposed to be an improvement or a fix ? Is this something we should
> backport into stable versions, or is there something to fix in the armada
> platform so that it works just as if the patch was applied ?

The patch was to fix this issue[1]. It is fixed in the core code because
dma_mask not being set has been a known issue with DT probing for some
time. Since most drivers don't seem to care, we've gotten away with it.
I thought the normal failure mode was drivers failing to probe.

As to why it helps performance, I'm not really sure. Perhaps it is
causing some bounce buffers to be used.

Rob

[1] http://lists.xen.org/archives/html/xen-devel/2013-10/msg00092.html
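The "drivers failing to probe" failure mode follows from the shape of the generic mask-setting helper. The snippet below is a simplified sketch of what dma_set_mask() looked like in kernels of that era, renamed to make clear it is a paraphrase rather than a verbatim copy.

    /* Simplified sketch of the generic dma_set_mask() of that era: if the
     * bus or platform code never pointed dev->dma_mask anywhere, the call
     * fails with -EIO, so a careful driver aborts its probe while a
     * careless one silently keeps whatever default behaviour it gets.
     */
    static inline int dma_set_mask_sketch(struct device *dev, u64 mask)
    {
            if (!dev->dma_mask || !dma_supported(dev, mask))
                    return -EIO;
            *dev->dma_mask = mask;
            return 0;
    }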
On Thu, Nov 21, 2013 at 04:01:42PM -0600, Rob Herring wrote:
> The patch was to fix this issue[1]. It is fixed in the core code because
> dma_mask not being set has been a known issue with DT probing for some
> time. Since most drivers don't seem to care, we've gotten away with it.
> I thought the normal failure mode was drivers failing to probe.

It seems that very few drivers try to set their mask, so probably the
default value was already OK, even though less performant.

> As to why it helps performance, I'm not really sure. Perhaps it is
> causing some bounce buffers to be used.

That's also what I have been thinking about, and given this device only
has a 16-bit DDR bus, bounce buffers can make a difference.

> Rob
>
> [1] http://lists.xen.org/archives/html/xen-devel/2013-10/msg00092.html

Thanks for your quick explanation Rob!

Willy
Hi Eric,

Eric Dumazet <eric.dumazet@gmail.com> writes:

>> > I tested it on my RN2120 (2-core armada XP): I got no problem and the
>> > link saturated w/ apache, nginx and netperf. Good work!
>>
>> Great, thanks for your tests Arnaud. I forgot to mention that all my
>> tests this evening involved this patch as well.
>
> Now you might try to set a lower value
> for /proc/sys/net/ipv4/tcp_limit_output_bytes

On the RN2120, for a file served from /run/shm (for apache and nginx):

              Apache      nginx        netperf
  131072:   102 MB/s   112 MB/s   941.11 Mb/s
   65536:   102 MB/s   112 MB/s   935.97 Mb/s
   32768:   101 MB/s   105 MB/s   940.49 Mb/s
   16384:    94 MB/s    90 MB/s   770.07 Mb/s
    8192:    83 MB/s    66 MB/s   556.79 Mb/s

On the RN102, this time for apache and nginx, the file is served from
disks (ext4/lvm/raid1):

              Apache      nginx        netperf
  131072:    66 MB/s   105 MB/s   925.63 Mb/s
   65536:    59 MB/s   105 MB/s   862.55 Mb/s
   32768:    62 MB/s   105 MB/s   918.99 Mb/s
   16384:    65 MB/s   105 MB/s   927.71 Mb/s
    8192:    60 MB/s   104 MB/s   915.63 Mb/s

Values above are for a single flow, though.

Cheers,

a+
On 11/21/2013 02:55 PM, Arnaud Ebalard wrote:
> On the RN2120, for a file served from /run/shm (for apache and nginx):
>
>               Apache      nginx        netperf
>   131072:   102 MB/s   112 MB/s   941.11 Mb/s
>    65536:   102 MB/s   112 MB/s   935.97 Mb/s
>    32768:   101 MB/s   105 MB/s   940.49 Mb/s
>    16384:    94 MB/s    90 MB/s   770.07 Mb/s
>     8192:    83 MB/s    66 MB/s   556.79 Mb/s

If you want to make the units common across all three tests, netperf
accepts a global -f option to alter the output units. If you add -f M
netperf will then emit results in MB/s (M == 1048576). I'm assuming of
course that the MB/s of Apache and nginx are also M == 1048576.

happy benchmarking,

rick jones
diff --git a/drivers/net/ethernet/marvell/mvneta.c b/drivers/net/ethernet/marvell/mvneta.c
index b8e232b..6630690 100644
--- a/drivers/net/ethernet/marvell/mvneta.c
+++ b/drivers/net/ethernet/marvell/mvneta.c
@@ -101,16 +101,56 @@
 #define MVNETA_CPU_RXQ_ACCESS_ALL_MASK      0x000000ff
 #define MVNETA_CPU_TXQ_ACCESS_ALL_MASK      0x0000ff00
 #define MVNETA_RXQ_TIME_COAL_REG(q)         (0x2580 + ((q) << 2))
+
+/* Exception Interrupt Port/Queue Cause register */
+
 #define MVNETA_INTR_NEW_CAUSE               0x25a0
-#define MVNETA_RX_INTR_MASK(nr_rxqs)        (((1 << nr_rxqs) - 1) << 8)
 #define MVNETA_INTR_NEW_MASK                0x25a4
+
+/* bits  0..7  = TXQ SENT, one bit per queue.
+ * bits  8..15 = RXQ OCCUP, one bit per queue.
+ * bits 16..23 = RXQ FREE, one bit per queue.
+ * bit  29 = OLD_REG_SUM, see old reg ?
+ * bit  30 = TX_ERR_SUM, one bit for 4 ports
+ * bit  31 = MISC_SUM,   one bit for 4 ports
+ */
+#define MVNETA_TX_INTR_MASK(nr_txqs)        (((1 << nr_txqs) - 1) << 0)
+#define MVNETA_TX_INTR_MASK_ALL             (0xff << 0)
+#define MVNETA_RX_INTR_MASK(nr_rxqs)        (((1 << nr_rxqs) - 1) << 8)
+#define MVNETA_RX_INTR_MASK_ALL             (0xff << 8)
+
 #define MVNETA_INTR_OLD_CAUSE               0x25a8
 #define MVNETA_INTR_OLD_MASK                0x25ac
+
+/* Data Path Port/Queue Cause Register */
 #define MVNETA_INTR_MISC_CAUSE              0x25b0
 #define MVNETA_INTR_MISC_MASK               0x25b4
+
+#define MVNETA_CAUSE_PHY_STATUS_CHANGE      BIT(0)
+#define MVNETA_CAUSE_LINK_CHANGE            BIT(1)
+#define MVNETA_CAUSE_PTP                    BIT(4)
+
+#define MVNETA_CAUSE_INTERNAL_ADDR_ERR      BIT(7)
+#define MVNETA_CAUSE_RX_OVERRUN             BIT(8)
+#define MVNETA_CAUSE_RX_CRC_ERROR           BIT(9)
+#define MVNETA_CAUSE_RX_LARGE_PKT           BIT(10)
+#define MVNETA_CAUSE_TX_UNDERUN             BIT(11)
+#define MVNETA_CAUSE_PRBS_ERR               BIT(12)
+#define MVNETA_CAUSE_PSC_SYNC_CHANGE        BIT(13)
+#define MVNETA_CAUSE_SERDES_SYNC_ERR        BIT(14)
+
+#define MVNETA_CAUSE_BMU_ALLOC_ERR_SHIFT    16
+#define MVNETA_CAUSE_BMU_ALLOC_ERR_ALL_MASK (0xF << MVNETA_CAUSE_BMU_ALLOC_ERR_SHIFT)
+#define MVNETA_CAUSE_BMU_ALLOC_ERR_MASK(pool) (1 << (MVNETA_CAUSE_BMU_ALLOC_ERR_SHIFT + (pool)))
+
+#define MVNETA_CAUSE_TXQ_ERROR_SHIFT        24
+#define MVNETA_CAUSE_TXQ_ERROR_ALL_MASK     (0xFF << MVNETA_CAUSE_TXQ_ERROR_SHIFT)
+#define MVNETA_CAUSE_TXQ_ERROR_MASK(q)      (1 << (MVNETA_CAUSE_TXQ_ERROR_SHIFT + (q)))
+
 #define MVNETA_INTR_ENABLE                  0x25b8
 #define MVNETA_TXQ_INTR_ENABLE_ALL_MASK     0x0000ff00
-#define MVNETA_RXQ_INTR_ENABLE_ALL_MASK     0xff000000
+#define MVNETA_RXQ_INTR_ENABLE_ALL_MASK     0xff000000 // note: neta says it's 0x000000FF
+
 #define MVNETA_RXQ_CMD                      0x2680
 #define MVNETA_RXQ_DISABLE_SHIFT            8
 #define MVNETA_RXQ_ENABLE_MASK              0x000000ff
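To show how the TX cause bits introduced above would typically be consumed, here is a hypothetical illustration that is not part of the posted diff: mvreg_write(), rxq_number and txq_number are existing names in mvneta.c, while the wrapper function itself is made up.

    /* Hypothetical, not part of the posted diff: with the TX cause bits
     * defined above, enabling per-queue "TX done" interrupts alongside
     * the RX ones could look roughly like this.
     */
    static void mvneta_enable_rxtx_irqs_sketch(struct mvneta_port *pp)
    {
            mvreg_write(pp, MVNETA_INTR_NEW_MASK,
                        MVNETA_RX_INTR_MASK(rxq_number) |
                        MVNETA_TX_INTR_MASK(txq_number));
    }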