diff mbox

[3/6] net: thunderx: Increase transmit queue length

Message ID 1448961223-41888-4-git-send-email-sunil.kovvuri@gmail.com (mailing list archive)
State New, archived
Headers show

Commit Message

Sunil Kovvuri Dec. 1, 2015, 9:13 a.m. UTC
From: Sunil Goutham <sgoutham@cavium.com>

Under high transmit rates and with TSO enabled observing fluctuations
in TX performance. Seen especially with iperf3 application.

Since TSO is taken care at driver level, with 64KB of TSO packets
and when window size is also high the rate at which CPU fills in
transmit descriptors is much higher than what HW is able to process.
Each 64KB TSO packet occupies gets segmented to ~43 1500 byte MTU packets
and occupies ~130 descriptors. Hence increasing transmit queue size.

Signed-off-by: Sunil Goutham <sgoutham@cavium.com>
---
 drivers/net/ethernet/cavium/thunder/nicvf_queues.h |    4 ++--
 1 files changed, 2 insertions(+), 2 deletions(-)

Comments

Pavel Fedin Dec. 1, 2015, 2:40 p.m. UTC | #1
Hello!

 This one breaks networking on my machine:
--- cut ---
swiotlb: coherent allocation failed for device 0002:01:08.4 size=4198400
CPU: 2 PID: 3655 Comm: NetworkManager Tainted: G        W  O    4.2.6+ #201
Hardware name: Cavium ThunderX CN88XX board (DT)
Call trace:
[<ffffffc00008a3d4>] dump_backtrace+0x0/0x124
[<ffffffc00008a50c>] show_stack+0x14/0x1c
[<ffffffc0006df258>] dump_stack+0x88/0xc8
[<ffffffc000416994>] swiotlb_alloc_coherent+0x150/0x168
[<ffffffc00009555c>] __dma_alloc+0x58/0x1e8
[<ffffffc000507cd8>] nicvf_alloc_q_desc_mem.isra.14+0xc0/0xdc
[<ffffffc000508c8c>] nicvf_config_data_transfer+0x448/0x728
[<ffffffc000506c64>] nicvf_open+0x184/0x994
[<ffffffc0005ea0b4>] __dev_open+0xb4/0x120
[<ffffffc0005ea388>] __dev_change_flags+0x8c/0x150
[<ffffffc0005ea46c>] dev_change_flags+0x20/0x5c
[<ffffffc0005fb210>] do_setlink+0x27c/0x8e0
[<ffffffc0005fbe28>] rtnl_newlink+0x4b4/0x6c4
[<ffffffc0005fa6f0>] rtnetlink_rcv_msg+0x90/0x22c
[<ffffffc000611b74>] netlink_rcv_skb+0xd8/0x100
[<ffffffc0005fa650>] rtnetlink_rcv+0x30/0x40
[<ffffffc000611498>] netlink_unicast+0x158/0x1f8
[<ffffffc000611910>] netlink_sendmsg+0x324/0x380
[<ffffffc0005c6690>] sock_sendmsg+0x18/0x58
[<ffffffc0005c6db8>] ___sys_sendmsg+0x254/0x264
[<ffffffc0005c7bb8>] __sys_sendmsg+0x40/0x84
[<ffffffc0005c7c0c>] SyS_sendmsg+0x10/0x20
thunder-nicvf 0002:01:08.4 enP2p1s8f4: Failed to alloc/config VF's QSet resources
--- cut ---

 And none of interfaces work after this. Reverting this patch helps.

Kind regards,
Pavel Fedin
Expert Engineer
Samsung Electronics Research center Russia


> -----Original Message-----
> From: netdev-owner@vger.kernel.org [mailto:netdev-owner@vger.kernel.org] On Behalf Of Sunil
> Goutham
> Sent: Tuesday, December 01, 2015 12:14 PM
> To: netdev@vger.kernel.org
> Cc: linux-kernel@vger.kernel.org; linux-arm-kernel@lists.infradead.org;
> Sunil.Goutham@caviumnetworks.com; Sunil Goutham
> Subject: [PATCH 3/6] net: thunderx: Increase transmit queue length
> 
> From: Sunil Goutham <sgoutham@cavium.com>
> 
> Under high transmit rates and with TSO enabled observing fluctuations
> in TX performance. Seen especially with iperf3 application.
> 
> Since TSO is taken care at driver level, with 64KB of TSO packets
> and when window size is also high the rate at which CPU fills in
> transmit descriptors is much higher than what HW is able to process.
> Each 64KB TSO packet occupies gets segmented to ~43 1500 byte MTU packets
> and occupies ~130 descriptors. Hence increasing transmit queue size.
> 
> Signed-off-by: Sunil Goutham <sgoutham@cavium.com>
> ---
>  drivers/net/ethernet/cavium/thunder/nicvf_queues.h |    4 ++--
>  1 files changed, 2 insertions(+), 2 deletions(-)
> 
> diff --git a/drivers/net/ethernet/cavium/thunder/nicvf_queues.h
> b/drivers/net/ethernet/cavium/thunder/nicvf_queues.h
> index fb4957d..b1e93a9 100644
> --- a/drivers/net/ethernet/cavium/thunder/nicvf_queues.h
> +++ b/drivers/net/ethernet/cavium/thunder/nicvf_queues.h
> @@ -62,7 +62,7 @@
>  #define SND_QUEUE_CNT		8
>  #define CMP_QUEUE_CNT		8 /* Max of RCV and SND qcount */
> 
> -#define SND_QSIZE		SND_QUEUE_SIZE2
> +#define SND_QSIZE		SND_QUEUE_SIZE3
>  #define SND_QUEUE_LEN		(1ULL << (SND_QSIZE + 10))
>  #define MAX_SND_QUEUE_LEN	(1ULL << (SND_QUEUE_SIZE6 + 10))
>  #define SND_QUEUE_THRESH	2ULL
> @@ -73,7 +73,7 @@
>  /* Keep CQ and SQ sizes same, if timestamping
>   * is enabled this equation will change.
>   */
> -#define CMP_QSIZE		CMP_QUEUE_SIZE2
> +#define CMP_QSIZE		CMP_QUEUE_SIZE3
>  #define CMP_QUEUE_LEN		(1ULL << (CMP_QSIZE + 10))
>  #define CMP_QUEUE_CQE_THRESH	0
>  #define CMP_QUEUE_TIMER_THRESH	220 /* 10usec */
> --
> 1.7.1
> 
> --
> To unsubscribe from this list: send the line "unsubscribe netdev" in
> the body of a message to majordomo@vger.kernel.org
> More majordomo info at  http://vger.kernel.org/majordomo-info.html
Eric Dumazet Dec. 1, 2015, 3:33 p.m. UTC | #2
On Tue, 2015-12-01 at 17:40 +0300, Pavel Fedin wrote:
>  Hello!
> 
>  This one breaks networking on my machine:
> --- cut ---
> swiotlb: coherent allocation failed for device 0002:01:08.4 size=4198400
> CPU: 2 PID: 3655 Comm: NetworkManager Tainted: G        W  O    4.2.6+ #201
> Hardware name: Cavium ThunderX CN88XX

Are you sure 4.2.6 kernel is suitable for backporting this patch aimed
for linux-4.5 ?
Sunil Kovvuri Dec. 1, 2015, 4:30 p.m. UTC | #3
Hi Pavel Fedin,

Increasing descriptor ring size will lead to more memory allocation.
And what you are seeing is a memory alloc failure and doesn't seem to
be due to this driver change.
I mean it looks like the behavior will be same for other drivers as well.

Probably you might have to set "coherent_pool" size in bootargs to a
higher value.
Can you please check.

Thanks,
Sunil.

On Tue, Dec 1, 2015 at 9:03 PM, Eric Dumazet <eric.dumazet@gmail.com> wrote:
> On Tue, 2015-12-01 at 17:40 +0300, Pavel Fedin wrote:
>>  Hello!
>>
>>  This one breaks networking on my machine:
>> --- cut ---
>> swiotlb: coherent allocation failed for device 0002:01:08.4 size=4198400
>> CPU: 2 PID: 3655 Comm: NetworkManager Tainted: G        W  O    4.2.6+ #201
>> Hardware name: Cavium ThunderX CN88XX
>
> Are you sure 4.2.6 kernel is suitable for backporting this patch aimed
> for linux-4.5 ?
>
>
>
David Miller Dec. 1, 2015, 7:30 p.m. UTC | #4
From: Sunil Kovvuri <sunil.kovvuri@gmail.com>
Date: Tue, 1 Dec 2015 22:00:49 +0530

> Increasing descriptor ring size will lead to more memory allocation.
> And what you are seeing is a memory alloc failure and doesn't seem
> to be due to this driver change.  I mean it looks like the behavior
> will be same for other drivers as well.

The driver should successfully recover from out of memory situations
and not stop RX/TX completely.

There might be some other problem with your driver causing this.

Don't put this off as not "related" to your patch, it is as this
introduces the behavior for this user, and you should fix it before
expecting me to apply this patch series.
Sunil Kovvuri Dec. 2, 2015, 5:48 a.m. UTC | #5
>The driver should successfully recover from out of memory situations
> and not stop RX/TX completely.
This memory allocation is while interface bringup/initialization and not during
packet I/O.

>Don't put this off as not "related" to your patch, it is as this
>introduces the behavior for this user, and you should fix it before
>expecting me to apply this patch series.
I would disagree on this, as this patch hasn't introduced any failure here,
if this user has connected any device which asks for a bit large amount
of coherent memory then i am sure he will see the same issue.
And above i have suggested what could be done to not see this issue.

But anyway for the time being I will resubmit the patch series without this
patch, hope that would be okay.

On Wed, Dec 2, 2015 at 1:00 AM, David Miller <davem@davemloft.net> wrote:
> From: Sunil Kovvuri <sunil.kovvuri@gmail.com>
> Date: Tue, 1 Dec 2015 22:00:49 +0530
>
>> Increasing descriptor ring size will lead to more memory allocation.
>> And what you are seeing is a memory alloc failure and doesn't seem
>> to be due to this driver change.  I mean it looks like the behavior
>> will be same for other drivers as well.
>
> The driver should successfully recover from out of memory situations
> and not stop RX/TX completely.
>
> There might be some other problem with your driver causing this.
>
> Don't put this off as not "related" to your patch, it is as this
> introduces the behavior for this user, and you should fix it before
> expecting me to apply this patch series.
Pavel Fedin Dec. 2, 2015, 8:09 a.m. UTC | #6
Hello!

> > swiotlb: coherent allocation failed for device 0002:01:08.4 size=4198400
> > CPU: 2 PID: 3655 Comm: NetworkManager Tainted: G        W  O    4.2.6+ #201
> > Hardware name: Cavium ThunderX CN88XX
> 
> Are you sure 4.2.6 kernel is suitable for backporting this patch aimed
> for linux-4.5 ?

 Why not? It's just a networking driver. And, i also have the same problem on 4.3 running as KVM guest with VFIO.

Kind regards,
Pavel Fedin
Expert Engineer
Samsung Electronics Research center Russia
Pavel Fedin Dec. 2, 2015, 9:05 a.m. UTC | #7
Hello!

> Probably you might have to set "coherent_pool" size in bootargs to a
> higher value.
> Can you please check.

 I have tried to do this. I was able to enlarge the pool up to 4MB, and still got allocation failures. At 8MB pool preallocation stops working:
--- cut ---
Call trace:
[<ffffffc00012ddb8>] __alloc_pages_nodemask+0x4f4/0x7d4
[<ffffffc0007be370>] atomic_pool_init+0x60/0x1a4
[<ffffffc0007be4d4>] arm64_dma_init+0x20/0x28
[<ffffffc000082848>] do_one_initcall+0x8c/0x1a4
[<ffffffc0007baac0>] kernel_init_freeable+0x154/0x1f4
[<ffffffc0005c2b14>] kernel_init+0x10/0xd8
DMA: failed to allocate 8192 KiB pool for atomic coherent allocation
--- cut ---
 and i get even worse faults in the driver.

 I know that it is possible to allocate larger pools by setting CONFIG_FORCE_MAX_ZONEORDER, but:
a) This is done on per-platform basis. For ThunderX we used to have a patch (http://www.spinics.net/lists/arm-kernel/msg415457.html), which never made it upstream, because vGIC fixes stopped requiring it at some point. And also we may want to use the nicvf driver not only on actual hardware, but also inside virtual machine in KVM. So do we need to set CONFIG_FORCE_MAX_ZONEORDER for virt too? And what if at some point qemu emulates not only "virt", but some other machine (let's say AMD X-Gene), and we run it on ThunderX and want to use nicvf with this model?
b) IMHO it's not good to have a driver which simply does not work without some obscure option in boot arguments.

 So, i see several possible ways to solve this:

1. Introduce some mechanism which would allow the driver to tell the kernel that it needs coherent pool of large size. Can be problematic because the driver can be a module, and pool allocation happens early.
2. Can we use some other method for allocating queues, which would not require such a huge coherent pool?
3. The driver could check value of atomic_pool_size and adjust own memory requirements accordingly. This indeed looks like a quick hack, but would at least make things running quickly.

Kind regards,
Pavel Fedin
Expert Engineer
Samsung Electronics Research center Russia
Pavel Fedin Dec. 2, 2015, 10:31 a.m. UTC | #8
Hello!

> > Probably you might have to set "coherent_pool" size in bootargs to a
> > higher value.
> > Can you please check.
> 
>  I have tried to do this. I was able to enlarge the pool up to 4MB, and still got allocation
> failures. At 8MB pool preallocation stops working:
> --- cut ---
> Call trace:
> [<ffffffc00012ddb8>] __alloc_pages_nodemask+0x4f4/0x7d4
> [<ffffffc0007be370>] atomic_pool_init+0x60/0x1a4
> [<ffffffc0007be4d4>] arm64_dma_init+0x20/0x28
> [<ffffffc000082848>] do_one_initcall+0x8c/0x1a4
> [<ffffffc0007baac0>] kernel_init_freeable+0x154/0x1f4
> [<ffffffc0005c2b14>] kernel_init+0x10/0xd8
> DMA: failed to allocate 8192 KiB pool for atomic coherent allocation
> --- cut ---
>  and i get even worse faults in the driver.
> 
>  I know that it is possible to allocate larger pools by setting CONFIG_FORCE_MAX_ZONEORDER,
> but:
> a) This is done on per-platform basis. For ThunderX we used to have a patch
> (http://www.spinics.net/lists/arm-kernel/msg415457.html), which never made it upstream,
> because vGIC fixes stopped requiring it at some point. And also we may want to use the nicvf
> driver not only on actual hardware, but also inside virtual machine in KVM. So do we need to
> set CONFIG_FORCE_MAX_ZONEORDER for virt too? And what if at some point qemu emulates not only
> "virt", but some other machine (let's say AMD X-Gene), and we run it on ThunderX and want to
> use nicvf with this model?
> b) IMHO it's not good to have a driver which simply does not work without some obscure option
> in boot arguments.
> 
>  So, i see several possible ways to solve this:
> 
> 1. Introduce some mechanism which would allow the driver to tell the kernel that it needs
> coherent pool of large size. Can be problematic because the driver can be a module, and pool
> allocation happens early.
> 2. Can we use some other method for allocating queues, which would not require such a huge
> coherent pool?
> 3. The driver could check value of atomic_pool_size and adjust own memory requirements
> accordingly. This indeed looks like a quick hack, but would at least make things running
> quickly.

 I have also noticed that CONFIG_DMA_CMA is turned off in my kernel. I guess it was a leftover from old defconfig, because i carry over my .config from version to version. I enabled it and rebuilt the kernel, but in order to get the driver working with this patch i had to also add cma=32M option to kernel arguments. With default of 16M the allocation still fails.
 Should we add Kconfig dependencies?

Kind regards,
Pavel Fedin
Expert Engineer
Samsung Electronics Research center Russia
Pavel Fedin Dec. 2, 2015, 12:29 p.m. UTC | #9
Hello!

> >  So, i see several possible ways to solve this:
> >
> > 1. Introduce some mechanism which would allow the driver to tell the kernel that it needs
> > coherent pool of large size. Can be problematic because the driver can be a module, and pool
> > allocation happens early.
> > 2. Can we use some other method for allocating queues, which would not require such a huge
> > coherent pool?
> > 3. The driver could check value of atomic_pool_size and adjust own memory requirements
> > accordingly. This indeed looks like a quick hack, but would at least make things running
> > quickly.
> 
>  I have also noticed that CONFIG_DMA_CMA is turned off in my kernel. I guess it was a leftover
> from old defconfig, because i carry over my .config from version to version. I enabled it and
> rebuilt the kernel, but in order to get the driver working with this patch i had to also add
> cma=32M option to kernel arguments. With default of 16M the allocation still fails.
>  Should we add Kconfig dependencies?

 After getting it working in guest i tried to apply it to host. With total of 128 virtual functions (= 128 interfaces) it does not work at all. Even after bumping cma region size to insane value of 2GB more than half of interfaces still failed to allocate queues. And after setting cma=3G i could not mount my rootfs.
 So, absolute NAK, unfortunately.

Kind regards,
Pavel Fedin
Expert Engineer
Samsung Electronics Research center Russia
Sunil Kovvuri Dec. 2, 2015, 12:57 p.m. UTC | #10
>After getting it working in guest i tried to apply it to host. With total of 128 virtual functions (= 128 interfaces) it does not work at all.
> Even after bumping cma region size to insane value of 2GB more than half of interfaces still failed to allocate queues.
> And after setting cma=3G i could not mount my rootfs.

Here what you are saying is half of the interfaces were initialized
succesfully and rest didn't.
So this issue is not something which is introduced by this patch.

> Can we use some other method for allocating queues, which would not require such a huge coherent pool?
There are so many drivers which use dma_alloc_coherent() directly or
via pci_alloc_consistent() to
allocate memory. How many drivers should we modify.
Pavel Fedin Dec. 2, 2015, 1:22 p.m. UTC | #11
Hello!

> >After getting it working in guest i tried to apply it to host. With total of 128 virtual
> functions (= 128 interfaces) it does not work at all.
> > Even after bumping cma region size to insane value of 2GB more than half of interfaces still
> failed to allocate queues.
> > And after setting cma=3G i could not mount my rootfs.
> 
> Here what you are saying is half of the interfaces were initialized
> succesfully and rest didn't.

 After setting cma=2G. With default setting of 16M none of them initialized.

> So this issue is not something which is introduced by this patch.

 Before this patch all my interfaces were working.
 I would say the problem with your patch is that it introduces memory requirements which cannot be satisfied by the platform. It's combination of several factors which stops the thing from working, not a single factor. Using dma_alloc_coherent() is not all wrong by itself, of course.
 Perhaps you did some tricks with your configuration, which make it working. Then, i guess, you should have at least described them in commit message of your patch. Or describe all dependencies in KConfig of your driver, which is better. Or, if the platform needs some very special defconfig, add it to arch/arm64/configs (however, i guess, the goal of ARM64 Linux is to run on all possible hardware, so this would not be good from maintainers' POV).

 Sorry, but this is all i can say. In previous messages i have already suggested several ways to solve the problem (too lazy to quite here, 4 IIRC), or you can suggest your own one and let us test it, or you can even stick to "It works for me, i am the only right guy in the world, and i don't care if it doesn't work for you" position and let David decide who of us is right (and he already did that once).
 Basically, here is what i did: i took kernel 4.2, added ThunderX PCI drivers to it (they were posted but NAKed those days back, there's some lazy progress on them currently), added necessary errata patches (also posted on lists, all merged into 4.4), took defconfig, adjusted it according to my needs, and this is what i'm running on my board and this is what i'm using for development. If you point me at what i'm doing wrong way, i'll be glad to accept this.
 I'm over.

Kind regards,
Pavel Fedin
Expert Engineer
Samsung Electronics Research center Russia
Eric Dumazet Dec. 2, 2015, 1:25 p.m. UTC | #12
On Wed, 2015-12-02 at 11:18 +0530, Sunil Kovvuri wrote:
> >The driver should successfully recover from out of memory situations
> > and not stop RX/TX completely.
> This memory allocation is while interface bringup/initialization and not during
> packet I/O.
> 
> >Don't put this off as not "related" to your patch, it is as this
> >introduces the behavior for this user, and you should fix it before
> >expecting me to apply this patch series.
> I would disagree on this, as this patch hasn't introduced any failure here,
> if this user has connected any device which asks for a bit large amount
> of coherent memory then i am sure he will see the same issue.
> And above i have suggested what could be done to not see this issue.


This is unacceptable.

Maybe you did not complete tests. changelog has no 'Tested:' section.

You can not claim this patch was good, especially considering no precise
numbers were given.

If the performance increase is 4 %, then surely using twice more memory
is not worth it.

RX/TX ring buffer sizes should be :

- Default to reasonable sizes (ie not gigantic memory usage for typical
use). For multiqueue devices, one also has to take into account the
number of queues: 

  If a 10Gbit NIC has 128 queues, then probably having 8192 slots
  per queue is too much, if this maps to 512 MB of memory !

- ethtool -G support to let the admin change the conservative settings
for the exceptional workloads.
Sunil Kovvuri Dec. 2, 2015, 4:50 p.m. UTC | #13
>
> If the performance increase is 4 %, then surely using twice more memory
> is not worth it.
I haven't mentioned anywhere that i am seeing 4% increase in performance.
Anyways as said already i will recheck and resubmit.
Eric Dumazet Dec. 2, 2015, 4:59 p.m. UTC | #14
On Wed, 2015-12-02 at 22:20 +0530, Sunil Kovvuri wrote:
> >
> > If the performance increase is 4 %, then surely using twice more memory
> > is not worth it.
> I haven't mentioned anywhere that i am seeing 4% increase in performance.

Yes, this is the complain I made :

No numbers, just a patch increasing ring size by a 100% factor !

I really hope you'll have very convincing numbers, especially if your
driver does not implement BQL.
David Miller Dec. 2, 2015, 5:31 p.m. UTC | #15
From: Sunil Kovvuri <sunil.kovvuri@gmail.com>
Date: Wed, 2 Dec 2015 11:18:43 +0530

>>The driver should successfully recover from out of memory situations
>> and not stop RX/TX completely.
> This memory allocation is while interface bringup/initialization and not during
> packet I/O.
> 
>>Don't put this off as not "related" to your patch, it is as this
>>introduces the behavior for this user, and you should fix it before
>>expecting me to apply this patch series.
> I would disagree on this, as this patch hasn't introduced any failure here,
> if this user has connected any device which asks for a bit large amount
> of coherent memory then i am sure he will see the same issue.

It's not the memory allocation that's the problem.

It's the the device completely dies and does not recover even when
memory does become available later.

That is a hard regression which this change introduces.
diff mbox

Patch

diff --git a/drivers/net/ethernet/cavium/thunder/nicvf_queues.h b/drivers/net/ethernet/cavium/thunder/nicvf_queues.h
index fb4957d..b1e93a9 100644
--- a/drivers/net/ethernet/cavium/thunder/nicvf_queues.h
+++ b/drivers/net/ethernet/cavium/thunder/nicvf_queues.h
@@ -62,7 +62,7 @@ 
 #define SND_QUEUE_CNT		8
 #define CMP_QUEUE_CNT		8 /* Max of RCV and SND qcount */
 
-#define SND_QSIZE		SND_QUEUE_SIZE2
+#define SND_QSIZE		SND_QUEUE_SIZE3
 #define SND_QUEUE_LEN		(1ULL << (SND_QSIZE + 10))
 #define MAX_SND_QUEUE_LEN	(1ULL << (SND_QUEUE_SIZE6 + 10))
 #define SND_QUEUE_THRESH	2ULL
@@ -73,7 +73,7 @@ 
 /* Keep CQ and SQ sizes same, if timestamping
  * is enabled this equation will change.
  */
-#define CMP_QSIZE		CMP_QUEUE_SIZE2
+#define CMP_QSIZE		CMP_QUEUE_SIZE3
 #define CMP_QUEUE_LEN		(1ULL << (CMP_QSIZE + 10))
 #define CMP_QUEUE_CQE_THRESH	0
 #define CMP_QUEUE_TIMER_THRESH	220 /* 10usec */