[v2] RDMA/mlx5 : Reclaim max 50K pages at once


Message ID

20240613121252.93315-1-anand.a.khoje@oracle.com (mailing list archive)

State

Changes Requested

Headers

From: Anand Khoje <anand.a.khoje@oracle.com>
To: linux-rdma@vger.kernel.org, linux-kernel@vger.kernel.org,
        netdev@vger.kernel.org
Cc: anand.a.khoje@oracle.com, rama.nichanamatlu@oracle.com,
        manjunath.b.patil@oracle.com
Subject: [PATCH v2] RDMA/mlx5 : Reclaim max 50K pages at once
Date: Thu, 13 Jun 2024 17:42:52 +0530
Message-ID: <20240613121252.93315-1-anand.a.khoje@oracle.com>
Precedence: bulk
MIME-Version: 1.0
Content-Type: text/plain; charset=UTF-8
Content-Transfer-Encoding: 8bit

Series

[v2] RDMA/mlx5 : Reclaim max 50K pages at once

Commit Message

Anand Khoje June 13, 2024, 12:12 p.m. UTC

In non-FLR context, CX-5 at times requests the release of ~8 million FW pages.
This needs a huge number of cmd mailboxes, which have to be released once
the pages are reclaimed. Releasing that many cmd mailboxes consumes CPU
time running into many seconds, which on non-preemptible kernels leads to
critical processes starving on that CPU's RQ.
To alleviate this, this change restricts the number of pages a worker
will try to reclaim to a maximum of 50K pages in one go.
The 50K limit is aligned with the current firmware capacity/limit of
releasing 50K pages at once per MLX5_CMD_OP_MANAGE_PAGES + MLX5_PAGES_TAKE
device command.

Our tests have shown a significant benefit from this change in terms of
the time consumed by dma_pool_free().
During a test in which the HCA raised an event to release 1.3 million
pages, the following observations were made:

- Without this change:
Number of mailbox messages allocated was around 20K, to accommodate
the DMA addresses of 1.3 million pages.
The average time spent by dma_pool_free() to free the DMA pool is between
16 usec and 32 usec.
           value  ------------- Distribution ------------- count
             256 |                                         0
             512 |@                                        287
            1024 |@@@                                      1332
            2048 |@                                        656
            4096 |@@@@@                                    2599
            8192 |@@@@@@@@@@                               4755
           16384 |@@@@@@@@@@@@@@@                          7545
           32768 |@@@@@                                    2501
           65536 |                                         0

- With this change:
Number of mailbox messages allocated was around 800; this was to
accommodate DMA addresses of only 50K pages.
The average time spent by dma_pool_free() to free the DMA pool in this case
lies between 1 usec and 2 usec.
           value  ------------- Distribution ------------- count
             256 |                                         0
             512 |@@@@@@@@@@@@@@@@@@                       346
            1024 |@@@@@@@@@@@@@@@@@@@@@@                   435
            2048 |                                         0
            4096 |                                         0
            8192 |                                         1
           16384 |                                         0

Signed-off-by: Anand Khoje <anand.a.khoje@oracle.com>
---
Changes in v2:
 - In v1, the CPU was yielded if more than 2 msec were spent in
   mlx5_free_cmd_msg(). This version takes a different approach to
   limiting the time spent.
---
 drivers/net/ethernet/mellanox/mlx5/core/pagealloc.c | 4 ++++
 1 file changed, 4 insertions(+)
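
The mailbox counts quoted in the commit message can be roughly reproduced with a small calculation. The sketch below is only an estimate under stated assumptions: it assumes each mailbox carries 512 bytes of payload and each page address occupies 8 bytes (64 addresses per mailbox). Neither figure is stated in the patch, and the macro and function names are hypothetical.

```c
#include <assert.h>

/* Assumptions (not stated in the patch): 512 bytes of payload per
 * mailbox and one 64-bit address per page. */
#define ASSUMED_MBOX_DATA_BYTES 512L
#define ASSUMED_ADDR_BYTES      8L

/* Estimate how many mailboxes are needed to carry the addresses of
 * npages pages, rounding up to a whole mailbox. */
long mailboxes_for_pages(long npages)
{
	long per_mbox = ASSUMED_MBOX_DATA_BYTES / ASSUMED_ADDR_BYTES;

	assert(npages >= 0);
	return (npages + per_mbox - 1) / per_mbox;
}
```

Under these assumptions, mailboxes_for_pages(1300000) gives 20313 and mailboxes_for_pages(50000) gives 782, which is in line with the "around 20K" and "around 800" figures reported above.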

Comments

Leon Romanovsky June 13, 2024, 7:03 p.m. UTC | #1

On Thu, Jun 13, 2024 at 05:42:52PM +0530, Anand Khoje wrote:
> diff --git a/drivers/net/ethernet/mellanox/mlx5/core/pagealloc.c b/drivers/net/ethernet/mellanox/mlx5/core/pagealloc.c
> index 1b38397..b1cf97d 100644
> --- a/drivers/net/ethernet/mellanox/mlx5/core/pagealloc.c
> +++ b/drivers/net/ethernet/mellanox/mlx5/core/pagealloc.c
> @@ -482,12 +482,16 @@ static int reclaim_pages(struct mlx5_core_dev *dev, u32 func_id, int npages,
>  	return err;
>  }
>  
> +#define MAX_RECLAIM_NPAGES -50000
>  static void pages_work_handler(struct work_struct *work)
>  {
>  	struct mlx5_pages_req *req = container_of(work, struct mlx5_pages_req, work);
>  	struct mlx5_core_dev *dev = req->dev;
>  	int err = 0;
>  
> +	if (req->npages < MAX_RECLAIM_NPAGES)
> +		req->npages = MAX_RECLAIM_NPAGES;

I like this change more than previous variant with yield.
Regarding the patch:
1. Please limit the number of pages in req_pages_handler() and not in pages_work_handler().
2. Patch title should be "net/mlx5: Reclaim max 50K pages at once" and not "RDMA...".
3. You should run get_maintainer.pl script to find the right maintainers and add them to the TO or CC list.

And I still think that you will get better performance by parallelizing the reclaim process.

Thanks

> +
>  	if (req->release_all)
>  		release_all_pages(dev, req->func_id, req->ec_function);
>  	else if (req->npages < 0)
> -- 
> 1.8.3.1
> 
>
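
For illustration, review point 1 (apply the cap where the request is created rather than in the work handler) could look roughly like the sketch below. This is not the driver code: the struct, the function and the macro are hypothetical stand-ins, and the only behaviour taken from the patch is that a negative npages means "reclaim" and that the cap is 50K pages.

```c
#include <assert.h>

/* Hypothetical cap, written as a positive, parenthesised count. */
#define MAX_RECLAIM_NPAGES_SKETCH (50000)

/* Hypothetical stand-in for the request object; only the field that
 * matters for this sketch is modelled. */
struct pages_req_sketch {
	int npages;	/* < 0 means the device wants pages reclaimed */
};

/* Build a request from the page count carried by the event and cap
 * the reclaim size once, at creation time, so the worker never sees
 * more than 50K pages to reclaim. */
struct pages_req_sketch make_pages_req(int event_npages)
{
	struct pages_req_sketch req;

	req.npages = event_npages;
	if (req.npages < -MAX_RECLAIM_NPAGES_SKETCH)
		req.npages = -MAX_RECLAIM_NPAGES_SKETCH;

	/* Postcondition: a reclaim request never exceeds the cap. */
	assert(req.npages >= -MAX_RECLAIM_NPAGES_SKETCH);
	return req;
}
```

Keeping the cap as a positive, parenthesised constant and negating it at the comparison avoids the precedence surprises that an unparenthesised negative macro can cause when it is used inside a larger expression.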

Anand Khoje June 14, 2024, 5:11 a.m. UTC | #2

On 6/14/24 00:33, Leon Romanovsky wrote:
> On Thu, Jun 13, 2024 at 05:42:52PM +0530, Anand Khoje wrote:
>> diff --git a/drivers/net/ethernet/mellanox/mlx5/core/pagealloc.c b/drivers/net/ethernet/mellanox/mlx5/core/pagealloc.c
>> index 1b38397..b1cf97d 100644
>> --- a/drivers/net/ethernet/mellanox/mlx5/core/pagealloc.c
>> +++ b/drivers/net/ethernet/mellanox/mlx5/core/pagealloc.c
>> @@ -482,12 +482,16 @@ static int reclaim_pages(struct mlx5_core_dev *dev, u32 func_id, int npages,
>>   	return err;
>>   }
>>   
>> +#define MAX_RECLAIM_NPAGES -50000
>>   static void pages_work_handler(struct work_struct *work)
>>   {
>>   	struct mlx5_pages_req *req = container_of(work, struct mlx5_pages_req, work);
>>   	struct mlx5_core_dev *dev = req->dev;
>>   	int err = 0;
>>   
>> +	if (req->npages < MAX_RECLAIM_NPAGES)
>> +		req->npages = MAX_RECLAIM_NPAGES;
> I like this change more than previous variant with yield.
> Regarding the patch:
> 1. Please limit the number of pages in req_pages_handler() and not in pages_work_handler().
> 2. Patch title should be "net/mlx5: Reclaim max 50K pages at once" and not "RDMA...".
> 3. You should run get_maintainer.pl script to find the right maintainers and add them to the TO or CC list.
>
> And I still think that you will get better performance by parallelizing the reclaim process.
>
> Thanks

Hi Leon,

Thanks for the comments. I will make the changes and resend a v3.

-Anand

>> +
>>   	if (req->release_all)
>>   		release_all_pages(dev, req->func_id, req->ec_function);
>>   	else if (req->npages < 0)
>> -- 
>> 1.8.3.1
>>
>>

diff --git a/drivers/net/ethernet/mellanox/mlx5/core/pagealloc.c b/drivers/net/ethernet/mellanox/mlx5/core/pagealloc.c
index 1b38397..b1cf97d 100644
--- a/drivers/net/ethernet/mellanox/mlx5/core/pagealloc.c
+++ b/drivers/net/ethernet/mellanox/mlx5/core/pagealloc.c
@@ -482,12 +482,16 @@  static int reclaim_pages(struct mlx5_core_dev *dev, u32 func_id, int npages,
 	return err;
 }
 
+#define MAX_RECLAIM_NPAGES -50000
 static void pages_work_handler(struct work_struct *work)
 {
 	struct mlx5_pages_req *req = container_of(work, struct mlx5_pages_req, work);
 	struct mlx5_core_dev *dev = req->dev;
 	int err = 0;
 
+	if (req->npages < MAX_RECLAIM_NPAGES)
+		req->npages = MAX_RECLAIM_NPAGES;
+
 	if (req->release_all)
 		release_all_pages(dev, req->func_id, req->ec_function);
 	else if (req->npages < 0)
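
To get a feel for what the cap means for a large request, the sketch below counts how many capped rounds it would take to return a given number of pages. It assumes, which the patch does not confirm, that the device keeps raising requests until all pages have been returned. The names are hypothetical.

```c
#include <assert.h>

/* Cap per round, taken from the 50K limit in the patch. */
#define CHUNK_PAGES 50000L

/* Count the rounds needed to reclaim `total` pages when each round
 * handles at most CHUNK_PAGES. Assumes the remaining pages are
 * requested again in later rounds. */
long reclaim_rounds(long total)
{
	long rounds = 0;

	assert(total >= 0);
	while (total > 0) {
		long chunk = total < CHUNK_PAGES ? total : CHUNK_PAGES;

		total -= chunk;
		rounds++;
	}
	return rounds;
}
```

With these numbers, the ~8 million page case from the commit message becomes 160 rounds and the 1.3 million page test becomes 26 rounds, each one freeing a much smaller set of mailboxes than a single uncapped pass.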