
[net-next,v3] mlx5/core: Schedule EQ comp tasklet only if necessary

Message ID 20241031163436.3732948-1-csander@purestorage.com (mailing list archive)
State Superseded
Series [net-next,v3] mlx5/core: Schedule EQ comp tasklet only if necessary

Commit Message

Caleb Sander Oct. 31, 2024, 4:34 p.m. UTC
Currently, the mlx5_eq_comp_int() interrupt handler schedules a tasklet
to call mlx5_cq_tasklet_cb() if it processes any completions. For CQs
whose completions don't need to be processed in tasklet context, this
adds unnecessary overhead. In a heavy TCP workload, we see 4% of CPU
time spent on the tasklet_trylock() in tasklet_action_common(), with a
smaller amount spent on the atomic operations in tasklet_schedule(),
tasklet_clear_sched(), and locking the spinlock in mlx5_cq_tasklet_cb().
TCP completions are handled by mlx5e_completion_event(), which schedules
NAPI to poll the queue, so they don't need tasklet processing.

Schedule the tasklet in mlx5_add_cq_to_tasklet() instead to avoid this
overhead. mlx5_add_cq_to_tasklet() is responsible for enqueuing the CQs
to be processed in tasklet context, so it can schedule the tasklet. CQs
that need tasklet processing have their interrupt comp handler set to
mlx5_add_cq_to_tasklet(), so they will schedule the tasklet. CQs that
don't need tasklet processing won't schedule the tasklet. To avoid
scheduling the tasklet multiple times during the same interrupt, only
schedule the tasklet in mlx5_add_cq_to_tasklet() if the tasklet work
queue was empty before the new CQ was pushed to it.

The additional branch in mlx5_add_cq_to_tasklet(), called for each EQE,
may add a small cost for the userspace Infiniband CQs whose completions
are processed in tasklet context. But this seems worth it to avoid the
tasklet overhead for CQs that don't need it.

Note that the mlx4 driver works the same way: it schedules the tasklet
in mlx4_add_cq_to_tasklet() and only if the work queue was empty before.

Signed-off-by: Caleb Sander Mateos <csander@purestorage.com>
Reviewed-by: Parav Pandit <parav@nvidia.com>
---
v3: revise commit message
v2: reorder variable declarations, describe CPU profile results

 drivers/net/ethernet/mellanox/mlx5/core/cq.c | 5 +++++
 drivers/net/ethernet/mellanox/mlx5/core/eq.c | 5 +----
 2 files changed, 6 insertions(+), 4 deletions(-)

Comments

Paolo Abeni Nov. 5, 2024, 12:21 p.m. UTC | #1
On 10/31/24 17:34, Caleb Sander Mateos wrote:
> Currently, the mlx5_eq_comp_int() interrupt handler schedules a tasklet
> to call mlx5_cq_tasklet_cb() if it processes any completions. [...]
> 
> Signed-off-by: Caleb Sander Mateos <csander@purestorage.com>
> Reviewed-by: Parav Pandit <parav@nvidia.com>

@Saeed, @Leon, @Tariq: I assume you will apply this one and include in
the next mlx5 PR. please correct me if I'm wrong.

Thanks,

Paolo
Tariq Toukan Nov. 5, 2024, 5:28 p.m. UTC | #2
On 05/11/2024 14:21, Paolo Abeni wrote:
> On 10/31/24 17:34, Caleb Sander Mateos wrote:
>> Currently, the mlx5_eq_comp_int() interrupt handler schedules a tasklet
>> to call mlx5_cq_tasklet_cb() if it processes any completions. [...]
> 
> @Saeed, @Leon, @Tariq: I assume you will apply this one and include in
> the next mlx5 PR. please correct me if I'm wrong.
> 

Hi Paolo,

I am doing the mlx5 core/en maintainer work right now. I work 
differently from Saeed, as I do not have a kernel.org branch of my own 
(nor do I use Saeed's, of course).

Please consider applying this one, and any others, once I acknowledge/review them.

Acked-by: Tariq Toukan <tariqt@nvidia.com>

Regards,
Tariq
Saeed Mahameed Nov. 5, 2024, 6:56 p.m. UTC | #3
On 31 Oct 10:34, Caleb Sander Mateos wrote:
>Currently, the mlx5_eq_comp_int() interrupt handler schedules a tasklet
>to call mlx5_cq_tasklet_cb() if it processes any completions. [...]
>
>diff --git a/drivers/net/ethernet/mellanox/mlx5/core/cq.c b/drivers/net/ethernet/mellanox/mlx5/core/cq.c
>index 4caa1b6f40ba..25f3b26db729 100644
>--- a/drivers/net/ethernet/mellanox/mlx5/core/cq.c
>+++ b/drivers/net/ethernet/mellanox/mlx5/core/cq.c
>@@ -69,22 +69,27 @@ void mlx5_cq_tasklet_cb(struct tasklet_struct *t)
> static void mlx5_add_cq_to_tasklet(struct mlx5_core_cq *cq,
> 				   struct mlx5_eqe *eqe)
> {
> 	unsigned long flags;
> 	struct mlx5_eq_tasklet *tasklet_ctx = cq->tasklet_ctx.priv;
>+	bool schedule_tasklet = false;
>
> 	spin_lock_irqsave(&tasklet_ctx->lock, flags);
> 	/* When migrating CQs between EQs will be implemented, please note
> 	 * that you need to sync this point. It is possible that
> 	 * while migrating a CQ, completions on the old EQs could
> 	 * still arrive.
> 	 */
> 	if (list_empty_careful(&cq->tasklet_ctx.list)) {
> 		mlx5_cq_hold(cq);

The condition here is counterintuitive; please add a comment that relates
to the tasklet routine mlx5_cq_tasklet_cb(), something like:
/* If this list isn't empty, the tasklet is already scheduled and not yet
 * executing the list. The spinlock here guarantees the addition of this CQ
 * will be seen by the next execution, so rescheduling the tasklet is not
 * required.
 */

One other way to do this is to set tasklet_ctx.sched_flag = true inside
mlx5_add_cq_to_tasklet(), and then schedule once at the end of EQ IRQ processing
if the flag is set, to avoid "too early" scheduling. But since the tasklet
can't run until the IRQ handler returns, I think your solution shouldn't
suffer from "too early" scheduling.

Acked-by: Saeed Mahameed <saeedm@nvidia.com>


>[...]
Caleb Sander Nov. 5, 2024, 8:39 p.m. UTC | #4
On Tue, Nov 5, 2024 at 10:56 AM Saeed Mahameed <saeed@kernel.org> wrote:
>
> On 31 Oct 10:34, Caleb Sander Mateos wrote:
> >Currently, the mlx5_eq_comp_int() interrupt handler schedules a tasklet
> >to call mlx5_cq_tasklet_cb() if it processes any completions. [...]
>
> The condition here is counter intuitive, please add a comment that relates
> to the tasklet routine mlx5_cq_tasklet_cb, something like.
> /* If this list isn't empty, the tasklet is already scheduled, and not yet
>   * executing the list, the spinlock here guarantees the addition of this CQ
>   * will be seen by the next execution, so rescheduling the tasklet is not
>   * required */

Sure, will send out a v4.

>
> One other way to do this, is to flag tasklet_ctx.sched_flag = true, inside
> mlx5_add_cq_to_tasklet, and then schedule once at the end of eq irq processing
> if (tasklet_ctx.sched_flag == true). to avoid "too" early scheduling, but
> since the tasklet can't run until the irq handler returns, I think your
> solution shouldn't suffer from "too" early scheduling ..

Right, that was my thinking behind the list_empty(&tasklet_ctx->list) check.

Patch

diff --git a/drivers/net/ethernet/mellanox/mlx5/core/cq.c b/drivers/net/ethernet/mellanox/mlx5/core/cq.c
index 4caa1b6f40ba..25f3b26db729 100644
--- a/drivers/net/ethernet/mellanox/mlx5/core/cq.c
+++ b/drivers/net/ethernet/mellanox/mlx5/core/cq.c
@@ -69,22 +69,27 @@  void mlx5_cq_tasklet_cb(struct tasklet_struct *t)
 static void mlx5_add_cq_to_tasklet(struct mlx5_core_cq *cq,
 				   struct mlx5_eqe *eqe)
 {
 	unsigned long flags;
 	struct mlx5_eq_tasklet *tasklet_ctx = cq->tasklet_ctx.priv;
+	bool schedule_tasklet = false;
 
 	spin_lock_irqsave(&tasklet_ctx->lock, flags);
 	/* When migrating CQs between EQs will be implemented, please note
 	 * that you need to sync this point. It is possible that
 	 * while migrating a CQ, completions on the old EQs could
 	 * still arrive.
 	 */
 	if (list_empty_careful(&cq->tasklet_ctx.list)) {
 		mlx5_cq_hold(cq);
+		schedule_tasklet = list_empty(&tasklet_ctx->list);
 		list_add_tail(&cq->tasklet_ctx.list, &tasklet_ctx->list);
 	}
 	spin_unlock_irqrestore(&tasklet_ctx->lock, flags);
+
+	if (schedule_tasklet)
+		tasklet_schedule(&tasklet_ctx->task);
 }
 
 /* Callers must verify outbox status in case of err */
 int mlx5_create_cq(struct mlx5_core_dev *dev, struct mlx5_core_cq *cq,
 		   u32 *in, int inlen, u32 *out, int outlen)
diff --git a/drivers/net/ethernet/mellanox/mlx5/core/eq.c b/drivers/net/ethernet/mellanox/mlx5/core/eq.c
index 859dcf09b770..3fd2091c11c8 100644
--- a/drivers/net/ethernet/mellanox/mlx5/core/eq.c
+++ b/drivers/net/ethernet/mellanox/mlx5/core/eq.c
@@ -112,14 +112,14 @@  static int mlx5_eq_comp_int(struct notifier_block *nb,
 	struct mlx5_eq_comp *eq_comp =
 		container_of(nb, struct mlx5_eq_comp, irq_nb);
 	struct mlx5_eq *eq = &eq_comp->core;
 	struct mlx5_eqe *eqe;
 	int num_eqes = 0;
-	u32 cqn = -1;
 
 	while ((eqe = next_eqe_sw(eq))) {
 		struct mlx5_core_cq *cq;
+		u32 cqn;
 
 		/* Make sure we read EQ entry contents after we've
 		 * checked the ownership bit.
 		 */
 		dma_rmb();
@@ -142,13 +142,10 @@  static int mlx5_eq_comp_int(struct notifier_block *nb,
 			break;
 	}
 
 	eq_update_ci(eq, 1);
 
-	if (cqn != -1)
-		tasklet_schedule(&eq_comp->tasklet_ctx.task);
-
 	return 0;
 }
 
 /* Some architectures don't latch interrupts when they are disabled, so using
  * mlx5_eq_poll_irq_disabled could end up losing interrupts while trying to