Message ID | 20241031163436.3732948-1-csander@purestorage.com
---|---
State | Superseded
Series | [net-next,v3] mlx5/core: Schedule EQ comp tasklet only if necessary
On 10/31/24 17:34, Caleb Sander Mateos wrote:
> Currently, the mlx5_eq_comp_int() interrupt handler schedules a tasklet
> to call mlx5_cq_tasklet_cb() if it processes any completions. For CQs
> whose completions don't need to be processed in tasklet context, this
> adds unnecessary overhead. In a heavy TCP workload, we see 4% of CPU
> time spent on the tasklet_trylock() in tasklet_action_common(), with a
> smaller amount spent on the atomic operations in tasklet_schedule(),
> tasklet_clear_sched(), and locking the spinlock in mlx5_cq_tasklet_cb().
> TCP completions are handled by mlx5e_completion_event(), which schedules
> NAPI to poll the queue, so they don't need tasklet processing.
>
> Schedule the tasklet in mlx5_add_cq_to_tasklet() instead to avoid this
> overhead. mlx5_add_cq_to_tasklet() is responsible for enqueuing the CQs
> to be processed in tasklet context, so it can schedule the tasklet. CQs
> that need tasklet processing have their interrupt comp handler set to
> mlx5_add_cq_to_tasklet(), so they will schedule the tasklet. CQs that
> don't need tasklet processing won't schedule the tasklet. To avoid
> scheduling the tasklet multiple times during the same interrupt, only
> schedule the tasklet in mlx5_add_cq_to_tasklet() if the tasklet work
> queue was empty before the new CQ was pushed to it.
>
> The additional branch in mlx5_add_cq_to_tasklet(), called for each EQE,
> may add a small cost for the userspace Infiniband CQs whose completions
> are processed in tasklet context. But this seems worth it to avoid the
> tasklet overhead for CQs that don't need it.
>
> Note that the mlx4 driver works the same way: it schedules the tasklet
> in mlx4_add_cq_to_tasklet() and only if the work queue was empty before.
>
> Signed-off-by: Caleb Sander Mateos <csander@purestorage.com>
> Reviewed-by: Parav Pandit <parav@nvidia.com>

@Saeed, @Leon, @Tariq: I assume you will apply this one and include it in
the next mlx5 PR. Please correct me if I'm wrong.

Thanks,

Paolo
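The change at the heart of the thread is small. A condensed sketch of the patched mlx5_add_cq_to_tasklet(), distilled from the diff at the bottom of this page (surrounding context and the existing CQ-migration comment are elided):

```c
static void mlx5_add_cq_to_tasklet(struct mlx5_core_cq *cq,
				   struct mlx5_eqe *eqe)
{
	struct mlx5_eq_tasklet *tasklet_ctx = cq->tasklet_ctx.priv;
	bool schedule_tasklet = false;
	unsigned long flags;

	spin_lock_irqsave(&tasklet_ctx->lock, flags);
	/* Enqueue the CQ only once; the reference taken here is dropped
	 * by the tasklet after it processes the CQ.
	 */
	if (list_empty_careful(&cq->tasklet_ctx.list)) {
		mlx5_cq_hold(cq);
		/* Only the enqueue that makes the pending list non-empty
		 * needs to schedule the tasklet.
		 */
		schedule_tasklet = list_empty(&tasklet_ctx->list);
		list_add_tail(&cq->tasklet_ctx.list, &tasklet_ctx->list);
	}
	spin_unlock_irqrestore(&tasklet_ctx->lock, flags);

	if (schedule_tasklet)
		tasklet_schedule(&tasklet_ctx->task);
}
```

Note that tasklet_schedule() moves out from under the spinlock and fires at most once per EQ interrupt, while CQs whose comp handler is mlx5e_completion_event() never reach this function at all.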
On 05/11/2024 14:21, Paolo Abeni wrote:
> On 10/31/24 17:34, Caleb Sander Mateos wrote:
>> Currently, the mlx5_eq_comp_int() interrupt handler schedules a tasklet
>> to call mlx5_cq_tasklet_cb() if it processes any completions.
[...]
>
> @Saeed, @Leon, @Tariq: I assume you will apply this one and include it in
> the next mlx5 PR. Please correct me if I'm wrong.

Hi Paolo,

I am doing the mlx5 core/en maintainer work right now. I work differently
from Saeed, as I do not have a kernel.org branch of my own (nor do I use
Saeed's, of course). Please consider applying this one, and any others,
once I acknowledge/review them.

Acked-by: Tariq Toukan <tariqt@nvidia.com>

Regards,
Tariq
On 31 Oct 10:34, Caleb Sander Mateos wrote:
>Currently, the mlx5_eq_comp_int() interrupt handler schedules a tasklet
>to call mlx5_cq_tasklet_cb() if it processes any completions.
[...]
>Signed-off-by: Caleb Sander Mateos <csander@purestorage.com>
>Reviewed-by: Parav Pandit <parav@nvidia.com>
>---
>v3: revise commit message
>v2: reorder variable declarations, describe CPU profile results
>
> drivers/net/ethernet/mellanox/mlx5/core/cq.c | 5 +++++
> drivers/net/ethernet/mellanox/mlx5/core/eq.c | 5 +----
> 2 files changed, 6 insertions(+), 4 deletions(-)
>
>diff --git a/drivers/net/ethernet/mellanox/mlx5/core/cq.c b/drivers/net/ethernet/mellanox/mlx5/core/cq.c
>index 4caa1b6f40ba..25f3b26db729 100644
>--- a/drivers/net/ethernet/mellanox/mlx5/core/cq.c
>+++ b/drivers/net/ethernet/mellanox/mlx5/core/cq.c
>@@ -69,22 +69,27 @@ void mlx5_cq_tasklet_cb(struct tasklet_struct *t)
> static void mlx5_add_cq_to_tasklet(struct mlx5_core_cq *cq,
>                                    struct mlx5_eqe *eqe)
> {
>        unsigned long flags;
>        struct mlx5_eq_tasklet *tasklet_ctx = cq->tasklet_ctx.priv;
>+       bool schedule_tasklet = false;
>
>        spin_lock_irqsave(&tasklet_ctx->lock, flags);
>        /* When migrating CQs between EQs will be implemented, please note
>         * that you need to sync this point. It is possible that
>         * while migrating a CQ, completions on the old EQs could
>         * still arrive.
>         */
>        if (list_empty_careful(&cq->tasklet_ctx.list)) {
>                mlx5_cq_hold(cq);

The condition here is counter-intuitive; please add a comment that relates
it to the tasklet routine mlx5_cq_tasklet_cb(), something like:

/* If this list isn't empty, the tasklet is already scheduled and has not
 * yet executed the list. The spinlock here guarantees that the addition
 * of this CQ will be seen by the next execution, so rescheduling the
 * tasklet is not required.
 */

One other way to do this is to set tasklet_ctx.sched_flag = true inside
mlx5_add_cq_to_tasklet(), and then schedule once at the end of EQ IRQ
processing if (tasklet_ctx.sched_flag == true), to avoid "too early"
scheduling. But since the tasklet can't run until the IRQ handler returns,
I think your solution shouldn't suffer from "too early" scheduling.

Acked-by: Saeed Mahameed <saeedm@nvidia.com>

>+               schedule_tasklet = list_empty(&tasklet_ctx->list);
>                list_add_tail(&cq->tasklet_ctx.list, &tasklet_ctx->list);
>        }
>        spin_unlock_irqrestore(&tasklet_ctx->lock, flags);
>+
>+       if (schedule_tasklet)
>+               tasklet_schedule(&tasklet_ctx->task);
> }
[...]
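For concreteness, here is a hypothetical sketch of the alternative Saeed describes. sched_flag is an invented field (struct mlx5_eq_tasklet has no such member in the posted patch), and the EQE loop body is elided:

```c
/* Alternative (not the posted patch): the comp handler only marks the
 * context; the EQ IRQ handler schedules once after draining the EQ.
 */
static void mlx5_add_cq_to_tasklet_alt(struct mlx5_core_cq *cq,
				       struct mlx5_eqe *eqe)
{
	struct mlx5_eq_tasklet *tasklet_ctx = cq->tasklet_ctx.priv;
	unsigned long flags;

	spin_lock_irqsave(&tasklet_ctx->lock, flags);
	if (list_empty_careful(&cq->tasklet_ctx.list)) {
		mlx5_cq_hold(cq);
		list_add_tail(&cq->tasklet_ctx.list, &tasklet_ctx->list);
		tasklet_ctx->sched_flag = true;	/* invented field */
	}
	spin_unlock_irqrestore(&tasklet_ctx->lock, flags);
}

static int mlx5_eq_comp_int_alt(struct mlx5_eq_comp *eq_comp)
{
	/* ... pop EQEs; each cq->comp() call may set sched_flag ... */

	if (eq_comp->tasklet_ctx.sched_flag) {
		eq_comp->tasklet_ctx.sched_flag = false;
		tasklet_schedule(&eq_comp->tasklet_ctx.task);
	}
	return 0;
}
```

As the thread concludes, the deferral buys nothing here: a tasklet raised from hardirq context cannot run before the handler returns, so the posted patch's in-handler tasklet_schedule() is never "too early", and it avoids adding per-EQ state.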
On Tue, Nov 5, 2024 at 10:56 AM Saeed Mahameed <saeed@kernel.org> wrote:
>
> On 31 Oct 10:34, Caleb Sander Mateos wrote:
> >Currently, the mlx5_eq_comp_int() interrupt handler schedules a tasklet
> >to call mlx5_cq_tasklet_cb() if it processes any completions.
[...]
>
> The condition here is counter-intuitive; please add a comment that relates
> it to the tasklet routine mlx5_cq_tasklet_cb(), something like:
>
> /* If this list isn't empty, the tasklet is already scheduled and has not
>  * yet executed the list. The spinlock here guarantees that the addition
>  * of this CQ will be seen by the next execution, so rescheduling the
>  * tasklet is not required.
>  */

Sure, will send out a v4.

> One other way to do this is to set tasklet_ctx.sched_flag = true inside
> mlx5_add_cq_to_tasklet(), and then schedule once at the end of EQ IRQ
> processing if (tasklet_ctx.sched_flag == true), to avoid "too early"
> scheduling. But since the tasklet can't run until the IRQ handler returns,
> I think your solution shouldn't suffer from "too early" scheduling.

Right, that was my thinking behind the list_empty(&tasklet_ctx->list) check.
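The invariant behind the list_empty() checks is easiest to see from the consumer side. Below is a simplified illustration of a drain-style tasklet callback in the shape of mlx5_cq_tasklet_cb(); it sketches the pattern under discussion rather than the driver's verbatim code (the real callback also enforces a time budget, and the local process_list is a simplification):

```c
static void cq_tasklet_cb_sketch(struct mlx5_eq_tasklet *ctx)
{
	LIST_HEAD(process_list);
	struct mlx5_core_cq *cq, *tmp;
	unsigned long flags;

	/* Take all pending CQs in one shot, under the same lock the
	 * producer holds in mlx5_add_cq_to_tasklet(). Any CQ added
	 * before this splice is guaranteed to be seen here; that is
	 * why a non-empty list means no rescheduling is needed.
	 */
	spin_lock_irqsave(&ctx->lock, flags);
	list_splice_tail_init(&ctx->list, &process_list);
	spin_unlock_irqrestore(&ctx->lock, flags);

	list_for_each_entry_safe(cq, tmp, &process_list, tasklet_ctx.list) {
		list_del_init(&cq->tasklet_ctx.list);
		cq->tasklet_ctx.comp(cq, NULL);	/* user's completion handler */
		mlx5_cq_put(cq);		/* drop the producer's reference */
	}
}
```

Because the splice and the producer's list_add_tail() serialize on the same lock, every CQ on the list when the tasklet was scheduled is drained by that execution, which is exactly the reasoning Saeed asked to have spelled out in a comment.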
```diff
diff --git a/drivers/net/ethernet/mellanox/mlx5/core/cq.c b/drivers/net/ethernet/mellanox/mlx5/core/cq.c
index 4caa1b6f40ba..25f3b26db729 100644
--- a/drivers/net/ethernet/mellanox/mlx5/core/cq.c
+++ b/drivers/net/ethernet/mellanox/mlx5/core/cq.c
@@ -69,22 +69,27 @@ void mlx5_cq_tasklet_cb(struct tasklet_struct *t)
 static void mlx5_add_cq_to_tasklet(struct mlx5_core_cq *cq,
 				   struct mlx5_eqe *eqe)
 {
 	unsigned long flags;
 	struct mlx5_eq_tasklet *tasklet_ctx = cq->tasklet_ctx.priv;
+	bool schedule_tasklet = false;
 
 	spin_lock_irqsave(&tasklet_ctx->lock, flags);
 	/* When migrating CQs between EQs will be implemented, please note
 	 * that you need to sync this point. It is possible that
 	 * while migrating a CQ, completions on the old EQs could
 	 * still arrive.
 	 */
 	if (list_empty_careful(&cq->tasklet_ctx.list)) {
 		mlx5_cq_hold(cq);
+		schedule_tasklet = list_empty(&tasklet_ctx->list);
 		list_add_tail(&cq->tasklet_ctx.list, &tasklet_ctx->list);
 	}
 	spin_unlock_irqrestore(&tasklet_ctx->lock, flags);
+
+	if (schedule_tasklet)
+		tasklet_schedule(&tasklet_ctx->task);
 }
 
 /* Callers must verify outbox status in case of err */
 int mlx5_create_cq(struct mlx5_core_dev *dev, struct mlx5_core_cq *cq,
 		   u32 *in, int inlen, u32 *out, int outlen)
diff --git a/drivers/net/ethernet/mellanox/mlx5/core/eq.c b/drivers/net/ethernet/mellanox/mlx5/core/eq.c
index 859dcf09b770..3fd2091c11c8 100644
--- a/drivers/net/ethernet/mellanox/mlx5/core/eq.c
+++ b/drivers/net/ethernet/mellanox/mlx5/core/eq.c
@@ -112,14 +112,14 @@ static int mlx5_eq_comp_int(struct notifier_block *nb,
 	struct mlx5_eq_comp *eq_comp =
 		container_of(nb, struct mlx5_eq_comp, irq_nb);
 	struct mlx5_eq *eq = &eq_comp->core;
 	struct mlx5_eqe *eqe;
 	int num_eqes = 0;
-	u32 cqn = -1;
 
 	while ((eqe = next_eqe_sw(eq))) {
 		struct mlx5_core_cq *cq;
+		u32 cqn;
 
 		/* Make sure we read EQ entry contents after we've
 		 * checked the ownership bit.
 		 */
 		dma_rmb();
@@ -142,13 +142,10 @@ static int mlx5_eq_comp_int(struct notifier_block *nb,
 			break;
 	}
 
 	eq_update_ci(eq, 1);
 
-	if (cqn != -1)
-		tasklet_schedule(&eq_comp->tasklet_ctx.task);
-
 	return 0;
 }
 
 /* Some architectures don't latch interrupts when they are disabled, so using
  * mlx5_eq_poll_irq_disabled could end up losing interrupts while trying to
```
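Read together with the cq.c hunk, the effect on the EQE loop can be summarized as follows. This is a simplified sketch, with lookup_cq() standing in for the driver's real CQ-table lookup and all error and budget handling elided:

```c
static int eq_comp_int_sketch(struct mlx5_eq *eq)
{
	struct mlx5_eqe *eqe;

	while ((eqe = next_eqe_sw(eq))) {
		struct mlx5_core_cq *cq;
		u32 cqn;

		dma_rmb();	/* read EQE contents only after the ownership check */
		cqn = be32_to_cpu(eqe->data.comp.cqn) & 0xffffff;

		cq = lookup_cq(eq, cqn);	/* stand-in for the real lookup */
		if (cq)
			/* NAPI-driven CQs: mlx5e_completion_event() -> napi_schedule().
			 * Tasklet-driven CQs: mlx5_add_cq_to_tasklet(), which now
			 * calls tasklet_schedule() itself on the first enqueue.
			 */
			cq->comp(cq, eqe);
	}

	eq_update_ci(eq, 1);	/* no unconditional tasklet_schedule() here anymore */
	return 0;
}
```

After the patch, whether any deferred work gets scheduled is decided entirely by which comp handler the CQ's owner installed, not by the EQ loop.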