Message ID | 1717500963-1108-1-git-send-email-kotaranov@linux.microsoft.com (mailing list archive)
---|---
State | Accepted
Series | [rdma-next,1/1] RDMA/mana_ib: process QP error events
On Tue, Jun 04, 2024 at 04:36:03AM -0700, Konstantin Taranov wrote:
> From: Konstantin Taranov <kotaranov@microsoft.com>
>
> Process QP fatal events from the error event queue.
> For that, find the QP, using QPN from the event, and then call its
> event_handler. To find the QPs, store created RC QPs in an xarray.
>
> Signed-off-by: Konstantin Taranov <kotaranov@microsoft.com>
> Reviewed-by: Wei Hu <weh@microsoft.com>
> ---
>  drivers/infiniband/hw/mana/device.c           |  3 ++
>  drivers/infiniband/hw/mana/main.c             | 37 ++++++++++++++++++-
>  drivers/infiniband/hw/mana/mana_ib.h          |  4 ++
>  drivers/infiniband/hw/mana/qp.c               | 11 ++++++
>  .../net/ethernet/microsoft/mana/gdma_main.c   |  1 +
>  include/net/mana/gdma.h                       |  1 +
>  6 files changed, 55 insertions(+), 2 deletions(-)

<...>

> +static void
> +mana_ib_event_handler(void *ctx, struct gdma_queue *q, struct gdma_event *event)
> +{
> +	struct mana_ib_dev *mdev = (struct mana_ib_dev *)ctx;
> +	struct mana_ib_qp *qp;
> +	struct ib_event ev;
> +	unsigned long flag;
> +	u32 qpn;
> +
> +	switch (event->type) {
> +	case GDMA_EQE_RNIC_QP_FATAL:
> +		qpn = event->details[0];
> +		xa_lock_irqsave(&mdev->qp_table_rq, flag);
> +		qp = xa_load(&mdev->qp_table_rq, qpn);
> +		if (qp)
> +			refcount_inc(&qp->refcount);
> +		xa_unlock_irqrestore(&mdev->qp_table_rq, flag);
> +		if (!qp)
> +			break;
> +		if (qp->ibqp.event_handler) {
> +			ev.device = qp->ibqp.device;
> +			ev.element.qp = &qp->ibqp;
> +			ev.event = IB_EVENT_QP_FATAL;
> +			qp->ibqp.event_handler(&ev, qp->ibqp.qp_context);
> +		}
> +		if (refcount_dec_and_test(&qp->refcount))
> +			complete(&qp->free);
> +		break;
> +	default:
> +		break;
> +	}
> +}

<...>

> @@ -620,6 +626,11 @@ static int mana_ib_destroy_rc_qp(struct mana_ib_qp *qp, struct ib_udata *udata)
>  		container_of(qp->ibqp.device, struct mana_ib_dev, ib_dev);
>  	int i;
>
> +	xa_erase_irq(&mdev->qp_table_rq, qp->ibqp.qp_num);
> +	if (refcount_dec_and_test(&qp->refcount))
> +		complete(&qp->free);
> +	wait_for_completion(&qp->free);

This flow is unclear to me. You are destroying the QP and need to make sure
that mana_ib_event_handler is not running at that point and does not race
with the refcount and complete.

Thanks
> > <...>
> >
> > +	xa_erase_irq(&mdev->qp_table_rq, qp->ibqp.qp_num);
> > +	if (refcount_dec_and_test(&qp->refcount))
> > +		complete(&qp->free);
> > +	wait_for_completion(&qp->free);
>
> This flow is unclear to me. You are destroying the QP and need to make sure
> that mana_ib_event_handler is not running at that point and does not race
> with the refcount and complete.

Hi, Leon. Thanks for the concern. Here is the clarification:

The flow is similar to what mlx5 does with mlx5_get_rsc and mlx5_core_put_rsc.
When we get a QP resource, we increment the counter while holding the xa lock.
So, when we destroy a QP, the code first removes the QP from the xa to ensure
that nobody can get it. Then we check whether mana_ib_event_handler is holding
it with refcount_dec_and_test. If the QP is held, we wait for
mana_ib_event_handler to call complete. Once the wait returns, it is
impossible for the QP to be referenced again, since it is not in the xa and
all references have been dropped. Only then do we release the QP in HW, so
the QPN can be assigned to new QPs.

Leon, have you spotted a bug? Or did you just want to understand the flow?

Thanks
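The get/erase/wait pattern described above can be sketched as a small userspace model. This is a sketch under stated assumptions, not the driver code: `qp_get`, `qp_put`, and `qp_destroy` are illustrative names, a boolean stands in for xarray membership and for the completion, and C11 atomics stand in for `refcount_t`.

```c
#include <stdatomic.h>
#include <stdbool.h>
#include <stddef.h>

/* Hypothetical userspace model of the lookup-with-refcount pattern. */
struct qp {
        atomic_int refcount;    /* one reference owned by the creator */
        bool in_table;          /* stands in for xarray membership */
        bool freed;             /* set when the last reference is dropped */
};

static void qp_init(struct qp *qp)
{
        atomic_init(&qp->refcount, 1);
        qp->in_table = true;
        qp->freed = false;
}

/* Event-handler side: only take a reference while the entry is still
 * published, so destroy cannot free the QP underneath us. */
static struct qp *qp_get(struct qp *entry)
{
        if (!entry || !entry->in_table)
                return NULL;
        atomic_fetch_add(&entry->refcount, 1);
        return entry;
}

static void qp_put(struct qp *qp)
{
        /* fetch_sub returns the old value: 1 means we dropped the last ref */
        if (atomic_fetch_sub(&qp->refcount, 1) == 1)
                qp->freed = true;       /* stands in for complete(&qp->free) */
}

/* Destroy side: unpublish first, then drop the creator's reference; the
 * real code then blocks in wait_for_completion(&qp->free). */
static void qp_destroy(struct qp *qp)
{
        qp->in_table = false;           /* stands in for xa_erase_irq() */
        qp_put(qp);
}
```

In the single-threaded usage below, destroy does not "free" the object while the handler still holds it, and an erased entry can no longer be looked up.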
> Subject: [PATCH rdma-next 1/1] RDMA/mana_ib: process QP error events
>
> From: Konstantin Taranov <kotaranov@microsoft.com>
>
> Process QP fatal events from the error event queue.
> For that, find the QP, using QPN from the event, and then call its event_handler.
> To find the QPs, store created RC QPs in an xarray.
>
> Signed-off-by: Konstantin Taranov <kotaranov@microsoft.com>
> Reviewed-by: Wei Hu <weh@microsoft.com>

<...>

> +	switch (event->type) {
> +	case GDMA_EQE_RNIC_QP_FATAL:
> +		qpn = event->details[0];
> +		xa_lock_irqsave(&mdev->qp_table_rq, flag);
> +		qp = xa_load(&mdev->qp_table_rq, qpn);
> +		if (qp)
> +			refcount_inc(&qp->refcount);

Move this to after checking for "if (!qp) break".

> +		xa_unlock_irqrestore(&mdev->qp_table_rq, flag);
> +		if (!qp)
> +			break;

<...>

> +	xa_erase_irq(&mdev->qp_table_rq, qp->ibqp.qp_num);
> +	if (refcount_dec_and_test(&qp->refcount))
> +		complete(&qp->free);
> +	wait_for_completion(&qp->free);

Strange logic. Why not do:

	if (!refcount_dec_and_test(&qp->refcount))
		wait_for_completion(&qp->free);
On Tue, Jun 04, 2024 at 02:13:39PM +0000, Konstantin Taranov wrote:

<...>

> > This flow is unclear to me. You are destroying the QP and need to make
> > sure that mana_ib_event_handler is not running at that point and does not
> > race with the refcount and complete.
>
> Hi, Leon. Thanks for the concern. Here is the clarification:
> The flow is similar to what mlx5 does with mlx5_get_rsc and mlx5_core_put_rsc.
> When we get a QP resource, we increment the counter while holding the xa lock.
> So, when we destroy a QP, the code first removes the QP from the xa to
> ensure that nobody can get it.
> And then we check whether mana_ib_event_handler is holding it with
> refcount_dec_and_test.
> If the QP is held, then we wait for the mana_ib_event_handler to call complete.
> Once the wait returns, it is impossible to get the QP referenced, since it
> is not in the xa and all references have been removed.
> So now we can release the QP in HW, so the QPN can be assigned to new QPs.
>
> Leon, have you spotted a bug? Or just wanted to understand the flow?

I understand the "general" flow, but think that the implementation is very
convoluted here. In addition, mlx5 and other drivers make sure that the HW
object is not in use before they free it. They don't "mess" with ibqp, and
probably you should do the same.

Thanks
> > +	switch (event->type) {
> > +	case GDMA_EQE_RNIC_QP_FATAL:
> > +		qpn = event->details[0];
> > +		xa_lock_irqsave(&mdev->qp_table_rq, flag);
> > +		qp = xa_load(&mdev->qp_table_rq, qpn);
> > +		if (qp)
> > +			refcount_inc(&qp->refcount);
>
> Move this to after checking for "if (!qp) break".

Then we can have a race condition. Imagine that the EQE came while we were
trying to destroy the QP and we got the following execution order:

     EQE                | QP destroy
  1  xa_lock            |
  2  qp = xa_load       |
  3  xa_unlock          |
  4                     | xa_erase_irq
  5                     | refcount_dec
  6                     | complete
  7  refcount_inc       | <-------- BUG

> > +		xa_unlock_irqrestore(&mdev->qp_table_rq, flag);
> > +		if (!qp)
> > +			break;

<...>

> > +		if (refcount_dec_and_test(&qp->refcount))
> > +			complete(&qp->free);
> > +		break;
>
> Strange logic. Why not do:
>	if (!refcount_dec_and_test(&qp->refcount))
>		wait_for_completion(&qp->free);

It might work, but the logic will be even stranger and it will prevent some
debugging. With the proposed change, qp->free may not be completed even
though the counter is 0. As a result, the change makes an incorrect state an
expected state, thereby making bugs with that side effect undetectable.
E.g., suppose we have a use-after-free bug and we try to trace whether the qp
was in use. Plus, it is good practice to deinit everything that was inited;
the proposed change violates that.
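The race Konstantin describes can be replayed deterministically in a small userspace model. This is illustrative only (plain ints stand in for `refcount_t`, booleans for the xarray and the completion); it replays both orderings in a single thread, so the "lock" is implicit in the statement order.

```c
#include <stdbool.h>
#include <stddef.h>

/* Toy model of the QP state touched by the race. */
struct model_qp {
        int  refcount;          /* 1 at create time */
        bool in_table;
        bool freed;
};

/* Replay with refcount_inc moved after xa_unlock, as the review comment
 * suggested. Returns true if the handler touches freed memory (the BUG
 * step in the interleaving table). */
static bool replay_inc_after_unlock(void)
{
        struct model_qp qp = { .refcount = 1, .in_table = true, .freed = false };
        struct model_qp *seen;

        /* EQE side, steps 1-3: lock, load, unlock -- no reference taken yet */
        seen = qp.in_table ? &qp : NULL;

        /* destroy side, steps 4-6: erase, drop the creator's ref, complete */
        qp.in_table = false;
        if (--qp.refcount == 0)
                qp.freed = true;

        /* EQE side, step 7: the deferred refcount_inc lands on freed memory */
        if (seen && seen->freed)
                return true;    /* use-after-free */
        return false;
}

/* Replay with the increment taken under the lock, as in the patch: destroy's
 * refcount_dec cannot reach zero while the handler holds the QP. */
static bool replay_inc_under_lock(void)
{
        struct model_qp qp = { .refcount = 1, .in_table = true, .freed = false };
        struct model_qp *seen;

        /* EQE side, steps 1-3: the reference is taken before unlocking */
        seen = qp.in_table ? &qp : NULL;
        if (seen)
                seen->refcount++;

        /* destroy side: erase and drop the creator's ref -- count stays > 0 */
        qp.in_table = false;
        if (--qp.refcount == 0)
                qp.freed = true;

        /* EQE side finishes; the free can only happen on this last put */
        if (seen && seen->freed)
                return true;
        if (seen && --seen->refcount == 0)
                qp.freed = true;
        return false;
}
```

The first replay reproduces the bug; the second shows why the patch takes the reference under the xa lock.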
<...>

> > Leon, have you spotted a bug? Or just wanted to understand the flow?
>
> I understand the "general" flow, but think that the implementation is very
> convoluted here. In addition, mlx5 and other drivers make sure that the HW
> object is not in use before they free it. They don't "mess" with ibqp, and
> probably you should do the same.

I can make the code more readable by introducing 4 helpers: add_qp_ref,
get_qp_ref, put_qp_ref, destroy_qp_ref. Would that make the code less
convoluted for you?

The devices are different. Mana can deliver an EQE with a QPN that may
already have been destroyed by the OS. With that reference counting I make
sure we do not destroy a QP which is being used in EQE processing. We still
destroy the HW resource at the same time as before the change. The xa table
is just a lookup table, which says whether an object can be referenced or
not. The change just dictates that we first make the QP unreferenceable.

Note that if we removed the QP from the table after destroying it in HW, we
could have a bug due to a collision in the xa table when another entity
creates a QP at the same time. Since the QPN is released in HW, it will most
likely be given to the next create_qp (mana, unlike mlx, does not assign QPNs
with an increment and hands out recently used QPNs). So, that create_qp could
fail because the QPN is still present in the xa.

And what do you mean by "don't 'mess' with ibqp"? Are you saying that we
cannot process QP-related interrupts? What do you propose to change? In any
case it will be the same logic:
1) remove from table
2) make sure that the current IRQ does not hold a reference
I use counters for that, as most other IB providers do.

I would also appreciate a list of changes to make this patch approved, or
confirmation that no changes are required.

Thanks
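The four helpers are only proposed at this point in the thread, so the following is a hypothetical userspace sketch of what they might look like, not the eventual v2 code: a mutex stands in for the xarray lock, a fixed-size pointer array for the xarray itself, and a flag for `complete(&qp->free)`.

```c
#include <pthread.h>
#include <stdbool.h>
#include <stddef.h>

#define TABLE_SIZE 64

struct sk_qp {
        int  refcount;
        bool free_completed;    /* stands in for struct completion free */
};

static struct sk_qp *qp_table[TABLE_SIZE];
static pthread_mutex_t table_lock = PTHREAD_MUTEX_INITIALIZER;

/* create path: publish the QP with one reference held by the creator */
static int add_qp_ref(unsigned int qpn, struct sk_qp *qp)
{
        pthread_mutex_lock(&table_lock);
        if (qp_table[qpn % TABLE_SIZE]) {
                pthread_mutex_unlock(&table_lock);
                return -1;      /* QPN collision, like xa_insert_irq failing */
        }
        qp->refcount = 1;
        qp->free_completed = false;
        qp_table[qpn % TABLE_SIZE] = qp;
        pthread_mutex_unlock(&table_lock);
        return 0;
}

/* event-handler path: lookup and reference acquisition under one lock */
static struct sk_qp *get_qp_ref(unsigned int qpn)
{
        struct sk_qp *qp;

        pthread_mutex_lock(&table_lock);
        qp = qp_table[qpn % TABLE_SIZE];
        if (qp)
                qp->refcount++;
        pthread_mutex_unlock(&table_lock);
        return qp;
}

static void put_qp_ref(struct sk_qp *qp)
{
        pthread_mutex_lock(&table_lock);
        if (--qp->refcount == 0)
                qp->free_completed = true;      /* complete(&qp->free) */
        pthread_mutex_unlock(&table_lock);
}

/* destroy path: unpublish first, then drop the creator's reference; the
 * caller would wait_for_completion() before freeing the HW resource */
static void destroy_qp_ref(unsigned int qpn, struct sk_qp *qp)
{
        pthread_mutex_lock(&table_lock);
        qp_table[qpn % TABLE_SIZE] = NULL;      /* xa_erase_irq() */
        pthread_mutex_unlock(&table_lock);
        put_qp_ref(qp);
}
```

The ordering argument from the thread falls out of the helpers: `destroy_qp_ref` unpublishes before dropping the reference, so a concurrent `get_qp_ref` either sees the entry and pins it, or sees nothing.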
On Thu, Jun 06, 2024 at 09:30:51AM +0000, Konstantin Taranov wrote:

<...>

> I can make the code more readable by introducing 4 helpers: add_qp_ref,
> get_qp_ref, put_qp_ref, destroy_qp_ref.
> Would that make the code less convoluted for you?

The thing is that you are trying to open-code part of the restrack logic,
which already has a per-QPN xarray DB; maybe you should extend it to support
your case by adding some sort of barrier to prevent the QP from being used.

> The devices are different. Mana can deliver an EQE with a QPN that may
> already have been destroyed by the OS. With that reference counting I make
> sure we do not destroy a QP which is being used in EQE processing. We still
> destroy the HW resource at the same time as before the change.
> The xa table is just a lookup table, which says whether an object can be
> referenced or not. The change just dictates that we first make the QP
> unreferenceable.
>
> Note that if we removed the QP from the table after destroying it in HW, we
> could have a bug due to a collision in the xa table when another entity
> creates a QP at the same time. Since the QPN is released in HW, it will
> most likely be given to the next create_qp (mana, unlike mlx, does not
> assign QPNs with an increment and hands out recently used QPNs). So, that
> create_qp could fail because the QPN is still present in the xa.
>
> And what do you mean by "don't 'mess' with ibqp"? Are you saying that we
> cannot process QP-related interrupts? What do you propose to change? In any
> case it will be the same logic:
> 1) remove from table
> 2) make sure that the current IRQ does not hold a reference
> I use counters for that, as most other IB providers do.
>
> I would also appreciate a list of changes to make this patch approved, or
> confirmation that no changes are required.

I'm not asking to change anything at this point, just trying to see if there
is a more general way to solve this problem, which exists in all drivers now
and in the future.

Thanks
> -----Original Message-----
> From: Leon Romanovsky <leon@kernel.org>
> Sent: Thursday, 6 June 2024 12:51
> To: Konstantin Taranov <kotaranov@microsoft.com>
> Cc: Konstantin Taranov <kotaranov@linux.microsoft.com>; Wei Hu
> <weh@microsoft.com>; sharmaajay@microsoft.com; Long Li
> <longli@microsoft.com>; jgg@ziepe.ca; linux-rdma@vger.kernel.org;
> linux-kernel@vger.kernel.org
> Subject: [EXTERNAL] Re: [PATCH rdma-next 1/1] RDMA/mana_ib: process QP
> error events

<...>

> > I can make the code more readable by introducing 4 helpers: add_qp_ref,
> > get_qp_ref, put_qp_ref, destroy_qp_ref.
> > Would that make the code less convoluted for you?
>
> The thing is that you are trying to open-code part of the restrack logic,
> which already has a per-QPN xarray DB; maybe you should extend it to
> support your case by adding some sort of barrier to prevent the QP from
> being used.

Thanks for the suggestion. I can have a look later to suggest something, but
I suspect it is hard to generalize for mana. Another blocker, which is not in
this patch, is that in mana one RC QP has up to 5 ids (one per work queue),
where only one of them is the QPN. We can get CQEs that reference any of the
5 ids, which may not be supported by the restrack logic. So, in future
patches where I add support for send/recv/poll in the kernel, I will add more
indexes to the table, where most of them are not QPNs but work queue ids.

Anyway, I think making helpers at this point is a good idea, as I will get
fewer questions later when I send the polling support. So, I will send v2
tomorrow with the aforementioned helpers.

Thanks

<...>

> I'm not asking to change anything at this point, just trying to see if
> there is a more general way to solve this problem, which exists in all
> drivers now and in the future.

Thanks!
> > Strange logic. Why not do:
> >	if (!refcount_dec_and_test(&qp->refcount))
> >		wait_for_completion(&qp->free);
>
> It might work, but the logic will be even stranger and it will prevent some
> debugging. With the proposed change, qp->free may not be completed even
> though the counter is 0.

Why is this a problem? mana_ib_destroy_rc_qp() is the only one waiting on it?

> As a result, the change makes an incorrect state an expected state, thereby
> making bugs with that side effect undetectable.
> E.g., we have a "use after free" bug and then we try to trace whether the
> qp was in use.

I don't get it. Can you explain why?

> Plus, it is good practice to deinit everything that was inited. With the
> proposed change that is violated.

You shouldn't call wait_for_completion if it's not needed. This is not a
"deinit".
> -----Original Message-----
> From: Long Li <longli@microsoft.com>
> Sent: Friday, 7 June 2024 04:45
> To: Konstantin Taranov <kotaranov@microsoft.com>; Konstantin Taranov
> <kotaranov@linux.microsoft.com>; Wei Hu <weh@microsoft.com>;
> sharmaajay@microsoft.com; jgg@ziepe.ca; leon@kernel.org
> Cc: linux-rdma@vger.kernel.org; linux-kernel@vger.kernel.org
> Subject: RE: [PATCH rdma-next 1/1] RDMA/mana_ib: process QP error events

<...>

> > With the proposed change, qp->free may not be completed even though the
> > counter is 0.
>
> Why is this a problem? mana_ib_destroy_rc_qp() is the only one waiting on it?

Sure, it is not a problem if you do not have a bug. The code is subject to
change, and bugs could appear.

> > As a result, the change makes an incorrect state an expected state,
> > thereby making bugs with that side effect undetectable.
> > E.g., we have a "use after free" bug and then we try to trace whether the
> > qp was in use.
>
> I don't get it. Can you explain why?

Please re-read my explanation. Also, please check the kernel code of other
drivers that use wait_for_completion. Many of them do the same three lines as
I do in this patch.

> > Plus, it is good practice to deinit everything that was inited. With the
> > proposed change that is violated.
>
> You shouldn't call wait_for_completion if it's not needed. This is not a
> "deinit".

See your message: you proposed to remove the complete as well.
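The disagreement in this subthread is narrow: the two destroy-side variants behave identically when the event handler holds a reference, and differ only in the uncontended case, where the review's variant never signals the completion at all. A toy single-threaded model (illustrative names; booleans stand in for the completion, and no real blocking happens):

```c
#include <stdbool.h>

/* "completed" stands in for complete(&qp->free); "waited" records whether
 * the destroyer would reach wait_for_completion(). */
struct dtor_result {
        bool completed;
        bool waited;
};

/* Variant from the patch: always drop the ref, always wait. In the
 * uncontended case the wait returns immediately because the completion
 * was just signaled. */
static struct dtor_result destroy_as_posted(int refcount)
{
        struct dtor_result r = { false, false };

        if (--refcount == 0)            /* refcount_dec_and_test() */
                r.completed = true;     /* complete(&qp->free) */
        r.waited = true;                /* wait_for_completion(&qp->free) */
        return r;
}

/* Variant proposed in review: wait only if someone else holds a reference.
 * When uncontended, the completion is never signaled. */
static struct dtor_result destroy_as_proposed(int refcount)
{
        struct dtor_result r = { false, false };

        if (--refcount != 0)            /* !refcount_dec_and_test() */
                r.waited = true;        /* handler will complete() later */
        return r;
}
```

With refcount == 1 (only the creator's reference left), the posted variant both signals and waits, while the proposed one does neither, which is exactly the "counter is 0 but qp->free was never completed" state Konstantin objects to.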
diff --git a/drivers/infiniband/hw/mana/device.c b/drivers/infiniband/hw/mana/device.c
index 9a7da2e..9eb714e 100644
--- a/drivers/infiniband/hw/mana/device.c
+++ b/drivers/infiniband/hw/mana/device.c
@@ -126,6 +126,7 @@ static int mana_ib_probe(struct auxiliary_device *adev,
 	if (ret)
 		goto destroy_eqs;
 
+	xa_init_flags(&dev->qp_table_rq, XA_FLAGS_LOCK_IRQ);
 	ret = mana_ib_gd_config_mac(dev, ADDR_OP_ADD, mac_addr);
 	if (ret) {
 		ibdev_err(&dev->ib_dev, "Failed to add Mac address, ret %d",
@@ -143,6 +144,7 @@ static int mana_ib_probe(struct auxiliary_device *adev,
 	return 0;
 
 destroy_rnic:
+	xa_destroy(&dev->qp_table_rq);
 	mana_ib_gd_destroy_rnic_adapter(dev);
 destroy_eqs:
 	mana_ib_destroy_eqs(dev);
@@ -158,6 +160,7 @@ static void mana_ib_remove(struct auxiliary_device *adev)
 	struct mana_ib_dev *dev = dev_get_drvdata(&adev->dev);
 
 	ib_unregister_device(&dev->ib_dev);
+	xa_destroy(&dev->qp_table_rq);
 	mana_ib_gd_destroy_rnic_adapter(dev);
 	mana_ib_destroy_eqs(dev);
 	mana_gd_deregister_device(dev->gdma_dev);
diff --git a/drivers/infiniband/hw/mana/main.c b/drivers/infiniband/hw/mana/main.c
index 365b4f1..dfcfb88 100644
--- a/drivers/infiniband/hw/mana/main.c
+++ b/drivers/infiniband/hw/mana/main.c
@@ -667,6 +667,39 @@ int mana_ib_gd_query_adapter_caps(struct mana_ib_dev *dev)
 	return 0;
 }
 
+static void
+mana_ib_event_handler(void *ctx, struct gdma_queue *q, struct gdma_event *event)
+{
+	struct mana_ib_dev *mdev = (struct mana_ib_dev *)ctx;
+	struct mana_ib_qp *qp;
+	struct ib_event ev;
+	unsigned long flag;
+	u32 qpn;
+
+	switch (event->type) {
+	case GDMA_EQE_RNIC_QP_FATAL:
+		qpn = event->details[0];
+		xa_lock_irqsave(&mdev->qp_table_rq, flag);
+		qp = xa_load(&mdev->qp_table_rq, qpn);
+		if (qp)
+			refcount_inc(&qp->refcount);
+		xa_unlock_irqrestore(&mdev->qp_table_rq, flag);
+		if (!qp)
+			break;
+		if (qp->ibqp.event_handler) {
+			ev.device = qp->ibqp.device;
+			ev.element.qp = &qp->ibqp;
+			ev.event = IB_EVENT_QP_FATAL;
+			qp->ibqp.event_handler(&ev, qp->ibqp.qp_context);
+		}
+		if (refcount_dec_and_test(&qp->refcount))
+			complete(&qp->free);
+		break;
+	default:
+		break;
+	}
+}
+
 int mana_ib_create_eqs(struct mana_ib_dev *mdev)
 {
 	struct gdma_context *gc = mdev_to_gc(mdev);
@@ -676,7 +709,7 @@ int mana_ib_create_eqs(struct mana_ib_dev *mdev)
 	spec.type = GDMA_EQ;
 	spec.monitor_avl_buf = false;
 	spec.queue_size = EQ_SIZE;
-	spec.eq.callback = NULL;
+	spec.eq.callback = mana_ib_event_handler;
 	spec.eq.context = mdev;
 	spec.eq.log2_throttle_limit = LOG2_EQ_THROTTLE;
 	spec.eq.msix_index = 0;
@@ -691,7 +724,7 @@ int mana_ib_create_eqs(struct mana_ib_dev *mdev)
 		err = -ENOMEM;
 		goto destroy_fatal_eq;
 	}
-
+	spec.eq.callback = NULL;
 	for (i = 0; i < mdev->ib_dev.num_comp_vectors; i++) {
 		spec.eq.msix_index = (i + 1) % gc->num_msix_usable;
 		err = mana_gd_create_mana_eq(mdev->gdma_dev, &spec, &mdev->eqs[i]);
diff --git a/drivers/infiniband/hw/mana/mana_ib.h b/drivers/infiniband/hw/mana/mana_ib.h
index 60bc548..b732555 100644
--- a/drivers/infiniband/hw/mana/mana_ib.h
+++ b/drivers/infiniband/hw/mana/mana_ib.h
@@ -62,6 +62,7 @@ struct mana_ib_dev {
 	mana_handle_t adapter_handle;
 	struct gdma_queue *fatal_err_eq;
 	struct gdma_queue **eqs;
+	struct xarray qp_table_rq;
 	struct mana_ib_adapter_caps adapter_caps;
 };
 
@@ -124,6 +125,9 @@ struct mana_ib_qp {
 
 	/* The port on the IB device, starting with 1 */
 	u32 port;
+
+	refcount_t refcount;
+	struct completion free;
 };
 
 struct mana_ib_ucontext {
diff --git a/drivers/infiniband/hw/mana/qp.c b/drivers/infiniband/hw/mana/qp.c
index 34a9372..3f4fcc9 100644
--- a/drivers/infiniband/hw/mana/qp.c
+++ b/drivers/infiniband/hw/mana/qp.c
@@ -460,6 +460,12 @@ static int mana_ib_create_rc_qp(struct ib_qp *ibqp, struct ib_pd *ibpd,
 		}
 	}
 
+	refcount_set(&qp->refcount, 1);
+	init_completion(&qp->free);
+	err = xa_insert_irq(&mdev->qp_table_rq, qp->ibqp.qp_num, qp, GFP_KERNEL);
+	if (err)
+		goto destroy_qp;
+
 	return 0;
 
 destroy_qp:
@@ -620,6 +626,11 @@ static int mana_ib_destroy_rc_qp(struct mana_ib_qp *qp, struct ib_udata *udata)
 		container_of(qp->ibqp.device, struct mana_ib_dev, ib_dev);
 	int i;
 
+	xa_erase_irq(&mdev->qp_table_rq, qp->ibqp.qp_num);
+	if (refcount_dec_and_test(&qp->refcount))
+		complete(&qp->free);
+	wait_for_completion(&qp->free);
+
 	/* Ignore return code as there is not much we can do about it.
 	 * The error message is printed inside.
 	 */
diff --git a/drivers/net/ethernet/microsoft/mana/gdma_main.c b/drivers/net/ethernet/microsoft/mana/gdma_main.c
index d33b272..ab8adac 100644
--- a/drivers/net/ethernet/microsoft/mana/gdma_main.c
+++ b/drivers/net/ethernet/microsoft/mana/gdma_main.c
@@ -380,6 +380,7 @@ static void mana_gd_process_eqe(struct gdma_queue *eq)
 	case GDMA_EQE_HWC_INIT_EQ_ID_DB:
 	case GDMA_EQE_HWC_INIT_DATA:
 	case GDMA_EQE_HWC_INIT_DONE:
+	case GDMA_EQE_RNIC_QP_FATAL:
 		if (!eq->eq.callback)
 			break;
 
diff --git a/include/net/mana/gdma.h b/include/net/mana/gdma.h
index 2768413..44c797d 100644
--- a/include/net/mana/gdma.h
+++ b/include/net/mana/gdma.h
@@ -60,6 +60,7 @@ enum gdma_eqe_type {
 	GDMA_EQE_HWC_INIT_DONE		= 131,
 	GDMA_EQE_HWC_SOC_RECONFIG	= 132,
 	GDMA_EQE_HWC_SOC_RECONFIG_DATA	= 133,
+	GDMA_EQE_RNIC_QP_FATAL		= 176,
 };
 
 enum {