Message ID | 20220404215059.39819-4-rpearsonhpe@gmail.com (mailing list archive) |
---|---|
State | Superseded |
Headers | show |
Series | Fix race conditions in rxe_pool | expand |
On Mon, Apr 04, 2022 at 04:50:53PM -0500, Bob Pearson wrote: > In the tasklets (completer, responder, and requester) check the > return value from rxe_get() to detect failures to get a reference. > This only occurs if the qp has had its reference count drop to > zero which indicates that it no longer should be used. This is > in preparation to an upcoming change that will move the qp cleanup > code to rxe_qp_cleanup(). These need some comments explaining how this is safe.. It looks to me like it works because the 0 ref keeps the memory alive while a work queue triggers rxe_cleanup_task() (though who fences the responder task?) At least after the next patch, I'm a little unclear how this works at this moment.. Jason
On 4/8/22 12:52, Jason Gunthorpe wrote: > On Mon, Apr 04, 2022 at 04:50:53PM -0500, Bob Pearson wrote: >> In the tasklets (completer, responder, and requester) check the >> return value from rxe_get() to detect failures to get a reference. >> This only occurs if the qp has had its reference count drop to >> zero which indicates that it no longer should be used. This is >> in preparation to an upcoming change that will move the qp cleanup >> code to rxe_qp_cleanup(). > > These need some comments explaining how this is safe.. > > It looks to me like it works because the 0 ref keeps the memory alive > while a work queue triggers rxe_cleanup_task() (though who fences the > responder task?) > > At least after the next patch, I'm a little unclear how this works > at this moment.. > > Jason I started writing the comment (here) If rxe_get() fails qp is not going to be around for long because its ref count has gone to zero and rxe_complete() is cleaning up and returning to rdma-core which will free the qp. However rxe_do_qp_cleanup() has to finish first and it will wait for the tasklets to finish running. This fixes a hard bug to solve since the code calling rxe_run_task() will hold a valid reference on qp but the tasklet can be deferred until later and that reference may be gone when the tasklet starts. but I realized that at the end of the day there isn't a problem because complete/wait_for_completion together with qp cleanup code shutting down the tasklets means that the race won't happen once the series is all in place. So I will just drop that patch. Bob
diff --git a/drivers/infiniband/sw/rxe/rxe_comp.c b/drivers/infiniband/sw/rxe/rxe_comp.c index 138b3e7d3a5f..da3a398053b8 100644 --- a/drivers/infiniband/sw/rxe/rxe_comp.c +++ b/drivers/infiniband/sw/rxe/rxe_comp.c @@ -562,7 +562,8 @@ int rxe_completer(void *arg) enum comp_state state; int ret = 0; - rxe_get(qp); + if (!rxe_get(qp)) + return -EAGAIN; if (!qp->valid || qp->req.state == QP_STATE_ERROR || qp->req.state == QP_STATE_RESET) { diff --git a/drivers/infiniband/sw/rxe/rxe_req.c b/drivers/infiniband/sw/rxe/rxe_req.c index ae5fbc79dd5c..27aba921cc66 100644 --- a/drivers/infiniband/sw/rxe/rxe_req.c +++ b/drivers/infiniband/sw/rxe/rxe_req.c @@ -611,7 +611,8 @@ int rxe_requester(void *arg) struct rxe_ah *ah; struct rxe_av *av; - rxe_get(qp); + if (!rxe_get(qp)) + return -EAGAIN; next_wqe: if (unlikely(!qp->valid || qp->req.state == QP_STATE_ERROR)) diff --git a/drivers/infiniband/sw/rxe/rxe_resp.c b/drivers/infiniband/sw/rxe/rxe_resp.c index 16fc7ea1298d..1ed45c192cf5 100644 --- a/drivers/infiniband/sw/rxe/rxe_resp.c +++ b/drivers/infiniband/sw/rxe/rxe_resp.c @@ -1250,7 +1250,8 @@ int rxe_responder(void *arg) struct rxe_pkt_info *pkt = NULL; int ret = 0; - rxe_get(qp); + if (!rxe_get(qp)) + return -EAGAIN; qp->resp.aeth_syndrome = AETH_ACK_UNLIMITED;
In the tasklets (completer, responder, and requester) check the return value from rxe_get() to detect failures to get a reference. This only occurs if the qp has had its reference count drop to zero which indicates that it no longer should be used. This is in preparation to an upcoming change that will move the qp cleanup code to rxe_qp_cleanup(). Signed-off-by: Bob Pearson <rpearsonhpe@gmail.com> --- drivers/infiniband/sw/rxe/rxe_comp.c | 3 ++- drivers/infiniband/sw/rxe/rxe_req.c | 3 ++- drivers/infiniband/sw/rxe/rxe_resp.c | 3 ++- 3 files changed, 6 insertions(+), 3 deletions(-)