
[rfc] rdma/verbs: fix a possible uaf when draining a srq attached qp

Message ID 20240526083125.1454440-1-sagi@grimberg.me (mailing list archive)
State RFC

Commit Message

Sagi Grimberg May 26, 2024, 8:31 a.m. UTC
ib_drain_qp does not drain a shared recv queue (because it is
shared). However, in the absence of any guarantee that the recv
completions were consumed, the ulp can reference these completions
after draining the qp and freeing its associated resources, which
is a use-after-free (uaf) [1].

We cannot drain a srq like a normal rq; however, in ib_drain_qp,
once the qp has moved to the error state, we reap the recv_cq once
in order to prevent consumption of recv completions after the drain.

[1]:
--
[199856.569999] Unable to handle kernel paging request at virtual address 002248778adfd6d0
<....>
[199856.721701] Workqueue: ib-comp-wq ib_cq_poll_work [ib_core]
<....>
[199856.827281] Call trace:
[199856.829847]  nvmet_parse_admin_cmd+0x34/0x178 [nvmet]
[199856.835007]  nvmet_req_init+0x2e0/0x378 [nvmet]
[199856.839640]  nvmet_rdma_handle_command+0xa4/0x2e8 [nvmet_rdma]
[199856.845575]  nvmet_rdma_recv_done+0xcc/0x240 [nvmet_rdma]
[199856.851109]  __ib_process_cq+0x84/0xf0 [ib_core]
[199856.855858]  ib_cq_poll_work+0x34/0xa0 [ib_core]
[199856.860587]  process_one_work+0x1ec/0x4a0
[199856.864694]  worker_thread+0x48/0x490
[199856.868453]  kthread+0x158/0x160
[199856.871779]  ret_from_fork+0x10/0x18
--

Signed-off-by: Sagi Grimberg <sagi@grimberg.me>
---
Note this patch is not yet tested, but I am sending it for visibility and
early feedback. While nothing prevents ib_drain_cq from processing a cq
directly (even if it has another context), I am not convinced that all
the upper layers are free of assumptions about a single context
consuming the cq, even while it is being drained. It is also
possible to add ib_reap_cq, which fences the cq poll context before
reaping the cq, but this may have other side-effects.

This crash was seen in the wild and is not easy to reproduce. I suspect
that moving the nvmet_wq to be unbound expedited the teardown process,
exposing a possible uaf for srq-attached qps (a mode that nvmet-rdma
supports).


 drivers/infiniband/core/verbs.c | 11 +++++++++++
 1 file changed, 11 insertions(+)

Comments

Leon Romanovsky June 2, 2024, 8:19 a.m. UTC | #1
On Sun, May 26, 2024 at 11:31:25AM +0300, Sagi Grimberg wrote:
> ib_drain_qp does not do drain a shared recv queue (because it is
> shared). However in the absence of any guarantees that the recv
> completions were consumed, the ulp can reference these completions
> after draining the qp and freeing its associated resources, which
> is a uaf [1].
> 
> We cannot drain a srq like a normal rq, however in ib_drain_qp
> once the qp moved to error state, we reap the recv_cq once in
> order to prevent consumption of recv completions after the drain.
> 
> [1]:
> --
> [199856.569999] Unable to handle kernel paging request at virtual address 002248778adfd6d0
> <....>
> [199856.721701] Workqueue: ib-comp-wq ib_cq_poll_work [ib_core]
> <....>
> [199856.827281] Call trace:
> [199856.829847]  nvmet_parse_admin_cmd+0x34/0x178 [nvmet]
> [199856.835007]  nvmet_req_init+0x2e0/0x378 [nvmet]
> [199856.839640]  nvmet_rdma_handle_command+0xa4/0x2e8 [nvmet_rdma]
> [199856.845575]  nvmet_rdma_recv_done+0xcc/0x240 [nvmet_rdma]
> [199856.851109]  __ib_process_cq+0x84/0xf0 [ib_core]
> [199856.855858]  ib_cq_poll_work+0x34/0xa0 [ib_core]
> [199856.860587]  process_one_work+0x1ec/0x4a0
> [199856.864694]  worker_thread+0x48/0x490
> [199856.868453]  kthread+0x158/0x160
> [199856.871779]  ret_from_fork+0x10/0x18
> --
> 
> Signed-off-by: Sagi Grimberg <sagi@grimberg.me>
> ---
> Note this patch is not yet tested, but sending it for visibility and
> early feedback. While nothing prevents ib_drain_cq to process a cq
> directly (even if it has another context) I am not convinced if all
> the upper layers don't have any assumptions about a single context
> consuming the cq, even if it is while it is drained. It is also
> possible to to add ib_reap_cq that fences the cq poll context before
> reaping the cq, but this may have other side-effects.

Did you have a chance to test this patch?
I looked at the code and it seems to be correct change, but I also don't
know about all ULP assumptions.

Thanks

> 
> This crash was seen in the wild, and not easy to reproduce. I suspect
> that moving the nvmet_wq to be unbound expedited the teardown process
> exposing a possible uaf for srq attached qps (which nvmet-rdma has a
> mode of using).
> 
> 
>  drivers/infiniband/core/verbs.c | 11 +++++++++++
>  1 file changed, 11 insertions(+)
> 
> diff --git a/drivers/infiniband/core/verbs.c b/drivers/infiniband/core/verbs.c
> index 94a7f3b0c71c..580e9019e96a 100644
> --- a/drivers/infiniband/core/verbs.c
> +++ b/drivers/infiniband/core/verbs.c
> @@ -2962,6 +2962,17 @@ void ib_drain_qp(struct ib_qp *qp)
>  	ib_drain_sq(qp);
>  	if (!qp->srq)
>  		ib_drain_rq(qp);
> +	else {
> +		/*
> +		 * We cannot drain a srq, however the qp is in error state,
> +		 * and will not generate new recv completions, hence it should
> +		 * be enough to reap the recv cq to cleanup any recv completions
> +		 * that may have placed before we drained. Without this nothing
> +		 * guarantees that the ulp will free resources and only then
> +		 * consume the recv completion.
> +		 */
> +		ib_process_cq_direct(qp->recv_cq, -1);
> +	}
>  }
>  EXPORT_SYMBOL(ib_drain_qp);
>  
> -- 
> 2.40.1
> 
>
Sagi Grimberg June 2, 2024, 8:53 a.m. UTC | #2
On 02/06/2024 11:19, Leon Romanovsky wrote:
> On Sun, May 26, 2024 at 11:31:25AM +0300, Sagi Grimberg wrote:
>> ib_drain_qp does not do drain a shared recv queue (because it is
>> shared). However in the absence of any guarantees that the recv
>> completions were consumed, the ulp can reference these completions
>> after draining the qp and freeing its associated resources, which
>> is a uaf [1].
>>
>> We cannot drain a srq like a normal rq, however in ib_drain_qp
>> once the qp moved to error state, we reap the recv_cq once in
>> order to prevent consumption of recv completions after the drain.
>>
>> [1]:
>> --
>> [199856.569999] Unable to handle kernel paging request at virtual address 002248778adfd6d0
>> <....>
>> [199856.721701] Workqueue: ib-comp-wq ib_cq_poll_work [ib_core]
>> <....>
>> [199856.827281] Call trace:
>> [199856.829847]  nvmet_parse_admin_cmd+0x34/0x178 [nvmet]
>> [199856.835007]  nvmet_req_init+0x2e0/0x378 [nvmet]
>> [199856.839640]  nvmet_rdma_handle_command+0xa4/0x2e8 [nvmet_rdma]
>> [199856.845575]  nvmet_rdma_recv_done+0xcc/0x240 [nvmet_rdma]
>> [199856.851109]  __ib_process_cq+0x84/0xf0 [ib_core]
>> [199856.855858]  ib_cq_poll_work+0x34/0xa0 [ib_core]
>> [199856.860587]  process_one_work+0x1ec/0x4a0
>> [199856.864694]  worker_thread+0x48/0x490
>> [199856.868453]  kthread+0x158/0x160
>> [199856.871779]  ret_from_fork+0x10/0x18
>> --
>>
>> Signed-off-by: Sagi Grimberg <sagi@grimberg.me>
>> ---
>> Note this patch is not yet tested, but sending it for visibility and
>> early feedback. While nothing prevents ib_drain_cq to process a cq
>> directly (even if it has another context) I am not convinced if all
>> the upper layers don't have any assumptions about a single context
>> consuming the cq, even if it is while it is drained. It is also
>> possible to to add ib_reap_cq that fences the cq poll context before
>> reaping the cq, but this may have other side-effects.
> Did you have a chance to test this patch?
> I looked at the code and it seems to be correct change, but I also don't
> know about all ULP assumptions.

Not yet...

One thing that is problematic with this patch, though, is that there is
no stopping condition on the direct poll. So if the CQ is shared among a
number of qps (and srq's), nothing prevents the polling from consuming
completions forever...

So we probably need it to be:
--
diff --git a/drivers/infiniband/core/verbs.c b/drivers/infiniband/core/verbs.c
index 580e9019e96a..f411fef35938 100644
--- a/drivers/infiniband/core/verbs.c
+++ b/drivers/infiniband/core/verbs.c
@@ -2971,7 +2971,7 @@ void ib_drain_qp(struct ib_qp *qp)
 		 * guarantees that the ulp will free resources and only then
 		 * consume the recv completion.
 		 */
-		ib_process_cq_direct(qp->recv_cq, -1);
+		ib_process_cq_direct(qp->recv_cq, qp->recv_cq->cqe);
 	}
 }
 EXPORT_SYMBOL(ib_drain_qp);
--

This can also be overkill, because a shared CQ can be considerably
larger than the QP size. But at least it is finite.
Max Gurtovoy June 2, 2024, 11:43 a.m. UTC | #3
Hi Sagi,

On 02/06/2024 11:53, Sagi Grimberg wrote:
>
>
> On 02/06/2024 11:19, Leon Romanovsky wrote:
>> On Sun, May 26, 2024 at 11:31:25AM +0300, Sagi Grimberg wrote:
>>> ib_drain_qp does not do drain a shared recv queue (because it is
>>> shared). However in the absence of any guarantees that the recv
>>> completions were consumed, the ulp can reference these completions
>>> after draining the qp and freeing its associated resources, which
>>> is a uaf [1].
>>>
>>> We cannot drain a srq like a normal rq, however in ib_drain_qp
>>> once the qp moved to error state, we reap the recv_cq once in
>>> order to prevent consumption of recv completions after the drain.
>>>
>>> [1]:
>>> -- 
>>> [199856.569999] Unable to handle kernel paging request at virtual 
>>> address 002248778adfd6d0
>>> <....>
>>> [199856.721701] Workqueue: ib-comp-wq ib_cq_poll_work [ib_core]
>>> <....>
>>> [199856.827281] Call trace:
>>> [199856.829847]  nvmet_parse_admin_cmd+0x34/0x178 [nvmet]
>>> [199856.835007]  nvmet_req_init+0x2e0/0x378 [nvmet]
>>> [199856.839640]  nvmet_rdma_handle_command+0xa4/0x2e8 [nvmet_rdma]
>>> [199856.845575]  nvmet_rdma_recv_done+0xcc/0x240 [nvmet_rdma]
>>> [199856.851109]  __ib_process_cq+0x84/0xf0 [ib_core]
>>> [199856.855858]  ib_cq_poll_work+0x34/0xa0 [ib_core]
>>> [199856.860587]  process_one_work+0x1ec/0x4a0
>>> [199856.864694]  worker_thread+0x48/0x490
>>> [199856.868453]  kthread+0x158/0x160
>>> [199856.871779]  ret_from_fork+0x10/0x18
>>> -- 
>>>
>>> Signed-off-by: Sagi Grimberg <sagi@grimberg.me>
>>> ---
>>> Note this patch is not yet tested, but sending it for visibility and
>>> early feedback. While nothing prevents ib_drain_cq to process a cq
>>> directly (even if it has another context) I am not convinced if all
>>> the upper layers don't have any assumptions about a single context
>>> consuming the cq, even if it is while it is drained. It is also
>>> possible to to add ib_reap_cq that fences the cq poll context before
>>> reaping the cq, but this may have other side-effects.
>> Did you have a chance to test this patch?
>> I looked at the code and it seems to be correct change, but I also don't
>> know about all ULP assumptions.
>
> Not yet...
>
> One thing that is problematic with this patch though is that there is no
> stopping condition to the direct poll. So if the CQ is shared among a 
> number of
> qps (and srq's), nothing prevents the polling from consume completions 
> forever...
>
> So we probably need it to be:
> -- 
> diff --git a/drivers/infiniband/core/verbs.c 
> b/drivers/infiniband/core/verbs.c
> index 580e9019e96a..f411fef35938 100644
> --- a/drivers/infiniband/core/verbs.c
> +++ b/drivers/infiniband/core/verbs.c
> @@ -2971,7 +2971,7 @@ void ib_drain_qp(struct ib_qp *qp)
>                  * guarantees that the ulp will free resources and 
> only then
>                  * consume the recv completion.
>                  */
> -               ib_process_cq_direct(qp->recv_cq, -1);
> +               ib_process_cq_direct(qp->recv_cq, qp->recv_cq->cqe);

I tried to fix this problem a few years ago in [1], but eventually the
patchset was not accepted.

We should probably try to improve the original series (using cq->cqe
instead of -1, for example) and upstream it.

[1] 
https://lore.kernel.org/all/1516197178-26493-1-git-send-email-maxg@mellanox.com/

> }
>  }
>  EXPORT_SYMBOL(ib_drain_qp);
> -- 
>
> While this can also be an overkill, because a shared CQ can be 
> considerably
> larger the the QP size. But at least it is finite.
Sagi Grimberg June 2, 2024, 1:08 p.m. UTC | #4
On 02/06/2024 14:43, Max Gurtovoy wrote:
> Hi Sagi,
>
> On 02/06/2024 11:53, Sagi Grimberg wrote:
>>
>>
>> On 02/06/2024 11:19, Leon Romanovsky wrote:
>>> On Sun, May 26, 2024 at 11:31:25AM +0300, Sagi Grimberg wrote:
>>>> ib_drain_qp does not do drain a shared recv queue (because it is
>>>> shared). However in the absence of any guarantees that the recv
>>>> completions were consumed, the ulp can reference these completions
>>>> after draining the qp and freeing its associated resources, which
>>>> is a uaf [1].
>>>>
>>>> We cannot drain a srq like a normal rq, however in ib_drain_qp
>>>> once the qp moved to error state, we reap the recv_cq once in
>>>> order to prevent consumption of recv completions after the drain.
>>>>
>>>> [1]:
>>>> -- 
>>>> [199856.569999] Unable to handle kernel paging request at virtual 
>>>> address 002248778adfd6d0
>>>> <....>
>>>> [199856.721701] Workqueue: ib-comp-wq ib_cq_poll_work [ib_core]
>>>> <....>
>>>> [199856.827281] Call trace:
>>>> [199856.829847]  nvmet_parse_admin_cmd+0x34/0x178 [nvmet]
>>>> [199856.835007]  nvmet_req_init+0x2e0/0x378 [nvmet]
>>>> [199856.839640]  nvmet_rdma_handle_command+0xa4/0x2e8 [nvmet_rdma]
>>>> [199856.845575]  nvmet_rdma_recv_done+0xcc/0x240 [nvmet_rdma]
>>>> [199856.851109]  __ib_process_cq+0x84/0xf0 [ib_core]
>>>> [199856.855858]  ib_cq_poll_work+0x34/0xa0 [ib_core]
>>>> [199856.860587]  process_one_work+0x1ec/0x4a0
>>>> [199856.864694]  worker_thread+0x48/0x490
>>>> [199856.868453]  kthread+0x158/0x160
>>>> [199856.871779]  ret_from_fork+0x10/0x18
>>>> -- 
>>>>
>>>> Signed-off-by: Sagi Grimberg <sagi@grimberg.me>
>>>> ---
>>>> Note this patch is not yet tested, but sending it for visibility and
>>>> early feedback. While nothing prevents ib_drain_cq to process a cq
>>>> directly (even if it has another context) I am not convinced if all
>>>> the upper layers don't have any assumptions about a single context
>>>> consuming the cq, even if it is while it is drained. It is also
>>>> possible to to add ib_reap_cq that fences the cq poll context before
>>>> reaping the cq, but this may have other side-effects.
>>> Did you have a chance to test this patch?
>>> I looked at the code and it seems to be correct change, but I also 
>>> don't
>>> know about all ULP assumptions.
>>
>> Not yet...
>>
>> One thing that is problematic with this patch though is that there is no
>> stopping condition to the direct poll. So if the CQ is shared among a 
>> number of
>> qps (and srq's), nothing prevents the polling from consume 
>> completions forever...
>>
>> So we probably need it to be:
>> -- 
>> diff --git a/drivers/infiniband/core/verbs.c 
>> b/drivers/infiniband/core/verbs.c
>> index 580e9019e96a..f411fef35938 100644
>> --- a/drivers/infiniband/core/verbs.c
>> +++ b/drivers/infiniband/core/verbs.c
>> @@ -2971,7 +2971,7 @@ void ib_drain_qp(struct ib_qp *qp)
>>                  * guarantees that the ulp will free resources and 
>> only then
>>                  * consume the recv completion.
>>                  */
>> -               ib_process_cq_direct(qp->recv_cq, -1);
>> +               ib_process_cq_direct(qp->recv_cq, qp->recv_cq->cqe);
>
> I tried to fix the problem few years ago in [1] but eventually the 
> patchset was not accepted.
>
> We should probably try to improve the original series (using cq->cqe 
> instead of -1 for example) and upstream it.
>
> [1] 
> https://lore.kernel.org/all/1516197178-26493-1-git-send-email-maxg@mellanox.com/

Yes. I'd like to know whether the qp event is needed, though: given that
the qp is already in the error state (from the sq drain) and we then run
a single pass over the recv_cq, is it possible to miss any completions?

I would like to resolve the drain without involving the ulp drivers.

If so, perhaps we can intercept the qp event in the core, before sending
it to the ulp, and complete the srq completion there.
Sagi Grimberg June 6, 2024, 10:13 a.m. UTC | #5
On 02/06/2024 16:08, Sagi Grimberg wrote:
>
>
> On 02/06/2024 14:43, Max Gurtovoy wrote:
>> Hi Sagi,
>>
>> On 02/06/2024 11:53, Sagi Grimberg wrote:
>>>
>>>
>>> On 02/06/2024 11:19, Leon Romanovsky wrote:
>>>> On Sun, May 26, 2024 at 11:31:25AM +0300, Sagi Grimberg wrote:
>>>>> ib_drain_qp does not do drain a shared recv queue (because it is
>>>>> shared). However in the absence of any guarantees that the recv
>>>>> completions were consumed, the ulp can reference these completions
>>>>> after draining the qp and freeing its associated resources, which
>>>>> is a uaf [1].
>>>>>
>>>>> We cannot drain a srq like a normal rq, however in ib_drain_qp
>>>>> once the qp moved to error state, we reap the recv_cq once in
>>>>> order to prevent consumption of recv completions after the drain.
>>>>>
>>>>> [1]:
>>>>> -- 
>>>>> [199856.569999] Unable to handle kernel paging request at virtual 
>>>>> address 002248778adfd6d0
>>>>> <....>
>>>>> [199856.721701] Workqueue: ib-comp-wq ib_cq_poll_work [ib_core]
>>>>> <....>
>>>>> [199856.827281] Call trace:
>>>>> [199856.829847]  nvmet_parse_admin_cmd+0x34/0x178 [nvmet]
>>>>> [199856.835007]  nvmet_req_init+0x2e0/0x378 [nvmet]
>>>>> [199856.839640]  nvmet_rdma_handle_command+0xa4/0x2e8 [nvmet_rdma]
>>>>> [199856.845575]  nvmet_rdma_recv_done+0xcc/0x240 [nvmet_rdma]
>>>>> [199856.851109]  __ib_process_cq+0x84/0xf0 [ib_core]
>>>>> [199856.855858]  ib_cq_poll_work+0x34/0xa0 [ib_core]
>>>>> [199856.860587]  process_one_work+0x1ec/0x4a0
>>>>> [199856.864694]  worker_thread+0x48/0x490
>>>>> [199856.868453]  kthread+0x158/0x160
>>>>> [199856.871779]  ret_from_fork+0x10/0x18
>>>>> -- 
>>>>>
>>>>> Signed-off-by: Sagi Grimberg <sagi@grimberg.me>
>>>>> ---
>>>>> Note this patch is not yet tested, but sending it for visibility and
>>>>> early feedback. While nothing prevents ib_drain_cq to process a cq
>>>>> directly (even if it has another context) I am not convinced if all
>>>>> the upper layers don't have any assumptions about a single context
>>>>> consuming the cq, even if it is while it is drained. It is also
>>>>> possible to to add ib_reap_cq that fences the cq poll context before
>>>>> reaping the cq, but this may have other side-effects.
>>>> Did you have a chance to test this patch?
>>>> I looked at the code and it seems to be correct change, but I also 
>>>> don't
>>>> know about all ULP assumptions.
>>>
>>> Not yet...
>>>
>>> One thing that is problematic with this patch though is that there 
>>> is no
>>> stopping condition to the direct poll. So if the CQ is shared among 
>>> a number of
>>> qps (and srq's), nothing prevents the polling from consume 
>>> completions forever...
>>>
>>> So we probably need it to be:
>>> -- 
>>> diff --git a/drivers/infiniband/core/verbs.c 
>>> b/drivers/infiniband/core/verbs.c
>>> index 580e9019e96a..f411fef35938 100644
>>> --- a/drivers/infiniband/core/verbs.c
>>> +++ b/drivers/infiniband/core/verbs.c
>>> @@ -2971,7 +2971,7 @@ void ib_drain_qp(struct ib_qp *qp)
>>>                  * guarantees that the ulp will free resources and 
>>> only then
>>>                  * consume the recv completion.
>>>                  */
>>> -               ib_process_cq_direct(qp->recv_cq, -1);
>>> +               ib_process_cq_direct(qp->recv_cq, qp->recv_cq->cqe);
>>
>> I tried to fix the problem few years ago in [1] but eventually the 
>> patchset was not accepted.
>>
>> We should probably try to improve the original series (using cq->cqe 
>> instead of -1 for example) and upstream it.
>>
>> [1] 
>> https://lore.kernel.org/all/1516197178-26493-1-git-send-email-maxg@mellanox.com/
>
> Yes. I'd like to know if the qp event is needed though, given that the 
> qp is already in error state (from the sq drain)
> and then we run a single pass over the recv_cq, is it possible to miss 
> any completions.
>
> Would like to resolve the drain without involving the ulp drivers.
>
> If so, then perhaps we can intercept the qp event at the core, before 
> sending it to the ulp and there we can complete
> the srq completion.

Ping?
Max Gurtovoy June 6, 2024, 11:30 a.m. UTC | #6
On 06/06/2024 13:13, Sagi Grimberg wrote:
>
>
> On 02/06/2024 16:08, Sagi Grimberg wrote:
>>
>>
>> On 02/06/2024 14:43, Max Gurtovoy wrote:
>>> Hi Sagi,
>>>
>>> On 02/06/2024 11:53, Sagi Grimberg wrote:
>>>>
>>>>
>>>> On 02/06/2024 11:19, Leon Romanovsky wrote:
>>>>> On Sun, May 26, 2024 at 11:31:25AM +0300, Sagi Grimberg wrote:
>>>>>> ib_drain_qp does not do drain a shared recv queue (because it is
>>>>>> shared). However in the absence of any guarantees that the recv
>>>>>> completions were consumed, the ulp can reference these completions
>>>>>> after draining the qp and freeing its associated resources, which
>>>>>> is a uaf [1].
>>>>>>
>>>>>> We cannot drain a srq like a normal rq, however in ib_drain_qp
>>>>>> once the qp moved to error state, we reap the recv_cq once in
>>>>>> order to prevent consumption of recv completions after the drain.
>>>>>>
>>>>>> [1]:
>>>>>> -- 
>>>>>> [199856.569999] Unable to handle kernel paging request at virtual 
>>>>>> address 002248778adfd6d0
>>>>>> <....>
>>>>>> [199856.721701] Workqueue: ib-comp-wq ib_cq_poll_work [ib_core]
>>>>>> <....>
>>>>>> [199856.827281] Call trace:
>>>>>> [199856.829847]  nvmet_parse_admin_cmd+0x34/0x178 [nvmet]
>>>>>> [199856.835007]  nvmet_req_init+0x2e0/0x378 [nvmet]
>>>>>> [199856.839640]  nvmet_rdma_handle_command+0xa4/0x2e8 [nvmet_rdma]
>>>>>> [199856.845575]  nvmet_rdma_recv_done+0xcc/0x240 [nvmet_rdma]
>>>>>> [199856.851109]  __ib_process_cq+0x84/0xf0 [ib_core]
>>>>>> [199856.855858]  ib_cq_poll_work+0x34/0xa0 [ib_core]
>>>>>> [199856.860587]  process_one_work+0x1ec/0x4a0
>>>>>> [199856.864694]  worker_thread+0x48/0x490
>>>>>> [199856.868453]  kthread+0x158/0x160
>>>>>> [199856.871779]  ret_from_fork+0x10/0x18
>>>>>> -- 
>>>>>>
>>>>>> Signed-off-by: Sagi Grimberg <sagi@grimberg.me>
>>>>>> ---
>>>>>> Note this patch is not yet tested, but sending it for visibility and
>>>>>> early feedback. While nothing prevents ib_drain_cq to process a cq
>>>>>> directly (even if it has another context) I am not convinced if all
>>>>>> the upper layers don't have any assumptions about a single context
>>>>>> consuming the cq, even if it is while it is drained. It is also
>>>>>> possible to to add ib_reap_cq that fences the cq poll context before
>>>>>> reaping the cq, but this may have other side-effects.
>>>>> Did you have a chance to test this patch?
>>>>> I looked at the code and it seems to be correct change, but I also 
>>>>> don't
>>>>> know about all ULP assumptions.
>>>>
>>>> Not yet...
>>>>
>>>> One thing that is problematic with this patch though is that there 
>>>> is no
>>>> stopping condition to the direct poll. So if the CQ is shared among 
>>>> a number of
>>>> qps (and srq's), nothing prevents the polling from consume 
>>>> completions forever...
>>>>
>>>> So we probably need it to be:
>>>> -- 
>>>> diff --git a/drivers/infiniband/core/verbs.c 
>>>> b/drivers/infiniband/core/verbs.c
>>>> index 580e9019e96a..f411fef35938 100644
>>>> --- a/drivers/infiniband/core/verbs.c
>>>> +++ b/drivers/infiniband/core/verbs.c
>>>> @@ -2971,7 +2971,7 @@ void ib_drain_qp(struct ib_qp *qp)
>>>>                  * guarantees that the ulp will free resources and 
>>>> only then
>>>>                  * consume the recv completion.
>>>>                  */
>>>> -               ib_process_cq_direct(qp->recv_cq, -1);
>>>> +               ib_process_cq_direct(qp->recv_cq, qp->recv_cq->cqe);
>>>
>>> I tried to fix the problem few years ago in [1] but eventually the 
>>> patchset was not accepted.
>>>
>>> We should probably try to improve the original series (using cq->cqe 
>>> instead of -1 for example) and upstream it.
>>>
>>> [1] 
>>> https://lore.kernel.org/all/1516197178-26493-1-git-send-email-maxg@mellanox.com/
>>
>> Yes. I'd like to know if the qp event is needed though, given that 
>> the qp is already in error state (from the sq drain)
>> and then we run a single pass over the recv_cq, is it possible to 
>> miss any completions.
>>
>> Would like to resolve the drain without involving the ulp drivers.
>>
>> If so, then perhaps we can intercept the qp event at the core, before 
>> sending it to the ulp and there we can complete
>> the srq completion.
>
> Ping?

I think that catching the IB_EVENT_QP_LAST_WQE_REACHED event in the core
and calling complete(&qp->srq_completion) is possible, but it will
require some event processing in the ib core.

Probably my patch [1/4] from
https://lore.kernel.org/all/1516197178-26493-2-git-send-email-maxg@mellanox.com/#r
should be updated with this logic.

Patch

diff --git a/drivers/infiniband/core/verbs.c b/drivers/infiniband/core/verbs.c
index 94a7f3b0c71c..580e9019e96a 100644
--- a/drivers/infiniband/core/verbs.c
+++ b/drivers/infiniband/core/verbs.c
@@ -2962,6 +2962,17 @@  void ib_drain_qp(struct ib_qp *qp)
 	ib_drain_sq(qp);
 	if (!qp->srq)
 		ib_drain_rq(qp);
+	else {
+		/*
+		 * We cannot drain a srq, however the qp is in error state,
+		 * and will not generate new recv completions, hence it should
+		 * be enough to reap the recv cq to cleanup any recv completions
+		 * that may have placed before we drained. Without this nothing
+		 * guarantees that the ulp will free resources and only then
+		 * consume the recv completion.
+		 */
+		ib_process_cq_direct(qp->recv_cq, -1);
+	}
 }
 EXPORT_SYMBOL(ib_drain_qp);