
[2/2] RDMA/rxe: Handle remote errors in the midst of a Read reply sequence

Message ID 20221013014724.3786212-2-matsuda-daisuke@fujitsu.com
State Accepted
Delegated to: Jason Gunthorpe
Series [1/2] RDMA/rxe: Make responder handle RDMA Read failures

Commit Message

Daisuke Matsuda (Fujitsu) Oct. 13, 2022, 1:47 a.m. UTC
Requesting nodes do not handle a reported error correctly if it is
generated in the middle of multi-packet Read responses, and the node tries
to resend the request endlessly. Let completer terminate the connection in
that case.

Signed-off-by: Daisuke Matsuda <matsuda-daisuke@fujitsu.com>
---
FOR REVIEWERS:
  I referred to IB Specification Vol 1-Revision-1.5 to make this patch.
  Please see Ch.9.9.2.2, Ch.9.9.2.4.2 and Table 59.

 drivers/infiniband/sw/rxe/rxe_comp.c | 8 ++++++++
 1 file changed, 8 insertions(+)

Comments

Li Zhijian Oct. 13, 2022, 5:36 a.m. UTC | #1
On 13/10/2022 09:47, Daisuke Matsuda wrote:
> Requesting nodes do not handle a reported error correctly if it is
> generated in the middle of multi-packet Read responses, and the node tries
> to resend the request endlessly. Let completer terminate the connection in
> that case.
>
> Signed-off-by: Daisuke Matsuda <matsuda-daisuke@fujitsu.com>
> ---
> FOR REVIEWERS:
>    I referred to IB Specification Vol 1-Revision-1.5 to make this patch.
>    Please see Ch.9.9.2.2, Ch.9.9.2.4.2 and Table 59.
>
>   drivers/infiniband/sw/rxe/rxe_comp.c | 8 ++++++++
>   1 file changed, 8 insertions(+)
>
> diff --git a/drivers/infiniband/sw/rxe/rxe_comp.c b/drivers/infiniband/sw/rxe/rxe_comp.c
> index fb0c008af78c..c9170dd99f3a 100644
> --- a/drivers/infiniband/sw/rxe/rxe_comp.c
> +++ b/drivers/infiniband/sw/rxe/rxe_comp.c
> @@ -200,6 +200,10 @@ static inline enum comp_state check_psn(struct rxe_qp *qp,
>   		 */
>   		if (pkt->psn == wqe->last_psn)
>   			return COMPST_COMP_ACK;
> +		else if (pkt->opcode == IB_OPCODE_RC_ACKNOWLEDGE &&
> +			 (qp->comp.opcode == IB_OPCODE_RC_RDMA_READ_RESPONSE_FIRST ||
> +			  qp->comp.opcode == IB_OPCODE_RC_RDMA_READ_RESPONSE_MIDDLE))
When will IB_OPCODE_RC_RDMA_READ_RESPONSE_FIRST or IB_OPCODE_RC_RDMA_READ_RESPONSE_MIDDLE be assigned to qp->comp.opcode?
I wonder if "(pkt->opcode == IB_OPCODE_RC_ACKNOWLEDGE)" alone would be enough?

Thanks
Zhijian


> +			return COMPST_CHECK_ACK;
>   		else
>   			return COMPST_DONE;
>   	} else if ((diff > 0) && (wqe->mask & WR_ATOMIC_OR_READ_MASK)) {
> @@ -228,6 +232,10 @@ static inline enum comp_state check_ack(struct rxe_qp *qp,
>   
>   	case IB_OPCODE_RC_RDMA_READ_RESPONSE_FIRST:
>   	case IB_OPCODE_RC_RDMA_READ_RESPONSE_MIDDLE:
> +		/* Check NAK code to handle a remote error */
> +		if (pkt->opcode == IB_OPCODE_RC_ACKNOWLEDGE)
> +			break;
> +
>   		if (pkt->opcode != IB_OPCODE_RC_RDMA_READ_RESPONSE_MIDDLE &&
>   		    pkt->opcode != IB_OPCODE_RC_RDMA_READ_RESPONSE_LAST) {
>   			/* read retries of partial data may restart from
Daisuke Matsuda (Fujitsu) Oct. 14, 2022, 2:35 a.m. UTC | #2
On Thu, Oct 13, 2022 2:36 PM Li Zhijian wrote:
> On 13/10/2022 09:47, Daisuke Matsuda wrote:
> > Requesting nodes do not handle a reported error correctly if it is
> > generated in the middle of multi-packet Read responses, and the node tries
> > to resend the request endlessly. Let completer terminate the connection in
> > that case.
> >
> > Signed-off-by: Daisuke Matsuda <matsuda-daisuke@fujitsu.com>
> > ---
> > FOR REVIEWERS:
> >    I referred to IB Specification Vol 1-Revision-1.5 to make this patch.
> >    Please see Ch.9.9.2.2, Ch.9.9.2.4.2 and Table 59.
> >
> >   drivers/infiniband/sw/rxe/rxe_comp.c | 8 ++++++++
> >   1 file changed, 8 insertions(+)
> >
> > diff --git a/drivers/infiniband/sw/rxe/rxe_comp.c b/drivers/infiniband/sw/rxe/rxe_comp.c
> > index fb0c008af78c..c9170dd99f3a 100644
> > --- a/drivers/infiniband/sw/rxe/rxe_comp.c
> > +++ b/drivers/infiniband/sw/rxe/rxe_comp.c
> > @@ -200,6 +200,10 @@ static inline enum comp_state check_psn(struct rxe_qp *qp,
> >   		 */
> >   		if (pkt->psn == wqe->last_psn)
> >   			return COMPST_COMP_ACK;
> > +		else if (pkt->opcode == IB_OPCODE_RC_ACKNOWLEDGE &&
> > +			 (qp->comp.opcode == IB_OPCODE_RC_RDMA_READ_RESPONSE_FIRST ||
> > +			  qp->comp.opcode == IB_OPCODE_RC_RDMA_READ_RESPONSE_MIDDLE))

Hi Zhijian,
Thank you for taking a look.

> When will IB_OPCODE_RC_RDMA_READ_RESPONSE_FIRST or IB_OPCODE_RC_RDMA_READ_RESPONSE_MIDDLE
> be assigned to qp->comp.opcode?
This happens in rxe_completer(). Upon receiving a Read reply, 'state' is changed in the following order: 
[  101.215593] rdma_rxe: qp#17 state = GET ACK
[  101.216174] rdma_rxe: qp#17 state = GET WQE
[  101.216776] rdma_rxe: qp#17 state = CHECK PSN
[  101.217384] rdma_rxe: qp#17 state = CHECK ACK
[  101.218008] rdma_rxe: qp#17 state = READ
[  101.218556] rdma_rxe: qp#17 state = UPDATE COMP
[  101.219188] rdma_rxe: qp#17 state = DONE
'qp->comp.opcode' is filled at COMPST_UPDATE_COMP, and the value is retained
until it is overridden by the next incoming reply.
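
For reference, qp->comp.opcode is updated in the COMPST_UPDATE_COMP arm of
rxe_completer(), roughly like this (a simplified sketch from memory, not a
verbatim quote of rxe_comp.c):

	case COMPST_UPDATE_COMP:
		/* remember the opcode of the reply just consumed; check_psn()
		 * and check_ack() use it to tell whether we are still in the
		 * middle of a multi-packet Read response
		 */
		if (pkt->mask & RXE_END_MASK)
			qp->comp.opcode = -1;		/* reply sequence finished */
		else
			qp->comp.opcode = pkt->opcode;	/* e.g. ..._RESPONSE_FIRST/MIDDLE */

So qp->comp.opcode holds IB_OPCODE_RC_RDMA_READ_RESPONSE_FIRST or _MIDDLE
exactly while a Read reply sequence is still in flight.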

> I wonder if "(pkt->opcode == IB_OPCODE_RC_ACKNOWLEDGE)" alone would be enough?
I am not sure that is better. I agree your suggestion is correct within the
IB transport layer, but RoCEv2 traffic is encapsulated in UDP/IP headers.
We may need to take into account that UDP does not guarantee in-order
delivery of packets.

For example, the responder can return coalesced ACKs for RDMA Write requests.
If ACK packets are reordered in transit, a packet with an older PSN can
arrive later, and it should probably be discarded by returning COMPST_DONE here.

If we only check "(pkt->opcode == IB_OPCODE_RC_ACKNOWLEDGE)",
I am afraid such reordered ACKs may go down the wrong path. This is
speculation on my part, so please let me know if it is wrong.
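
"Older PSN" here is in the sense of psn_compare(), which does a
wraparound-aware comparison of the 24-bit PSN space; as far as I remember it
looks like this in rxe_verbs.h (shown only for illustration):

	/* returns > 0 if psn_a is newer than psn_b, < 0 if it is older,
	 * taking 24-bit PSN wraparound into account
	 */
	static inline int psn_compare(u32 psn_a, u32 psn_b)
	{
		s32 diff;

		diff = (psn_a - psn_b) << 8;
		return diff;
	}

A reordered ACK whose PSN compares older than qp->comp.psn lands in the
diff < 0 branch patched above, so the extra qp->comp.opcode check keeps such
stale ACKs on the COMPST_DONE path.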

Thanks,
Daisuke

> 
> Thanks
> Zhijian
> 
> 
> > +			return COMPST_CHECK_ACK;
> >   		else
> >   			return COMPST_DONE;
> >   	} else if ((diff > 0) && (wqe->mask & WR_ATOMIC_OR_READ_MASK)) {
> > @@ -228,6 +232,10 @@ static inline enum comp_state check_ack(struct rxe_qp *qp,
> >
> >   	case IB_OPCODE_RC_RDMA_READ_RESPONSE_FIRST:
> >   	case IB_OPCODE_RC_RDMA_READ_RESPONSE_MIDDLE:
> > +		/* Check NAK code to handle a remote error */
> > +		if (pkt->opcode == IB_OPCODE_RC_ACKNOWLEDGE)
> > +			break;
> > +
> >   		if (pkt->opcode != IB_OPCODE_RC_RDMA_READ_RESPONSE_MIDDLE &&
> >   		    pkt->opcode != IB_OPCODE_RC_RDMA_READ_RESPONSE_LAST) {
> >   			/* read retries of partial data may restart from
Li Zhijian Oct. 15, 2022, 2:32 a.m. UTC | #3
On 14/10/2022 10:35, Matsuda, Daisuke/松田 大輔 wrote:
> On Thu, Oct 13, 2022 2:36 PM Li Zhijian wrote:
>> On 13/10/2022 09:47, Daisuke Matsuda wrote:
>>> Requesting nodes do not handle a reported error correctly if it is
>>> generated in the middle of multi-packet Read responses, and the node tries
>>> to resend the request endlessly. Let completer terminate the connection in
>>> that case.
>>>
>>> Signed-off-by: Daisuke Matsuda <matsuda-daisuke@fujitsu.com>
>>> ---
>>> FOR REVIEWERS:
>>>     I referred to IB Specification Vol 1-Revision-1.5 to make this patch.
>>>     Please see Ch.9.9.2.2, Ch.9.9.2.4.2 and Table 59.
>>>
>>>    drivers/infiniband/sw/rxe/rxe_comp.c | 8 ++++++++
>>>    1 file changed, 8 insertions(+)
>>>
>>> diff --git a/drivers/infiniband/sw/rxe/rxe_comp.c b/drivers/infiniband/sw/rxe/rxe_comp.c
>>> index fb0c008af78c..c9170dd99f3a 100644
>>> --- a/drivers/infiniband/sw/rxe/rxe_comp.c
>>> +++ b/drivers/infiniband/sw/rxe/rxe_comp.c
>>> @@ -200,6 +200,10 @@ static inline enum comp_state check_psn(struct rxe_qp *qp,
>>>    		 */
>>>    		if (pkt->psn == wqe->last_psn)
>>>    			return COMPST_COMP_ACK;
>>> +		else if (pkt->opcode == IB_OPCODE_RC_ACKNOWLEDGE &&
>>> +			 (qp->comp.opcode == IB_OPCODE_RC_RDMA_READ_RESPONSE_FIRST ||
>>> +			  qp->comp.opcode == IB_OPCODE_RC_RDMA_READ_RESPONSE_MIDDLE))
> Hi Zhijian,
> Thank you for taking a look.
>
>> When will IB_OPCODE_RC_RDMA_READ_RESPONSE_FIRST or IB_OPCODE_RC_RDMA_READ_RESPONSE_MIDDLE
>> be assigned to qp->comp.opcode?
> This happens in rxe_completer(). Upon receiving a Read reply, 'state' is changed in the following order:
> [  101.215593] rdma_rxe: qp#17 state = GET ACK
> [  101.216174] rdma_rxe: qp#17 state = GET WQE
> [  101.216776] rdma_rxe: qp#17 state = CHECK PSN
> [  101.217384] rdma_rxe: qp#17 state = CHECK ACK
> [  101.218008] rdma_rxe: qp#17 state = READ
> [  101.218556] rdma_rxe: qp#17 state = UPDATE COMP
> [  101.219188] rdma_rxe: qp#17 state = DONE
> 'qp->comp.opcode' is filled at COMPST_UPDATE_COMP, and the value is retained
> until it is overridden by the next incoming reply.
Thanks for your explanation.


>
>> I wonder if "(pkt->opcode == IB_OPCODE_RC_ACKNOWLEDGE)" alone would be enough?
> I am not sure that is better. I agree your suggestion is correct within the
> IB transport layer, but RoCEv2 traffic is encapsulated in UDP/IP headers.
> We may need to take into account that UDP does not guarantee in-order
> delivery of packets.
>
> For example, the responder can return coalesced ACKs for RDMA Write requests.
> If ACK packets are reordered in transit, a packet with an older PSN can
> arrive later, and it should probably be discarded by returning COMPST_DONE here.
>
> If we only check "(pkt->opcode == IB_OPCODE_RC_ACKNOWLEDGE)",
> I am afraid such reordered ACKs may go down the wrong path. This is
> speculation on my part, so please let me know if it is wrong.
Well, your current change fits the patch subject quite well :)

Reviewed-by: Li Zhijian <lizhijian@fujitsu.com>



>
> Thanks,
> Daisuke
>
>> Thanks
>> Zhijian
>>
>>
>>> +			return COMPST_CHECK_ACK;
>>>    		else
>>>    			return COMPST_DONE;
>>>    	} else if ((diff > 0) && (wqe->mask & WR_ATOMIC_OR_READ_MASK)) {
>>> @@ -228,6 +232,10 @@ static inline enum comp_state check_ack(struct rxe_qp *qp,
>>>
>>>    	case IB_OPCODE_RC_RDMA_READ_RESPONSE_FIRST:
>>>    	case IB_OPCODE_RC_RDMA_READ_RESPONSE_MIDDLE:
>>> +		/* Check NAK code to handle a remote error */
>>> +		if (pkt->opcode == IB_OPCODE_RC_ACKNOWLEDGE)
>>> +			break;
>>> +
>>>    		if (pkt->opcode != IB_OPCODE_RC_RDMA_READ_RESPONSE_MIDDLE &&
>>>    		    pkt->opcode != IB_OPCODE_RC_RDMA_READ_RESPONSE_LAST) {
>>>    			/* read retries of partial data may restart from

Patch

diff --git a/drivers/infiniband/sw/rxe/rxe_comp.c b/drivers/infiniband/sw/rxe/rxe_comp.c
index fb0c008af78c..c9170dd99f3a 100644
--- a/drivers/infiniband/sw/rxe/rxe_comp.c
+++ b/drivers/infiniband/sw/rxe/rxe_comp.c
@@ -200,6 +200,10 @@  static inline enum comp_state check_psn(struct rxe_qp *qp,
 		 */
 		if (pkt->psn == wqe->last_psn)
 			return COMPST_COMP_ACK;
+		else if (pkt->opcode == IB_OPCODE_RC_ACKNOWLEDGE &&
+			 (qp->comp.opcode == IB_OPCODE_RC_RDMA_READ_RESPONSE_FIRST ||
+			  qp->comp.opcode == IB_OPCODE_RC_RDMA_READ_RESPONSE_MIDDLE))
+			return COMPST_CHECK_ACK;
 		else
 			return COMPST_DONE;
 	} else if ((diff > 0) && (wqe->mask & WR_ATOMIC_OR_READ_MASK)) {
@@ -228,6 +232,10 @@  static inline enum comp_state check_ack(struct rxe_qp *qp,
 
 	case IB_OPCODE_RC_RDMA_READ_RESPONSE_FIRST:
 	case IB_OPCODE_RC_RDMA_READ_RESPONSE_MIDDLE:
+		/* Check NAK code to handle a remote error */
+		if (pkt->opcode == IB_OPCODE_RC_ACKNOWLEDGE)
+			break;
+
 		if (pkt->opcode != IB_OPCODE_RC_RDMA_READ_RESPONSE_MIDDLE &&
 		    pkt->opcode != IB_OPCODE_RC_RDMA_READ_RESPONSE_LAST) {
 			/* read retries of partial data may restart from
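
Note on where the new break leads: a bare IB_OPCODE_RC_ACKNOWLEDGE that
arrives while qp->comp.opcode still records an in-progress Read response now
skips the sequence check and reaches the existing switch on pkt->opcode
further down in check_ack(), where the AETH syndrome is examined. That
surrounding code is not part of this patch; sketched from memory (details may
differ from the actual rxe_comp.c), it looks roughly like:

	case IB_OPCODE_RC_ACKNOWLEDGE:
		syn = aeth_syn(pkt);		/* AETH syndrome carried by the ACK */
		switch (syn & AETH_TYPE_MASK) {
		case AETH_ACK:
			return COMPST_WRITE_SEND;	/* normal ACK */
		case AETH_RNR_NAK:
			return COMPST_RNR_RETRY;
		case AETH_NAK:
			/* remote error, e.g. an access fault during the Read */
			switch (syn) {
			case AETH_NAK_PSN_SEQ_ERROR:
				/* sequence errors are retried, not fatal */
				return COMPST_ERROR_RETRY;
			case AETH_NAK_INVALID_REQ:
				wqe->status = IB_WC_REM_INV_REQ_ERR;
				return COMPST_ERROR;
			case AETH_NAK_REM_ACC_ERR:
				wqe->status = IB_WC_REM_ACCESS_ERR;
				return COMPST_ERROR;
			case AETH_NAK_REM_OP_ERR:
			default:
				wqe->status = IB_WC_REM_OP_ERR;
				return COMPST_ERROR;
			}
		}

Returning COMPST_ERROR is what lets the completer report the remote error on
the WQE and terminate the connection instead of retrying the Read endlessly,
which is the behavior described in the commit message.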