[2/2] ksmbd: smbd: handle RDMA CM time wait event

Message ID 20220613230119.73475-2-hyc.lee@gmail.com (mailing list archive)
State New, archived
Series [1/2] ksmbd: remove duplicate flag set in smb2_write

Commit Message

Hyunchul Lee June 13, 2022, 11:01 p.m. UTC
After a QP has been disconnected, it stays
in a timewait state for in flight packets.
After the state has completed,
RDMA_CM_EVENT_TIMEWAIT_EXIT is reported.
Disconnect on RDMA_CM_EVENT_TIMEWAIT_EXIT
so that ksmbd can restart.

Signed-off-by: Hyunchul Lee <hyc.lee@gmail.com>
---
 fs/ksmbd/transport_rdma.c | 1 +
 1 file changed, 1 insertion(+)

Comments

Namjae Jeon June 13, 2022, 11:10 p.m. UTC | #1
2022-06-14 8:01 GMT+09:00, Hyunchul Lee <hyc.lee@gmail.com>:
> After a QP has been disconnected, it stays
> in a timewait state for in flight packets.
> After the state has completed,
> RDMA_CM_EVENT_TIMEWAIT_EXIT is reported.
> Disconnect on RDMA_CM_EVENT_TIMEWAIT_EXIT
> so that ksmbd can restart.
>
> Signed-off-by: Hyunchul Lee <hyc.lee@gmail.com>
Acked-by: Namjae Jeon <linkinjeon@kernel.org>

Thanks!

Tom Talpey June 14, 2022, 11:56 a.m. UTC | #2
On 6/13/2022 7:01 PM, Hyunchul Lee wrote:
> After a QP has been disconnected, it stays
> in a timewait state for in flight packets.
> After the state has completed,
> RDMA_CM_EVENT_TIMEWAIT_EXIT is reported.
> Disconnect on RDMA_CM_EVENT_TIMEWAIT_EXIT
> so that ksmbd can restart.
> 
> Signed-off-by: Hyunchul Lee <hyc.lee@gmail.com>
> ---
>   fs/ksmbd/transport_rdma.c | 1 +
>   1 file changed, 1 insertion(+)
> 
> diff --git a/fs/ksmbd/transport_rdma.c b/fs/ksmbd/transport_rdma.c
> index d035e060c2f0..4b1a471afcd0 100644
> --- a/fs/ksmbd/transport_rdma.c
> +++ b/fs/ksmbd/transport_rdma.c
> @@ -1535,6 +1535,7 @@ static int smb_direct_cm_handler(struct rdma_cm_id *cm_id,
>   		wake_up_interruptible(&t->wait_status);
>   		break;
>   	}
> +	case RDMA_CM_EVENT_TIMEWAIT_EXIT:
>   	case RDMA_CM_EVENT_DEVICE_REMOVAL:
>   	case RDMA_CM_EVENT_DISCONNECTED: {
>   		t->status = SMB_DIRECT_CS_DISCONNECTED;

Is this issue seen on all RDMA providers? Because I would normally
expect that an RDMA_CM_EVENT_DISCONNECTED will precede the TIMEWAIT
event. What scenarios have you seen this not occur?

Unless ksmbd wishes to reuse its QP's, which is not currently the
case (right?), there's pretty much no reason to manage QP state and
hang around for TIMEWAIT.

Tom.

Hyunchul Lee June 15, 2022, 2:14 a.m. UTC | #3
On Tue, Jun 14, 2022 at 8:56 PM, Tom Talpey <tom@talpey.com> wrote:
>
>
> On 6/13/2022 7:01 PM, Hyunchul Lee wrote:
> > After a QP has been disconnected, it stays
> > in a timewait state for in flight packets.
> > After the state has completed,
> > RDMA_CM_EVENT_TIMEWAIT_EXIT is reported.
> > Disconnect on RDMA_CM_EVENT_TIMEWAIT_EXIT
> > so that ksmbd can restart.
> >
> > Signed-off-by: Hyunchul Lee <hyc.lee@gmail.com>
> > ---
> >   fs/ksmbd/transport_rdma.c | 1 +
> >   1 file changed, 1 insertion(+)
> >
> > diff --git a/fs/ksmbd/transport_rdma.c b/fs/ksmbd/transport_rdma.c
> > index d035e060c2f0..4b1a471afcd0 100644
> > --- a/fs/ksmbd/transport_rdma.c
> > +++ b/fs/ksmbd/transport_rdma.c
> > @@ -1535,6 +1535,7 @@ static int smb_direct_cm_handler(struct rdma_cm_id *cm_id,
> >               wake_up_interruptible(&t->wait_status);
> >               break;
> >       }
> > +     case RDMA_CM_EVENT_TIMEWAIT_EXIT:
> >       case RDMA_CM_EVENT_DEVICE_REMOVAL:
> >       case RDMA_CM_EVENT_DISCONNECTED: {
> >               t->status = SMB_DIRECT_CS_DISCONNECTED;
>
> Is this issue seen on all RDMA providers? Because I would normally
> expect that an RDMA_CM_EVENT_DISCONNECTED will precede the TIMEWAIT
> event. What scenarios have you seen this not occur?
>

ksmbd got stuck after an attempted shutdown. We have been trying to
reproduce the issue but have not managed to yet; it seems to be
related to the TIMEWAIT event.

Other drivers, such as nvme, also disconnect on the TIMEWAIT event.

> Unless ksmbd wishes to reuse its QP's, which is not currently the
> case (right?), there's pretty much no reason to manage QP state and
> hang around for TIMEWAIT.

Right, ksmbd doesn't reuse QPs.

>
> Tom.

Tom Talpey June 15, 2022, 6:52 p.m. UTC | #4
On 6/14/2022 10:14 PM, Hyunchul Lee wrote:
> On Tue, Jun 14, 2022 at 8:56 PM, Tom Talpey <tom@talpey.com> wrote:
>>
>>
>> On 6/13/2022 7:01 PM, Hyunchul Lee wrote:
>>> After a QP has been disconnected, it stays
>>> in a timewait state for in flight packets.
>>> After the state has completed,
>>> RDMA_CM_EVENT_TIMEWAIT_EXIT is reported.
>>> Disconnect on RDMA_CM_EVENT_TIMEWAIT_EXIT
>>> so that ksmbd can restart.
>>>
>>> Signed-off-by: Hyunchul Lee <hyc.lee@gmail.com>
>>> ---
>>>    fs/ksmbd/transport_rdma.c | 1 +
>>>    1 file changed, 1 insertion(+)
>>>
>>> diff --git a/fs/ksmbd/transport_rdma.c b/fs/ksmbd/transport_rdma.c
>>> index d035e060c2f0..4b1a471afcd0 100644
>>> --- a/fs/ksmbd/transport_rdma.c
>>> +++ b/fs/ksmbd/transport_rdma.c
>>> @@ -1535,6 +1535,7 @@ static int smb_direct_cm_handler(struct rdma_cm_id *cm_id,
>>>                wake_up_interruptible(&t->wait_status);
>>>                break;
>>>        }
>>> +     case RDMA_CM_EVENT_TIMEWAIT_EXIT:
>>>        case RDMA_CM_EVENT_DEVICE_REMOVAL:
>>>        case RDMA_CM_EVENT_DISCONNECTED: {
>>>                t->status = SMB_DIRECT_CS_DISCONNECTED;
>>
>> Is this issue seen on all RDMA providers? Because I would normally
>> expect that an RDMA_CM_EVENT_DISCONNECTED will precede the TIMEWAIT
>> event. What scenarios have you seen this not occur?
>>
> 
> ksmbd got stuck after an attempted shutdown. We have been trying to
> reproduce the issue but have not managed to yet; it seems to be
> related to the TIMEWAIT event.

I don't think it's appropriate to add this case to SMB. I think it's
quite unlikely that it will address anything, because an RDMA provider
must have indicated a CM_EVENT_DISCONNECTED prior to any TIMEWAIT.
So, the QP (and connection) will already have been torn down by ksmbd
at the earlier event. Perhaps ksmbd did not properly drain the QP at
the initial disconnect.
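
The ordering argument above can be illustrated with a small userspace model. This is only a sketch under stated assumptions: the `sketch_*` names are hypothetical stand-ins rather than kernel identifiers, and the model assumes the provider always delivers a disconnect event before the timewait-exit event, as described above. It replays that sequence through a handler with and without the proposed extra case.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>

/* Hypothetical stand-ins for RDMA CM event codes (not the kernel enum). */
enum sketch_cm_event {
	SKETCH_EV_ESTABLISHED,
	SKETCH_EV_DISCONNECTED,
	SKETCH_EV_TIMEWAIT_EXIT,
};

enum sketch_status { SKETCH_CONNECTED, SKETCH_DISCONNECTED };

struct sketch_transport {
	enum sketch_status status;
	int teardown_requests;	/* how often the disconnect path ran */
};

static void sketch_mark_disconnected(struct sketch_transport *t)
{
	t->status = SKETCH_DISCONNECTED;
	t->teardown_requests++;
}

/* handle_timewait = true models the patched handler, false the unpatched one. */
static void sketch_cm_handler(struct sketch_transport *t,
			      enum sketch_cm_event ev, bool handle_timewait)
{
	assert(t != NULL);
	switch (ev) {
	case SKETCH_EV_ESTABLISHED:
		t->status = SKETCH_CONNECTED;
		break;
	case SKETCH_EV_TIMEWAIT_EXIT:
		if (handle_timewait)
			sketch_mark_disconnected(t);
		break;
	case SKETCH_EV_DISCONNECTED:
		sketch_mark_disconnected(t);
		break;
	}
}

/* Replay ESTABLISHED, DISCONNECTED, TIMEWAIT_EXIT; return the final status. */
static enum sketch_status sketch_replay(struct sketch_transport *t,
					bool handle_timewait)
{
	const enum sketch_cm_event seq[] = {
		SKETCH_EV_ESTABLISHED,
		SKETCH_EV_DISCONNECTED,
		SKETCH_EV_TIMEWAIT_EXIT,
	};

	t->status = SKETCH_DISCONNECTED;
	t->teardown_requests = 0;
	for (size_t i = 0; i < sizeof(seq) / sizeof(seq[0]); i++)
		sketch_cm_handler(t, seq[i], handle_timewait);
	return t->status;
}
```

In this model both variants end in the disconnected state; the patched variant merely runs the disconnect path a second time, after the first event has already triggered teardown. The model says nothing about providers that might violate the assumed ordering.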

> Other drivers, such as nvme, also disconnect on the TIMEWAIT event.

NVME is a completely different upper layer, and has different client/
server transport behavior. The SMB session insulates its peers from
most transport errors, and should not be requesting timewait for
its connections, and definitely not waiting for timewait to expire
before initiating teardown (or recovery). The NFS/RDMA client and
server ignore this event, btw.

>> Unless ksmbd wishes to reuse its QP's, which is not currently the
>> case (right?), there's pretty much no reason to manage QP state and
>> hang around for TIMEWAIT.
> 
> Right, ksmbd doesn't reuse QPs.

Then there appears to be no good justification for the change. Sorry,
but it's a NAK from me.

Tom.

Hyunchul Lee June 17, 2022, 7:51 a.m. UTC | #5
On Thu, Jun 16, 2022 at 3:53 AM, Tom Talpey <tom@talpey.com> wrote:
>
>
> On 6/14/2022 10:14 PM, Hyunchul Lee wrote:
> > On Tue, Jun 14, 2022 at 8:56 PM, Tom Talpey <tom@talpey.com> wrote:
> >>
> >>
> >> On 6/13/2022 7:01 PM, Hyunchul Lee wrote:
> >>> After a QP has been disconnected, it stays
> >>> in a timewait state for in flight packets.
> >>> After the state has completed,
> >>> RDMA_CM_EVENT_TIMEWAIT_EXIT is reported.
> >>> Disconnect on RDMA_CM_EVENT_TIMEWAIT_EXIT
> >>> so that ksmbd can restart.
> >>>
> >>> Signed-off-by: Hyunchul Lee <hyc.lee@gmail.com>
> >>> ---
> >>>    fs/ksmbd/transport_rdma.c | 1 +
> >>>    1 file changed, 1 insertion(+)
> >>>
> >>> diff --git a/fs/ksmbd/transport_rdma.c b/fs/ksmbd/transport_rdma.c
> >>> index d035e060c2f0..4b1a471afcd0 100644
> >>> --- a/fs/ksmbd/transport_rdma.c
> >>> +++ b/fs/ksmbd/transport_rdma.c
> >>> @@ -1535,6 +1535,7 @@ static int smb_direct_cm_handler(struct rdma_cm_id *cm_id,
> >>>                wake_up_interruptible(&t->wait_status);
> >>>                break;
> >>>        }
> >>> +     case RDMA_CM_EVENT_TIMEWAIT_EXIT:
> >>>        case RDMA_CM_EVENT_DEVICE_REMOVAL:
> >>>        case RDMA_CM_EVENT_DISCONNECTED: {
> >>>                t->status = SMB_DIRECT_CS_DISCONNECTED;
> >>
> >> Is this issue seen on all RDMA providers? Because I would normally
> >> expect that an RDMA_CM_EVENT_DISCONNECTED will precede the TIMEWAIT
> >> event. What scenarios have you seen this not occur?
> >>
> >
> > ksmbd got stuck after an attempted shutdown. We have been trying to
> > reproduce the issue but have not managed to yet; it seems to be
> > related to the TIMEWAIT event.
>
> I don't think it's appropriate to add this case to SMB. I think it's
> quite unlikely that it will address anything, because an RDMA provider
> must have indicated a CM_EVENT_DISCONNECTED prior to any TIMEWAIT.
> So, the QP (and connection) will already have been torn down by ksmbd
> at the earlier event. Perhaps ksmbd did not properly drain the QP at
> the initial disconnect.
>
> > Other drivers, such as nvme, also disconnect on the TIMEWAIT event.
>
> NVME is a completely different upper layer, and has different client/
> server transport behavior. The SMB session insulates its peers from
> most transport errors, and should not be requesting timewait for
> its connections, and definitely not waiting for timewait to expire
> before initiating teardown (or recovery). The NFS/RDMA client and
> server ignore this event, btw.
>

Okay, I got it.
I am looking for the cause and have found some clues.

> >> Unless ksmbd wishes to reuse its QP's, which is not currently the
> >> case (right?), there's pretty much no reason to manage QP state and
> >> hang around for TIMEWAIT.
> >
> > Right, ksmbd doesn't reuse QPs.
>
> Then there appears to be no good justification for the change. Sorry,
> but it's a NAK from me.
>

Thank you very much for the detailed explanation.

> Tom.
Patch

diff --git a/fs/ksmbd/transport_rdma.c b/fs/ksmbd/transport_rdma.c
index d035e060c2f0..4b1a471afcd0 100644
--- a/fs/ksmbd/transport_rdma.c
+++ b/fs/ksmbd/transport_rdma.c
@@ -1535,6 +1535,7 @@  static int smb_direct_cm_handler(struct rdma_cm_id *cm_id,
 		wake_up_interruptible(&t->wait_status);
 		break;
 	}
+	case RDMA_CM_EVENT_TIMEWAIT_EXIT:
 	case RDMA_CM_EVENT_DEVICE_REMOVAL:
 	case RDMA_CM_EVENT_DISCONNECTED: {
 		t->status = SMB_DIRECT_CS_DISCONNECTED;
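
The one-line change relies on stacked C case labels: the added label carries no statements of its own, so control falls through to the existing disconnect handling. A minimal userspace sketch of that pattern follows. The `demo_*` names are hypothetical and only mirror the shape of the hunk; the real handler does more than set a status field, and this patch was NAKed in review, so the sketch illustrates the mechanism rather than a recommended change.

```c
#include <assert.h>
#include <stddef.h>

/* Hypothetical event and status codes; not the kernel definitions. */
enum demo_cm_event {
	DEMO_EVENT_ESTABLISHED,
	DEMO_EVENT_TIMEWAIT_EXIT,
	DEMO_EVENT_DEVICE_REMOVAL,
	DEMO_EVENT_DISCONNECTED,
	DEMO_EVENT_OTHER,
};

enum demo_conn_status {
	DEMO_CS_NEW,
	DEMO_CS_CONNECTED,
	DEMO_CS_DISCONNECTED,
};

struct demo_transport {
	enum demo_conn_status status;
};

/* Returns 0 if the event was handled, -1 otherwise. */
static int demo_cm_handler(struct demo_transport *t, enum demo_cm_event event)
{
	assert(t != NULL);
	switch (event) {
	case DEMO_EVENT_ESTABLISHED:
		t->status = DEMO_CS_CONNECTED;
		break;
	case DEMO_EVENT_TIMEWAIT_EXIT:	/* the label the patch adds */
	case DEMO_EVENT_DEVICE_REMOVAL:
	case DEMO_EVENT_DISCONNECTED:
		/* All three labels share this one body. */
		t->status = DEMO_CS_DISCONNECTED;
		break;
	default:
		return -1;
	}
	return 0;
}
```

Because the three labels share a single body, a timewait-exit event in this sketch is indistinguishable from a disconnect once it reaches the handler.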