diff mbox

IB/cma: Make timeout dependent on the subnet timeout

Message ID 53566BB5.5030203@acm.org (mailing list archive)
State Rejected
Headers show

Commit Message

Bart Van Assche April 22, 2014, 1:16 p.m. UTC
The default RDMA/CM timeout and retry values are too large for small IB
networks and make the SRP initiator reconnect mechanism unnecessary slow.
Hence make the CM timeout dependent on the subnet timeout in IB networks.

Signed-off-by: Bart Van Assche <bvanassche@acm.org>
Cc: Sean Hefty <sean.hefty@intel.com>
---
 drivers/infiniband/core/cma.c | 18 ++++++++++++++----
 1 file changed, 14 insertions(+), 4 deletions(-)

Comments

Hefty, Sean April 22, 2014, 6:41 p.m. UTC | #1
> +static u8 cma_get_ib_subnet_timeout(struct rdma_cm_id *id)
> +{
> +	struct ib_port_attr attr;
> +	int ret;
> +
> +	ret = ib_query_port(id->device, id->port_num, &attr);
> +	return ret == 0 ? attr.subnet_timeout : 18;
> +}

Can we query the port once (or only on a change) and cache the result, rather than querying it for every request?

--
To unsubscribe from this list: send the line "unsubscribe linux-rdma" in
the body of a message to majordomo@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html
Hal Rosenstock April 23, 2014, 12:30 p.m. UTC | #2
On 4/22/2014 2:41 PM, Hefty, Sean wrote:
>> +static u8 cma_get_ib_subnet_timeout(struct rdma_cm_id *id)
>> +{
>> +	struct ib_port_attr attr;
>> +	int ret;
>> +
>> +	ret = ib_query_port(id->device, id->port_num, &attr);
>> +	return ret == 0 ? attr.subnet_timeout : 18;
>> +}
> 
> Can we query the port once (or only on a change) and cache the result, rather than querying it for every request?

To be IBA spec compliant, SubnetTimeout could change so some new local
event would need to be added and handled to avoid the requerying.

In practice, however, that's not very likely AFAIK.

-- Hal
--
To unsubscribe from this list: send the line "unsubscribe linux-rdma" in
the body of a message to majordomo@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html
Bart Van Assche April 23, 2014, 12:46 p.m. UTC | #3
On 04/23/14 14:30, Hal Rosenstock wrote:
> On 4/22/2014 2:41 PM, Hefty, Sean wrote:
>>> +static u8 cma_get_ib_subnet_timeout(struct rdma_cm_id *id)
>>> +{
>>> +	struct ib_port_attr attr;
>>> +	int ret;
>>> +
>>> +	ret = ib_query_port(id->device, id->port_num, &attr);
>>> +	return ret == 0 ? attr.subnet_timeout : 18;
>>> +}
>>
>> Can we query the port once (or only on a change) and cache the result, rather than querying it for every request?
> 
> To be IBA spec compliant, SubnetTimeout could change so some new local
> event would need to be added and handled to avoid the requerying.
> 
> In practice, however, that's not very likely AFAIK.

Thanks Sean and Hal for the feedback.

Regarding SubnetTimeout changes: the code in
drivers/infiniband/core/cache.c already queues a work request after each
port state change. Inside that work request e.g. the P_Key cache is
updated. Would it be acceptable to modify ib_cache_update() such that it
also queries the port attributes and caches these ? Cached port
attributes could e.g. be stored in struct ib_port. However, doing so
would probably require to protect the port_list member in struct
ib_device against concurrent modifications of that list by the sysfs code.

Bart.

--
To unsubscribe from this list: send the line "unsubscribe linux-rdma" in
the body of a message to majordomo@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html
Hal Rosenstock April 23, 2014, 12:55 p.m. UTC | #4
On 4/23/2014 8:46 AM, Bart Van Assche wrote:
> On 04/23/14 14:30, Hal Rosenstock wrote:
>> On 4/22/2014 2:41 PM, Hefty, Sean wrote:
>>>> +static u8 cma_get_ib_subnet_timeout(struct rdma_cm_id *id)
>>>> +{
>>>> +	struct ib_port_attr attr;
>>>> +	int ret;
>>>> +
>>>> +	ret = ib_query_port(id->device, id->port_num, &attr);
>>>> +	return ret == 0 ? attr.subnet_timeout : 18;
>>>> +}
>>>
>>> Can we query the port once (or only on a change) and cache the result, rather than querying it for every request?
>>
>> To be IBA spec compliant, SubnetTimeout could change so some new local
>> event would need to be added and handled to avoid the requerying.
>>
>> In practice, however, that's not very likely AFAIK.
> 
> Thanks Sean and Hal for the feedback.
> 
> Regarding SubnetTimeout changes: the code in
> drivers/infiniband/core/cache.c already queues a work request after each
> port state change. Inside that work request e.g. the P_Key cache is
> updated. Would it be acceptable to modify ib_cache_update() such that it
> also queries the port attributes and caches these ? Cached port
> attributes could e.g. be stored in struct ib_port. However, doing so
> would probably require to protect the port_list member in struct
> ib_device against concurrent modifications of that list by the sysfs code.

Unfortunately, it's possible that the subnet timeout could be modified
without any of the local events that currently trigger cache update
occurring.

-- Hal

> Bart.
> 
> 

--
To unsubscribe from this list: send the line "unsubscribe linux-rdma" in
the body of a message to majordomo@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html
Hefty, Sean April 23, 2014, 1:44 p.m. UTC | #5
> Regarding SubnetTimeout changes: the code in
> drivers/infiniband/core/cache.c already queues a work request after each
> port state change. Inside that work request e.g. the P_Key cache is
> updated. Would it be acceptable to modify ib_cache_update() such that it
> also queries the port attributes and caches these ? Cached port
> attributes could e.g. be stored in struct ib_port. 

Without looking at details, this at least sounds reasonable.

--
To unsubscribe from this list: send the line "unsubscribe linux-rdma" in
the body of a message to majordomo@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html
Ira Weiny April 23, 2014, 2:34 p.m. UTC | #6
> 
> On 4/23/2014 8:46 AM, Bart Van Assche wrote:
> > On 04/23/14 14:30, Hal Rosenstock wrote:
> >> On 4/22/2014 2:41 PM, Hefty, Sean wrote:
> >>>> +static u8 cma_get_ib_subnet_timeout(struct rdma_cm_id *id) {
> >>>> +	struct ib_port_attr attr;
> >>>> +	int ret;
> >>>> +
> >>>> +	ret = ib_query_port(id->device, id->port_num, &attr);
> >>>> +	return ret == 0 ? attr.subnet_timeout : 18; }
> >>>
> >>> Can we query the port once (or only on a change) and cache the result,
> rather than querying it for every request?
> >>
> >> To be IBA spec compliant, SubnetTimeout could change so some new
> >> local event would need to be added and handled to avoid the requerying.


> >>
> >> In practice, however, that's not very likely AFAIK.
> >
> > Thanks Sean and Hal for the feedback.
> >
> > Regarding SubnetTimeout changes: the code in
> > drivers/infiniband/core/cache.c already queues a work request after
> > each port state change. Inside that work request e.g. the P_Key cache
> > is updated. Would it be acceptable to modify ib_cache_update() such
> > that it also queries the port attributes and caches these ? Cached
> > port attributes could e.g. be stored in struct ib_port. However, doing
> > so would probably require to protect the port_list member in struct
> > ib_device against concurrent modifications of that list by the sysfs code.
> 
> Unfortunately, it's possible that the subnet timeout could be modified
> without any of the local events that currently trigger cache update occurring.

The Spec lists SubnetTimeout change as an unaffiliated Asynchronous event in section 11.6.3.3.

I think adding this event would be the correct course of action when supported.  For HCA's which do not support "port change" events (which I don't think there are any) I guess a cache could be used.

Does this cache in core/cache.c respond to current events like P_Key table change?

Ira

> 
> -- Hal
> 
> > Bart.
> >
> >
> 
> --
> To unsubscribe from this list: send the line "unsubscribe linux-rdma" in the
> body of a message to majordomo@vger.kernel.org More majordomo info at
> http://vger.kernel.org/majordomo-info.html
--
To unsubscribe from this list: send the line "unsubscribe linux-rdma" in
the body of a message to majordomo@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html
Or Gerlitz April 24, 2014, 7:39 a.m. UTC | #7
On 23/04/2014 16:44, Hefty, Sean wrote:
>> Regarding SubnetTimeout changes: the code in
>> drivers/infiniband/core/cache.c already queues a work request after each
>> port state change. Inside that work request e.g. the P_Key cache is
>> updated. Would it be acceptable to modify ib_cache_update() such that it
>> also queries the port attributes and caches these ? Cached port
>> attributes could e.g. be stored in struct ib_port.
> Without looking at details, this at least sounds reasonable.
>

Sean, can't we have CMA to follow the same practice used in the CM where 
we derive the RC QP timeout based on the packet life time retrieved in 
path queries? e.g base the cm response time out on this value  too?

--
To unsubscribe from this list: send the line "unsubscribe linux-rdma" in
the body of a message to majordomo@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html
Hefty, Sean April 25, 2014, 2:59 a.m. UTC | #8
> Sean, can't we have CMA to follow the same practice used in the CM where
> we derive the RC QP timeout based on the packet life time retrieved in
> path queries? e.g base the cm response time out on this value  too?

We could.  The timeout that's being modified by the patch is the time needed by the remote peer to process the incoming message and send a response.  This time is in addition to the packet life time value that gets used.

For a remote kernel agent, the time needed to respond to a CM message may be fairly small.  For a user space client, the time may be significant, on the order to seconds to minutes.  We can probably make due with a fairly short timeout, provided that MRAs are used by the remote side.

There's no great solution that I can think of.  Maybe the RDMA CM can adjust the timeout based on the remote address, assuming that it can determine if the remote address is a user space or kernel agent.

- Sean
--
To unsubscribe from this list: send the line "unsubscribe linux-rdma" in
the body of a message to majordomo@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html
Bart Van Assche April 25, 2014, 4:21 p.m. UTC | #9
On 04/25/14 04:59, Hefty, Sean wrote:
>> Sean, can't we have CMA to follow the same practice used in the CM where
>> we derive the RC QP timeout based on the packet life time retrieved in
>> path queries? e.g base the cm response time out on this value  too?
> 
> We could.  The timeout that's being modified by the patch is the time needed
> by the remote peer to process the incoming message and send a
response. This
> time is in addition to the packet life time value that gets used.
> 
> For a remote kernel agent, the time needed to respond to a CM message may be
> fairly small.  For a user space client, the time may be significant,
on the
> order to seconds to minutes.  We can probably make due with a fairly short
> timeout, provided that MRAs are used by the remote side.
> 
> There's no great solution that I can think of.  Maybe the RDMA CM can adjust
> the timeout based on the remote address, assuming that it can determine if
> the remote address is a user space or kernel agent.

Another possible approach is to make the CM timeout configurable for
kernel clients only. How about creating two versions of struct
rdma_conn_param - the existing version for communication between user
space and kernel and a second version for in-kernel clients only ? In
that second version a field could be added that allows to specify the CM
timeout.

Bart.

--
To unsubscribe from this list: send the line "unsubscribe linux-rdma" in
the body of a message to majordomo@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html
Hefty, Sean April 25, 2014, 4:56 p.m. UTC | #10
> Another possible approach is to make the CM timeout configurable for
> kernel clients only. How about creating two versions of struct
> rdma_conn_param - the existing version for communication between user
> space and kernel and a second version for in-kernel clients only ? In
> that second version a field could be added that allows to specify the CM
> timeout.

A single structure with a new field is sufficient.  A value of 0 could just mean the default.  But the timeout in this instance is per request, with the number of retries set to the max for IB (15).  These are IB specific.  It sounds like you're wanting to limit the total timeout.

Btw, up til now most of the requests regarding CM timeouts has been to increase it.  We need to make sure that we don't end up breaking that.  Even if we set this timeout to 0, the total timeout for a connection request is 2 x packet lifetime x # retries.  With a 1 second packet lifetime, that's still 30 seconds.  What timeout range are you looking for?  The app could always set a separate timer and cancel the request as well.
--
To unsubscribe from this list: send the line "unsubscribe linux-rdma" in
the body of a message to majordomo@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html
Bart Van Assche April 25, 2014, 5:47 p.m. UTC | #11
On 04/25/14 18:56, Hefty, Sean wrote:
> Btw, up til now most of the requests regarding CM timeouts has been to
> increase it.  We need to make sure that we don't end up breaking that.
> Even if we set this timeout to 0, the total timeout for a connection
> request is 2 x packet lifetime x # retries.  With a 1 second packet
> lifetime, that's still 30 seconds.  What timeout range are you looking
> for?  The app could always set a separate timer and cancel the request
> as well.

What I had noticed is that with the current timeout settings after a
cable pull it can take really long before a cm_id object reaches
TimeWait exit state. Timeout settings that are between 1/4th and 1/8th
of the current settings result in more reasonable behavior.

Bart.
--
To unsubscribe from this list: send the line "unsubscribe linux-rdma" in
the body of a message to majordomo@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html
diff mbox

Patch

diff --git a/drivers/infiniband/core/cma.c b/drivers/infiniband/core/cma.c
index 199958d..95528e1 100644
--- a/drivers/infiniband/core/cma.c
+++ b/drivers/infiniband/core/cma.c
@@ -60,7 +60,6 @@  MODULE_AUTHOR("Sean Hefty");
 MODULE_DESCRIPTION("Generic RDMA CM Agent");
 MODULE_LICENSE("Dual BSD/GPL");
 
-#define CMA_CM_RESPONSE_TIMEOUT 20
 #define CMA_MAX_CM_RETRIES 15
 #define CMA_CM_MRA_SETTING (IB_CM_MRA_FLAG_DELAY | 24)
 #define CMA_IBOE_PACKET_LIFETIME 18
@@ -2728,6 +2727,15 @@  out:
 	return ret;
 }
 
+static u8 cma_get_ib_subnet_timeout(struct rdma_cm_id *id)
+{
+	struct ib_port_attr attr;
+	int ret;
+
+	ret = ib_query_port(id->device, id->port_num, &attr);
+	return ret == 0 ? attr.subnet_timeout : 18;
+}
+
 static int cma_resolve_ib_udp(struct rdma_id_private *id_priv,
 			      struct rdma_conn_param *conn_param)
 {
@@ -2735,6 +2743,7 @@  static int cma_resolve_ib_udp(struct rdma_id_private *id_priv,
 	struct ib_cm_id	*id;
 	void *private_data;
 	int offset, ret;
+	u8 cm_response_timeout = cma_get_ib_subnet_timeout(&id_priv->id) + 2;
 
 	memset(&req, 0, sizeof req);
 	offset = cma_user_data_offset(id_priv);
@@ -2771,7 +2780,7 @@  static int cma_resolve_ib_udp(struct rdma_id_private *id_priv,
 
 	req.path = id_priv->id.route.path_rec;
 	req.service_id = rdma_get_service_id(&id_priv->id, cma_dst_addr(id_priv));
-	req.timeout_ms = 1 << (CMA_CM_RESPONSE_TIMEOUT - 8);
+	req.timeout_ms = 1 << max(cm_response_timeout - 8, 0);
 	req.max_cm_retries = CMA_MAX_CM_RETRIES;
 
 	ret = ib_send_cm_sidr_req(id_priv->cm_id.ib, &req);
@@ -2792,6 +2801,7 @@  static int cma_connect_ib(struct rdma_id_private *id_priv,
 	void *private_data;
 	struct ib_cm_id	*id;
 	int offset, ret;
+	u8 cm_response_timeout = cma_get_ib_subnet_timeout(&id_priv->id) + 2;
 
 	memset(&req, 0, sizeof req);
 	offset = cma_user_data_offset(id_priv);
@@ -2839,8 +2849,8 @@  static int cma_connect_ib(struct rdma_id_private *id_priv,
 	req.flow_control = conn_param->flow_control;
 	req.retry_count = min_t(u8, 7, conn_param->retry_count);
 	req.rnr_retry_count = min_t(u8, 7, conn_param->rnr_retry_count);
-	req.remote_cm_response_timeout = CMA_CM_RESPONSE_TIMEOUT;
-	req.local_cm_response_timeout = CMA_CM_RESPONSE_TIMEOUT;
+	req.remote_cm_response_timeout = cm_response_timeout;
+	req.local_cm_response_timeout = cm_response_timeout;
 	req.max_cm_retries = CMA_MAX_CM_RETRIES;
 	req.srq = id_priv->srq ? 1 : 0;