[rdma-next,0/9] Rework retry algorithm used when sending MADs

Message ID cover.1733405453.git.leon@kernel.org (mailing list archive)

Leon Romanovsky Dec. 5, 2024, 1:49 p.m. UTC
From Vlad,

This series aims to improve behaviour of a MAD sender under congestion
and/or receiver overload.  We've seen significant drops in goodput when
MAD receivers are overloaded.  This typically happens with SA requests,
which are served by a single node (SM), but can also happen with CM.

Patch 7 introduces the main change: exponential backoff.  This new retry
algorithm is applied to all MADs except RMPP and OPA.  To avoid slowing
recovery from transient failures, exponential backoff only engages after
a certain number of linear timeouts have been experienced.  The backoff
algorithm resets to the beginning after a CM MRA, on the assumption that
the remote is no longer overloaded.
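
To illustrate the idea, here is a minimal userspace sketch of a retry
schedule that stays linear for the first few timeouts and doubles after
that.  It is not the code from patch 7: the function name, the doubling
factor and the max_ms cap are all assumptions made for this example.

```c
/*
 * Hypothetical helper, not taken from the patches.
 *
 * attempt is the zero-based index of the retry.  The first
 * linear_timeouts retries wait base_ms; each later retry doubles the
 * wait, capped at max_ms (the cap is an assumption of this sketch).
 */
static unsigned int retry_timeout_ms(unsigned int base_ms,
				     unsigned int linear_timeouts,
				     unsigned int attempt,
				     unsigned int max_ms)
{
	unsigned int timeout = base_ms;
	unsigned int i;

	if (base_ms == 0)
		return 0;
	if (base_ms >= max_ms)
		return max_ms;

	/* Does not run while attempt < linear_timeouts (linear phase). */
	for (i = linear_timeouts; i <= attempt; i++) {
		/* Stop before doubling could exceed the cap or overflow. */
		if (timeout > max_ms / 2)
			return max_ms;
		timeout *= 2;
	}

	return timeout;
}
```

With base_ms = 100, linear_timeouts = 3 and max_ms = 1000, retries 0 to 2
wait 100 ms each, the next three wait 200, 400 and 800 ms, and every
retry after that waits 1000 ms.  A reset after a CM MRA would correspond
to restarting attempt at 0.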

Because a trade-off between speed of recovery under transient failure
and reducing load from unnecessary retries under persistent failure must
be made, and this trade-off depends on the network scale, patch 8 makes
mad-linear-timeouts configurable.

Patch 1 makes CM MRA apply only once, to prevent entering an excessive
delay condition, even when the receiver is likely no longer overloaded.

The exponential backoff algorithm (a) increases the time until a send
MAD reaches its final timeout, and (b) makes that time hard for callers
to predict.  Since certain callers appear to care about this, patch 2
introduces a new option, deadline, which can be used to enforce when
the final timeout is reached.  SA, UMAD and CM are updated to use this
new parameter (patches 3, 5, 6).
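
One way to picture how a deadline bounds the retry schedule is the
sketch below, which clamps each wait to the time remaining.  It is a
userspace illustration only: the function name and the use of plain
millisecond timestamps are assumptions, and the real patches may track
time differently (for example in jiffies).

```c
#include <stdbool.h>
#include <stdint.h>

/*
 * Hypothetical helper, not taken from the patches.
 *
 * Computes how long to wait before the next retry.  Returns false when
 * the deadline has already been reached, meaning the send should be
 * completed with a final timeout instead of being retried.
 */
static bool next_wait_ms(uint64_t now_ms, uint64_t deadline_ms,
			 uint32_t retry_ms, uint32_t *wait_ms)
{
	uint64_t remaining;

	if (now_ms >= deadline_ms)
		return false;

	remaining = deadline_ms - now_ms;

	/* Never wait past the deadline, however large the backoff is. */
	*wait_ms = remaining < retry_ms ? (uint32_t)remaining : retry_ms;
	return true;
}
```

Whatever interval the backoff produces, the final timeout is then
reached no later than deadline_ms, which keeps it predictable for the
caller.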

Patch 3 also solves a related issue in SA, which in certain cases
configures the MAD layer with extremely aggressive retry intervals.
Because the current aggressive retry was introduced to solve another
issue, patch 4 makes sa-min-timeout configurable.

Patch 9 resolves another related issue in CM, which uses a retry
interval that is far too high for (low-latency) RDMA networks.
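
For a sense of scale, the sketch below converts a timeout exponent to
nanoseconds, assuming the CM response timeout uses the usual InfiniBand
encoding of 4.096 us * 2^n.  This cover letter states neither the
encoding nor the old and new values, so the function name and the
exponents mentioned afterwards are illustrative assumptions.

```c
#include <stdint.h>

/*
 * Hypothetical helper, not taken from the patches.
 *
 * Converts a timeout exponent n to nanoseconds, assuming the encoding
 * 4.096 us * 2^n, i.e. 4096 ns shifted left by n.  The exponent is
 * assumed to fit in 5 bits (0..31), so the result fits in 64 bits.
 */
static uint64_t ib_timeout_ns(unsigned int exponent)
{
	return (uint64_t)4096 << exponent;
}
```

Under that assumption an exponent of 18 gives 1073741824 ns, roughly
1.07 s, while an exponent of 20 gives 4294967296 ns, roughly 4.29 s.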

In summary:
  1) IB/mad: Apply timeout modification (CM MRA) only once
  2) IB/mad: Add deadline for send MADs
  3) RDMA/sa_query: Enforce min retry interval and deadline
  4) RDMA/nldev: Add sa-min-timeout management attribute
  5) IB/umad: Set deadline when sending non-RMPP MADs
  6) IB/cm: Set deadline when sending MADs
  7) IB/mad: Exponential backoff when retrying sends
  8) RDMA/nldev: Add mad-linear-timeouts management attribute
  9) IB/cma: Lower response timeout to roughly 1s

As a follow-up, two tunables will be added to the RDMA tool (iproute2),
under the 'management' namespace:

  mad-linear-timeouts
  sa-min-timeout

Thanks

Vlad Dumitrescu (9):
  IB/mad: Apply timeout modification (CM MRA) only once
  IB/mad: Add deadline for send MADs
  RDMA/sa_query: Enforce min retry interval and deadline
  RDMA/nldev: Add sa-min-timeout management attribute
  IB/umad: Set deadline when sending non-RMPP MADs
  IB/cm: Set deadline when sending MADs
  IB/mad: Exponential backoff when retrying sends
  RDMA/nldev: Add mad-linear-timeouts management attribute
  IB/cma: Lower response timeout to roughly 1s

 drivers/infiniband/core/cm.c        |  13 +++
 drivers/infiniband/core/cma.c       |   2 +-
 drivers/infiniband/core/core_priv.h |   4 +
 drivers/infiniband/core/mad.c       | 141 ++++++++++++++++++++++++++--
 drivers/infiniband/core/mad_priv.h  |   8 ++
 drivers/infiniband/core/nldev.c     | 133 ++++++++++++++++++++++++++
 drivers/infiniband/core/sa_query.c  |  81 +++++++++++++---
 drivers/infiniband/core/user_mad.c  |   8 ++
 include/rdma/ib_mad.h               |  29 ++++++
 include/uapi/rdma/ib_user_mad.h     |  12 ++-
 include/uapi/rdma/rdma_netlink.h    |   7 ++
 11 files changed, 416 insertions(+), 22 deletions(-)