[for-next,v3,00/12] RDMA/rxe: Various fixes and cleanups

Message ID: 20240329145513.35381-2-rpearsonhpe@gmail.com

Message

Bob Pearson March 29, 2024, 2:55 p.m. UTC
This series of patches is the result of high-scale testing on a large
HPC system with a large attached Lustre file system. Several errors
were found which had not been previously seen at smaller scales. The
tests ran up to 1600 QPs on 1024 compute nodes attached to about 100
flash storage nodes. Each patch has its own description.

v3
	Fixed an error in "Don't call rxe_requester from rxe_completer"
	Moved run_requester_again from a global to rxe_req_info.again.
	The control parameter has to be local to each qp.
v2
	Minor edits to some of the commit messages.
	Added a missing change to "Don't schedule rxe_completer...".
	Added a missing change to "Get rid of pkt resend on err".
	Added one additional commit.

Bob Pearson (12):
  RDMA/rxe: Fix seg fault in rxe_comp_queue_pkt
  RDMA/rxe: Allow good work requests to be executed
  RDMA/rxe: Remove redundant scheduling of rxe_completer
  RDMA/rxe: Merge request and complete tasks
  RDMA/rxe: Remove save/rollback_state in rxe_requester
  RDMA/rxe: Don't schedule rxe_completer from rxe_requester
  RDMA/rxe: Don't call rxe_requester from rxe_completer
  RDMA/rxe: Don't call direct between tasks
  RDMA/rxe: Fix incorrect rxe_put in error path
  RDMA/rxe: Make rxe_loopback match rxe_send behavior
  RDMA/rxe: Get rid of pkt resend on err
  RDMA/rxe: Let destroy qp succeed with stuck packet

 drivers/infiniband/sw/rxe/rxe_comp.c        | 32 ++++----
 drivers/infiniband/sw/rxe/rxe_hw_counters.c |  2 +-
 drivers/infiniband/sw/rxe/rxe_hw_counters.h |  2 +-
 drivers/infiniband/sw/rxe/rxe_loc.h         |  3 +-
 drivers/infiniband/sw/rxe/rxe_net.c         | 69 +++++++++--------
 drivers/infiniband/sw/rxe/rxe_qp.c          | 46 +++++-------
 drivers/infiniband/sw/rxe/rxe_req.c         | 82 ++++++---------------
 drivers/infiniband/sw/rxe/rxe_resp.c        | 14 +---
 drivers/infiniband/sw/rxe/rxe_verbs.c       | 17 ++---
 drivers/infiniband/sw/rxe/rxe_verbs.h       |  7 +-
 10 files changed, 111 insertions(+), 163 deletions(-)

Comments

Jason Gunthorpe April 22, 2024, 8:10 p.m. UTC | #1
On Fri, Mar 29, 2024 at 09:55:02AM -0500, Bob Pearson wrote:
> This series of patches is the result of high scale testing on a large
> HPC system with a large attached Lustre file system. Several errors
> were found which had not been previously seen at smaller scales. In
> this case up to 1600 QPs on 1024 compute nodes attached to about 100
> flash storage nodes. Each patch has its own description.
> 
> v3
> 	Fixed an error in "Don't call rxe_requester from rxe_completer"
> 	Moved run_requester_again from a global to rxe_req_info.again.
> 	The control parameter has to be local to each qp.
> v2
> 	Minor edits to some of the commit messages.
> 	Added a missing change to "Don't schedule rxe_completer...".
> 	Added a missing change to "Get rid of pkt resend on err".
> 	Added one additional commit.
> 
> Bob Pearson (12):
>   RDMA/rxe: Fix seg fault in rxe_comp_queue_pkt
>   RDMA/rxe: Allow good work requests to be executed
>   RDMA/rxe: Remove redundant scheduling of rxe_completer
>   RDMA/rxe: Merge request and complete tasks
>   RDMA/rxe: Remove save/rollback_state in rxe_requester
>   RDMA/rxe: Don't schedule rxe_completer from rxe_requester
>   RDMA/rxe: Don't call rxe_requester from rxe_completer
>   RDMA/rxe: Don't call direct between tasks
>   RDMA/rxe: Fix incorrect rxe_put in error path
>   RDMA/rxe: Make rxe_loopback match rxe_send behavior
>   RDMA/rxe: Get rid of pkt resend on err
>   RDMA/rxe: Let destroy qp succeed with stuck packet

Applied to for-next, thanks

Jason