From patchwork Fri Nov 11 09:22:22 2022 Content-Type: text/plain; charset="utf-8" MIME-Version: 1.0 Content-Transfer-Encoding: 7bit X-Patchwork-Submitter: "Daisuke Matsuda (Fujitsu)" X-Patchwork-Id: 13039963 Received: from esa12.hc1455-7.c3s2.iphmx.com (esa12.hc1455-7.c3s2.iphmx.com [139.138.37.100]) (using TLSv1.2 with cipher ECDHE-RSA-AES256-GCM-SHA384 (256/256 bits)) (No client certificate requested) by smtp.subspace.kernel.org (Postfix) with ESMTPS id 14F8781A for ; Fri, 11 Nov 2022 09:25:02 +0000 (UTC) X-IronPort-AV: E=McAfee;i="6500,9779,10527"; a="75166303" X-IronPort-AV: E=Sophos;i="5.96,156,1665414000"; d="scan'208";a="75166303" Received: from unknown (HELO yto-r4.gw.nic.fujitsu.com) ([218.44.52.220]) by esa12.hc1455-7.c3s2.iphmx.com with ESMTP; 11 Nov 2022 18:23:52 +0900 Received: from yto-m4.gw.nic.fujitsu.com (yto-nat-yto-m4.gw.nic.fujitsu.com [192.168.83.67]) by yto-r4.gw.nic.fujitsu.com (Postfix) with ESMTP id 74752D3EA3 for ; Fri, 11 Nov 2022 18:23:52 +0900 (JST) Received: from m3004.s.css.fujitsu.com (m3004.s.css.fujitsu.com [10.128.233.124]) by yto-m4.gw.nic.fujitsu.com (Postfix) with ESMTP id C684FF0FD1 for ; Fri, 11 Nov 2022 18:23:51 +0900 (JST) Received: from localhost.localdomain (unknown [10.19.3.107]) by m3004.s.css.fujitsu.com (Postfix) with ESMTP id 9083620607A2; Fri, 11 Nov 2022 18:23:51 +0900 (JST) From: Daisuke Matsuda To: linux-rdma@vger.kernel.org, leonro@nvidia.com, jgg@nvidia.com, zyjzyj2000@gmail.com Cc: nvdimm@lists.linux.dev, linux-kernel@vger.kernel.org, rpearsonhpe@gmail.com, yangx.jy@fujitsu.com, lizhijian@fujitsu.com, y-goto@fujitsu.com, Daisuke Matsuda Subject: [RFC PATCH v2 1/7] IB/mlx5: Change ib_umem_odp_map_dma_single_page() to retain umem_mutex Date: Fri, 11 Nov 2022 18:22:22 +0900 Message-Id: X-Mailer: git-send-email 2.31.1 In-Reply-To: References: Precedence: bulk X-Mailing-List: nvdimm@lists.linux.dev List-Id: List-Subscribe: List-Unsubscribe: MIME-Version: 1.0 X-TM-AS-GCONF: 00 ib_umem_odp_map_dma_single_page(), which has been used only by the mlx5 driver, holds umem_mutex on success and releases on failure. This behavior is not convenient for other drivers to use it, so change it to always retain mutex on return. Signed-off-by: Daisuke Matsuda --- drivers/infiniband/core/umem_odp.c | 8 +++----- drivers/infiniband/hw/mlx5/odp.c | 4 +++- 2 files changed, 6 insertions(+), 6 deletions(-) diff --git a/drivers/infiniband/core/umem_odp.c b/drivers/infiniband/core/umem_odp.c index e9fa22d31c23..49da6735f7c8 100644 --- a/drivers/infiniband/core/umem_odp.c +++ b/drivers/infiniband/core/umem_odp.c @@ -328,8 +328,8 @@ static int ib_umem_odp_map_dma_single_page( * * Maps the range passed in the argument to DMA addresses. * The DMA addresses of the mapped pages is updated in umem_odp->dma_list. - * Upon success the ODP MR will be locked to let caller complete its device - * page table update. + * The umem mutex is locked in this function. Callers are responsible for + * releasing the lock. * * Returns the number of pages mapped in success, negative error code * for failure. @@ -453,11 +453,9 @@ int ib_umem_odp_map_dma_and_lock(struct ib_umem_odp *umem_odp, u64 user_virt, break; } } - /* upon success lock should stay on hold for the callee */ + if (!ret) ret = dma_index - start_idx; - else - mutex_unlock(&umem_odp->umem_mutex); out_put_mm: mmput_async(owning_mm); diff --git a/drivers/infiniband/hw/mlx5/odp.c b/drivers/infiniband/hw/mlx5/odp.c index bc97958818bb..a0de27651586 100644 --- a/drivers/infiniband/hw/mlx5/odp.c +++ b/drivers/infiniband/hw/mlx5/odp.c @@ -572,8 +572,10 @@ static int pagefault_real_mr(struct mlx5_ib_mr *mr, struct ib_umem_odp *odp, access_mask |= ODP_WRITE_ALLOWED_BIT; np = ib_umem_odp_map_dma_and_lock(odp, user_va, bcnt, access_mask, fault); - if (np < 0) + if (np < 0) { + mutex_unlock(&odp->umem_mutex); return np; + } /* * No need to check whether the MTTs really belong to this MR, since From patchwork Fri Nov 11 09:22:23 2022 Content-Type: text/plain; charset="utf-8" MIME-Version: 1.0 Content-Transfer-Encoding: 7bit X-Patchwork-Submitter: "Daisuke Matsuda (Fujitsu)" X-Patchwork-Id: 13039967 Received: from esa3.hc1455-7.c3s2.iphmx.com (esa3.hc1455-7.c3s2.iphmx.com [207.54.90.49]) (using TLSv1.2 with cipher ECDHE-RSA-AES256-GCM-SHA384 (256/256 bits)) (No client certificate requested) by smtp.subspace.kernel.org (Postfix) with ESMTPS id 66C59258E for ; Fri, 11 Nov 2022 09:25:08 +0000 (UTC) X-IronPort-AV: E=McAfee;i="6500,9779,10527"; a="95684827" X-IronPort-AV: E=Sophos;i="5.96,156,1665414000"; d="scan'208";a="95684827" Received: from unknown (HELO yto-r2.gw.nic.fujitsu.com) ([218.44.52.218]) by esa3.hc1455-7.c3s2.iphmx.com with ESMTP; 11 Nov 2022 18:23:57 +0900 Received: from yto-m1.gw.nic.fujitsu.com (yto-nat-yto-m1.gw.nic.fujitsu.com [192.168.83.64]) by yto-r2.gw.nic.fujitsu.com (Postfix) with ESMTP id DFAF3D619D for ; Fri, 11 Nov 2022 18:23:54 +0900 (JST) Received: from m3004.s.css.fujitsu.com (m3004.s.css.fujitsu.com [10.128.233.124]) by yto-m1.gw.nic.fujitsu.com (Postfix) with ESMTP id CAD7BCFAAE for ; Fri, 11 Nov 2022 18:23:53 +0900 (JST) Received: from localhost.localdomain (unknown [10.19.3.107]) by m3004.s.css.fujitsu.com (Postfix) with ESMTP id 8AD0D20607A2; Fri, 11 Nov 2022 18:23:53 +0900 (JST) From: Daisuke Matsuda To: linux-rdma@vger.kernel.org, leonro@nvidia.com, jgg@nvidia.com, zyjzyj2000@gmail.com Cc: nvdimm@lists.linux.dev, linux-kernel@vger.kernel.org, rpearsonhpe@gmail.com, yangx.jy@fujitsu.com, lizhijian@fujitsu.com, y-goto@fujitsu.com, Daisuke Matsuda Subject: [RFC PATCH v2 2/7] RDMA/rxe: Convert the triple tasklets to workqueues Date: Fri, 11 Nov 2022 18:22:23 +0900 Message-Id: X-Mailer: git-send-email 2.31.1 In-Reply-To: References: Precedence: bulk X-Mailing-List: nvdimm@lists.linux.dev List-Id: List-Subscribe: List-Unsubscribe: MIME-Version: 1.0 X-TM-AS-GCONF: 00 In order to implement On-Demand Paging on the rxe driver, triple tasklets (requester, responder, and completer) must be allowed to sleep so that they can trigger page fault when pages being accessed are not present. This patch replaces the tasklets with workqueues, but still allows direct- call of works from softirq context if it is obvious that MRs are not going to be accessed and there is no work being processed in the workqueue. As counterparts to tasklet_disable() and tasklet_enable() are missing for workqueues, an atomic value is introduced to get works suspended while qp reset is in progress. Signed-off-by: Daisuke Matsuda --- drivers/infiniband/sw/rxe/Makefile | 2 +- drivers/infiniband/sw/rxe/rxe_comp.c | 42 ++++--- drivers/infiniband/sw/rxe/rxe_loc.h | 4 +- drivers/infiniband/sw/rxe/rxe_net.c | 4 +- drivers/infiniband/sw/rxe/rxe_param.h | 2 +- drivers/infiniband/sw/rxe/rxe_qp.c | 71 ++++++------ drivers/infiniband/sw/rxe/rxe_recv.c | 4 +- drivers/infiniband/sw/rxe/rxe_req.c | 14 +-- drivers/infiniband/sw/rxe/rxe_resp.c | 22 ++-- drivers/infiniband/sw/rxe/rxe_verbs.c | 8 +- drivers/infiniband/sw/rxe/rxe_verbs.h | 8 +- drivers/infiniband/sw/rxe/rxe_wq.c | 160 ++++++++++++++++++++++++++ drivers/infiniband/sw/rxe/rxe_wq.h | 70 +++++++++++ 13 files changed, 329 insertions(+), 82 deletions(-) create mode 100644 drivers/infiniband/sw/rxe/rxe_wq.c create mode 100644 drivers/infiniband/sw/rxe/rxe_wq.h diff --git a/drivers/infiniband/sw/rxe/Makefile b/drivers/infiniband/sw/rxe/Makefile index 5395a581f4bb..358f6b06aa64 100644 --- a/drivers/infiniband/sw/rxe/Makefile +++ b/drivers/infiniband/sw/rxe/Makefile @@ -20,6 +20,6 @@ rdma_rxe-y := \ rxe_mmap.o \ rxe_icrc.o \ rxe_mcast.o \ - rxe_task.o \ + rxe_wq.o \ rxe_net.o \ rxe_hw_counters.o diff --git a/drivers/infiniband/sw/rxe/rxe_comp.c b/drivers/infiniband/sw/rxe/rxe_comp.c index c9170dd99f3a..880305e6e649 100644 --- a/drivers/infiniband/sw/rxe/rxe_comp.c +++ b/drivers/infiniband/sw/rxe/rxe_comp.c @@ -9,7 +9,7 @@ #include "rxe.h" #include "rxe_loc.h" #include "rxe_queue.h" -#include "rxe_task.h" +#include "rxe_wq.h" enum comp_state { COMPST_GET_ACK, @@ -118,21 +118,37 @@ void retransmit_timer(struct timer_list *t) if (qp->valid) { qp->comp.timeout = 1; - rxe_run_task(&qp->comp.task, 1); + rxe_run_work(&qp->comp.work, 1); } } -void rxe_comp_queue_pkt(struct rxe_qp *qp, struct sk_buff *skb) +void rxe_comp_queue_pkt(struct rxe_pkt_info *pkt, struct sk_buff *skb) { + struct rxe_qp *qp = pkt->qp; int must_sched; skb_queue_tail(&qp->resp_pkts, skb); - must_sched = skb_queue_len(&qp->resp_pkts) > 1; + /* Schedule a workqueue when processing READ and ATOMIC acks. + * In these cases, completer may sleep to access ODP-enabled MRs. + */ + switch (pkt->opcode) { + case IB_OPCODE_RC_RDMA_READ_RESPONSE_FIRST: + case IB_OPCODE_RC_RDMA_READ_RESPONSE_LAST: + case IB_OPCODE_RC_RDMA_READ_RESPONSE_ONLY: + case IB_OPCODE_RC_RDMA_READ_RESPONSE_MIDDLE: + case IB_OPCODE_RC_ATOMIC_ACKNOWLEDGE: + must_sched = 1; + break; + + default: + must_sched = skb_queue_len(&qp->resp_pkts) > 1; + } + if (must_sched != 0) rxe_counter_inc(SKB_TO_PKT(skb)->rxe, RXE_CNT_COMPLETER_SCHED); - rxe_run_task(&qp->comp.task, must_sched); + rxe_run_work(&qp->comp.work, must_sched); } static inline enum comp_state get_wqe(struct rxe_qp *qp, @@ -313,7 +329,7 @@ static inline enum comp_state check_ack(struct rxe_qp *qp, qp->comp.psn = pkt->psn; if (qp->req.wait_psn) { qp->req.wait_psn = 0; - rxe_run_task(&qp->req.task, 0); + rxe_run_work(&qp->req.work, 0); } } return COMPST_ERROR_RETRY; @@ -460,7 +476,7 @@ static void do_complete(struct rxe_qp *qp, struct rxe_send_wqe *wqe) */ if (qp->req.wait_fence) { qp->req.wait_fence = 0; - rxe_run_task(&qp->req.task, 0); + rxe_run_work(&qp->req.work, 0); } } @@ -474,7 +490,7 @@ static inline enum comp_state complete_ack(struct rxe_qp *qp, if (qp->req.need_rd_atomic) { qp->comp.timeout_retry = 0; qp->req.need_rd_atomic = 0; - rxe_run_task(&qp->req.task, 0); + rxe_run_work(&qp->req.work, 0); } } @@ -520,7 +536,7 @@ static inline enum comp_state complete_wqe(struct rxe_qp *qp, if (qp->req.wait_psn) { qp->req.wait_psn = 0; - rxe_run_task(&qp->req.task, 1); + rxe_run_work(&qp->req.work, 1); } } @@ -654,7 +670,7 @@ int rxe_completer(void *arg) if (qp->req.wait_psn) { qp->req.wait_psn = 0; - rxe_run_task(&qp->req.task, 1); + rxe_run_work(&qp->req.work, 1); } state = COMPST_DONE; @@ -722,7 +738,7 @@ int rxe_completer(void *arg) RXE_CNT_COMP_RETRY); qp->req.need_retry = 1; qp->comp.started_retry = 1; - rxe_run_task(&qp->req.task, 0); + rxe_run_work(&qp->req.work, 0); } goto done; @@ -765,8 +781,8 @@ int rxe_completer(void *arg) } } - /* A non-zero return value will cause rxe_do_task to - * exit its loop and end the tasklet. A zero return + /* A non-zero return value will cause rxe_do_work to + * exit its loop and end the work. A zero return * will continue looping and return to rxe_completer */ done: diff --git a/drivers/infiniband/sw/rxe/rxe_loc.h b/drivers/infiniband/sw/rxe/rxe_loc.h index c2a5c8814a48..993aa6a8003d 100644 --- a/drivers/infiniband/sw/rxe/rxe_loc.h +++ b/drivers/infiniband/sw/rxe/rxe_loc.h @@ -179,9 +179,9 @@ int rxe_icrc_init(struct rxe_dev *rxe); int rxe_icrc_check(struct sk_buff *skb, struct rxe_pkt_info *pkt); void rxe_icrc_generate(struct sk_buff *skb, struct rxe_pkt_info *pkt); -void rxe_resp_queue_pkt(struct rxe_qp *qp, struct sk_buff *skb); +void rxe_resp_queue_pkt(struct rxe_pkt_info *pkt, struct sk_buff *skb); -void rxe_comp_queue_pkt(struct rxe_qp *qp, struct sk_buff *skb); +void rxe_comp_queue_pkt(struct rxe_pkt_info *pkt, struct sk_buff *skb); static inline unsigned int wr_opcode_mask(int opcode, struct rxe_qp *qp) { diff --git a/drivers/infiniband/sw/rxe/rxe_net.c b/drivers/infiniband/sw/rxe/rxe_net.c index 35f327b9d4b8..4f74341be57a 100644 --- a/drivers/infiniband/sw/rxe/rxe_net.c +++ b/drivers/infiniband/sw/rxe/rxe_net.c @@ -345,7 +345,7 @@ static void rxe_skb_tx_dtor(struct sk_buff *skb) if (unlikely(qp->need_req_skb && skb_out < RXE_INFLIGHT_SKBS_PER_QP_LOW)) - rxe_run_task(&qp->req.task, 1); + rxe_run_work(&qp->req.work, 1); rxe_put(qp); } @@ -429,7 +429,7 @@ int rxe_xmit_packet(struct rxe_qp *qp, struct rxe_pkt_info *pkt, if ((qp_type(qp) != IB_QPT_RC) && (pkt->mask & RXE_END_MASK)) { pkt->wqe->state = wqe_state_done; - rxe_run_task(&qp->comp.task, 1); + rxe_run_work(&qp->comp.work, 1); } rxe_counter_inc(rxe, RXE_CNT_SENT_PKTS); diff --git a/drivers/infiniband/sw/rxe/rxe_param.h b/drivers/infiniband/sw/rxe/rxe_param.h index 86c7a8bf3cbb..1c0251812fc8 100644 --- a/drivers/infiniband/sw/rxe/rxe_param.h +++ b/drivers/infiniband/sw/rxe/rxe_param.h @@ -105,7 +105,7 @@ enum rxe_device_param { RXE_INFLIGHT_SKBS_PER_QP_HIGH = 64, RXE_INFLIGHT_SKBS_PER_QP_LOW = 16, - /* Max number of interations of each tasklet + /* Max number of interations of each work * before yielding the cpu to let other * work make progress */ diff --git a/drivers/infiniband/sw/rxe/rxe_qp.c b/drivers/infiniband/sw/rxe/rxe_qp.c index a62bab88415c..defa4dfe1c98 100644 --- a/drivers/infiniband/sw/rxe/rxe_qp.c +++ b/drivers/infiniband/sw/rxe/rxe_qp.c @@ -13,7 +13,7 @@ #include "rxe.h" #include "rxe_loc.h" #include "rxe_queue.h" -#include "rxe_task.h" +#include "rxe_wq.h" static int rxe_qp_chk_cap(struct rxe_dev *rxe, struct ib_qp_cap *cap, int has_srq) @@ -172,9 +172,9 @@ static void rxe_qp_init_misc(struct rxe_dev *rxe, struct rxe_qp *qp, spin_lock_init(&qp->state_lock); - spin_lock_init(&qp->req.task.state_lock); - spin_lock_init(&qp->resp.task.state_lock); - spin_lock_init(&qp->comp.task.state_lock); + spin_lock_init(&qp->req.work.state_lock); + spin_lock_init(&qp->resp.work.state_lock); + spin_lock_init(&qp->comp.work.state_lock); spin_lock_init(&qp->sq.sq_lock); spin_lock_init(&qp->rq.producer_lock); @@ -242,10 +242,10 @@ static int rxe_qp_init_req(struct rxe_dev *rxe, struct rxe_qp *qp, skb_queue_head_init(&qp->req_pkts); - rxe_init_task(&qp->req.task, qp, - rxe_requester, "req"); - rxe_init_task(&qp->comp.task, qp, - rxe_completer, "comp"); + rxe_init_work(&qp->req.work, qp, + rxe_requester, "rxe_req"); + rxe_init_work(&qp->comp.work, qp, + rxe_completer, "rxe_comp"); qp->qp_timeout_jiffies = 0; /* Can't be set for UD/UC in modify_qp */ if (init->qp_type == IB_QPT_RC) { @@ -292,8 +292,8 @@ static int rxe_qp_init_resp(struct rxe_dev *rxe, struct rxe_qp *qp, skb_queue_head_init(&qp->resp_pkts); - rxe_init_task(&qp->resp.task, qp, - rxe_responder, "resp"); + rxe_init_work(&qp->resp.work, qp, + rxe_responder, "rxe_resp"); qp->resp.opcode = OPCODE_NONE; qp->resp.msn = 0; @@ -480,14 +480,14 @@ int rxe_qp_chk_attr(struct rxe_dev *rxe, struct rxe_qp *qp, /* move the qp to the reset state */ static void rxe_qp_reset(struct rxe_qp *qp) { - /* stop tasks from running */ - rxe_disable_task(&qp->resp.task); + /* flush workqueue and stop works from running */ + rxe_disable_work(&qp->resp.work); /* stop request/comp */ if (qp->sq.queue) { if (qp_type(qp) == IB_QPT_RC) - rxe_disable_task(&qp->comp.task); - rxe_disable_task(&qp->req.task); + rxe_disable_work(&qp->comp.work); + rxe_disable_work(&qp->req.work); } /* move qp to the reset state */ @@ -498,11 +498,11 @@ static void rxe_qp_reset(struct rxe_qp *qp) /* let state machines reset themselves drain work and packet queues * etc. */ - __rxe_do_task(&qp->resp.task); + __rxe_do_work(&qp->resp.work); if (qp->sq.queue) { - __rxe_do_task(&qp->comp.task); - __rxe_do_task(&qp->req.task); + __rxe_do_work(&qp->comp.work); + __rxe_do_work(&qp->req.work); rxe_queue_reset(qp->sq.queue); } @@ -525,14 +525,14 @@ static void rxe_qp_reset(struct rxe_qp *qp) cleanup_rd_atomic_resources(qp); - /* reenable tasks */ - rxe_enable_task(&qp->resp.task); + /* reenable workqueue */ + rxe_enable_work(&qp->resp.work); if (qp->sq.queue) { if (qp_type(qp) == IB_QPT_RC) - rxe_enable_task(&qp->comp.task); + rxe_enable_work(&qp->comp.work); - rxe_enable_task(&qp->req.task); + rxe_enable_work(&qp->req.work); } } @@ -543,10 +543,10 @@ static void rxe_qp_drain(struct rxe_qp *qp) if (qp->req.state != QP_STATE_DRAINED) { qp->req.state = QP_STATE_DRAIN; if (qp_type(qp) == IB_QPT_RC) - rxe_run_task(&qp->comp.task, 1); + rxe_run_work(&qp->comp.work, 1); else - __rxe_do_task(&qp->comp.task); - rxe_run_task(&qp->req.task, 1); + __rxe_do_work(&qp->comp.work); + rxe_run_work(&qp->req.work, 1); } } } @@ -560,13 +560,13 @@ void rxe_qp_error(struct rxe_qp *qp) qp->attr.qp_state = IB_QPS_ERR; /* drain work and packet queues */ - rxe_run_task(&qp->resp.task, 1); + rxe_run_work(&qp->resp.work, 1); if (qp_type(qp) == IB_QPT_RC) - rxe_run_task(&qp->comp.task, 1); + rxe_run_work(&qp->comp.work, 1); else - __rxe_do_task(&qp->comp.task); - rxe_run_task(&qp->req.task, 1); + __rxe_do_work(&qp->comp.work); + rxe_run_work(&qp->req.work, 1); } /* called by the modify qp verb */ @@ -785,23 +785,24 @@ static void rxe_qp_do_cleanup(struct work_struct *work) qp->valid = 0; qp->qp_timeout_jiffies = 0; - rxe_cleanup_task(&qp->resp.task); + rxe_cleanup_work(&qp->resp.work); if (qp_type(qp) == IB_QPT_RC) { del_timer_sync(&qp->retrans_timer); del_timer_sync(&qp->rnr_nak_timer); } - rxe_cleanup_task(&qp->req.task); - rxe_cleanup_task(&qp->comp.task); + rxe_cleanup_work(&qp->req.work); + rxe_cleanup_work(&qp->comp.work); /* flush out any receive wr's or pending requests */ - if (qp->req.task.func) - __rxe_do_task(&qp->req.task); + + if (qp->req.work.func) + __rxe_do_work(&qp->req.work); if (qp->sq.queue) { - __rxe_do_task(&qp->comp.task); - __rxe_do_task(&qp->req.task); + __rxe_do_work(&qp->comp.work); + __rxe_do_work(&qp->req.work); } if (qp->sq.queue) diff --git a/drivers/infiniband/sw/rxe/rxe_recv.c b/drivers/infiniband/sw/rxe/rxe_recv.c index 434a693cd4a5..01d07572a56f 100644 --- a/drivers/infiniband/sw/rxe/rxe_recv.c +++ b/drivers/infiniband/sw/rxe/rxe_recv.c @@ -174,9 +174,9 @@ static int hdr_check(struct rxe_pkt_info *pkt) static inline void rxe_rcv_pkt(struct rxe_pkt_info *pkt, struct sk_buff *skb) { if (pkt->mask & RXE_REQ_MASK) - rxe_resp_queue_pkt(pkt->qp, skb); + rxe_resp_queue_pkt(pkt, skb); else - rxe_comp_queue_pkt(pkt->qp, skb); + rxe_comp_queue_pkt(pkt, skb); } static void rxe_rcv_mcast_pkt(struct rxe_dev *rxe, struct sk_buff *skb) diff --git a/drivers/infiniband/sw/rxe/rxe_req.c b/drivers/infiniband/sw/rxe/rxe_req.c index f63771207970..bd05dd3de499 100644 --- a/drivers/infiniband/sw/rxe/rxe_req.c +++ b/drivers/infiniband/sw/rxe/rxe_req.c @@ -105,7 +105,7 @@ void rnr_nak_timer(struct timer_list *t) /* request a send queue retry */ qp->req.need_retry = 1; qp->req.wait_for_rnr_timer = 0; - rxe_run_task(&qp->req.task, 1); + rxe_run_work(&qp->req.work, 1); } static struct rxe_send_wqe *req_next_wqe(struct rxe_qp *qp) @@ -608,7 +608,7 @@ static int rxe_do_local_ops(struct rxe_qp *qp, struct rxe_send_wqe *wqe) * which can lead to a deadlock. So go ahead and complete * it now. */ - rxe_run_task(&qp->comp.task, 1); + rxe_run_work(&qp->comp.work, 1); return 0; } @@ -733,7 +733,7 @@ int rxe_requester(void *arg) qp->req.wqe_index); wqe->state = wqe_state_done; wqe->status = IB_WC_SUCCESS; - rxe_run_task(&qp->comp.task, 0); + rxe_run_work(&qp->comp.work, 0); goto done; } payload = mtu; @@ -795,7 +795,7 @@ int rxe_requester(void *arg) rollback_state(wqe, qp, &rollback_wqe, rollback_psn); if (err == -EAGAIN) { - rxe_run_task(&qp->req.task, 1); + rxe_run_work(&qp->req.work, 1); goto exit; } @@ -805,8 +805,8 @@ int rxe_requester(void *arg) update_state(qp, &pkt); - /* A non-zero return value will cause rxe_do_task to - * exit its loop and end the tasklet. A zero return + /* A non-zero return value will cause rxe_do_work to + * exit its loop and end the work. A zero return * will continue looping and return to rxe_requester */ done: @@ -817,7 +817,7 @@ int rxe_requester(void *arg) qp->req.wqe_index = queue_next_index(qp->sq.queue, qp->req.wqe_index); wqe->state = wqe_state_error; qp->req.state = QP_STATE_ERROR; - rxe_run_task(&qp->comp.task, 0); + rxe_run_work(&qp->comp.work, 0); exit: ret = -EAGAIN; out: diff --git a/drivers/infiniband/sw/rxe/rxe_resp.c b/drivers/infiniband/sw/rxe/rxe_resp.c index 95d372db934d..91f935d60160 100644 --- a/drivers/infiniband/sw/rxe/rxe_resp.c +++ b/drivers/infiniband/sw/rxe/rxe_resp.c @@ -81,17 +81,17 @@ static char *resp_state_name[] = { }; /* rxe_recv calls here to add a request packet to the input queue */ -void rxe_resp_queue_pkt(struct rxe_qp *qp, struct sk_buff *skb) +void rxe_resp_queue_pkt(struct rxe_pkt_info *pkt, struct sk_buff *skb) { - int must_sched; - struct rxe_pkt_info *pkt = SKB_TO_PKT(skb); + int sched = 1; - skb_queue_tail(&qp->req_pkts, skb); - - must_sched = (pkt->opcode == IB_OPCODE_RC_RDMA_READ_REQUEST) || - (skb_queue_len(&qp->req_pkts) > 1); - - rxe_run_task(&qp->resp.task, must_sched); + /* responder can sleep to access ODP-enabled MRs, Always schedule + * work items for non-zero-byte operations. + */ + if (unlikely(payload_size(pkt) == 0)) + sched = 0; + skb_queue_tail(&pkt->qp->req_pkts, skb); + rxe_run_work(&pkt->qp->resp.work, sched); } static inline enum resp_states get_req(struct rxe_qp *qp, @@ -1454,8 +1454,8 @@ int rxe_responder(void *arg) } } - /* A non-zero return value will cause rxe_do_task to - * exit its loop and end the tasklet. A zero return + /* A non-zero return value will cause rxe_do_work to + * exit its loop and end the work. A zero return * will continue looping and return to rxe_responder */ done: diff --git a/drivers/infiniband/sw/rxe/rxe_verbs.c b/drivers/infiniband/sw/rxe/rxe_verbs.c index 88825edc7dce..786a3583ac21 100644 --- a/drivers/infiniband/sw/rxe/rxe_verbs.c +++ b/drivers/infiniband/sw/rxe/rxe_verbs.c @@ -695,9 +695,9 @@ static int rxe_post_send_kernel(struct rxe_qp *qp, const struct ib_send_wr *wr, wr = next; } - rxe_run_task(&qp->req.task, 1); + rxe_run_work(&qp->req.work, 1); if (unlikely(qp->req.state == QP_STATE_ERROR)) - rxe_run_task(&qp->comp.task, 1); + rxe_run_work(&qp->comp.work, 1); return err; } @@ -719,7 +719,7 @@ static int rxe_post_send(struct ib_qp *ibqp, const struct ib_send_wr *wr, if (qp->is_user) { /* Utilize process context to do protocol processing */ - rxe_run_task(&qp->req.task, 0); + rxe_run_work(&qp->req.work, 0); return 0; } else return rxe_post_send_kernel(qp, wr, bad_wr); @@ -759,7 +759,7 @@ static int rxe_post_recv(struct ib_qp *ibqp, const struct ib_recv_wr *wr, spin_unlock_irqrestore(&rq->producer_lock, flags); if (qp->resp.state == QP_STATE_ERROR) - rxe_run_task(&qp->resp.task, 1); + rxe_run_work(&qp->resp.work, 1); err1: return err; diff --git a/drivers/infiniband/sw/rxe/rxe_verbs.h b/drivers/infiniband/sw/rxe/rxe_verbs.h index 22a299b0a9f0..60d0cdb5465a 100644 --- a/drivers/infiniband/sw/rxe/rxe_verbs.h +++ b/drivers/infiniband/sw/rxe/rxe_verbs.h @@ -10,7 +10,7 @@ #include #include #include "rxe_pool.h" -#include "rxe_task.h" +#include "rxe_wq.h" #include "rxe_hw_counters.h" static inline int pkey_match(u16 key1, u16 key2) @@ -125,7 +125,7 @@ struct rxe_req_info { int need_retry; int wait_for_rnr_timer; int noack_pkts; - struct rxe_task task; + struct rxe_work work; }; struct rxe_comp_info { @@ -137,7 +137,7 @@ struct rxe_comp_info { int started_retry; u32 retry_cnt; u32 rnr_retry; - struct rxe_task task; + struct rxe_work work; }; enum rdatm_res_state { @@ -204,7 +204,7 @@ struct rxe_resp_info { unsigned int res_head; unsigned int res_tail; struct resp_res *res; - struct rxe_task task; + struct rxe_work work; }; struct rxe_qp { diff --git a/drivers/infiniband/sw/rxe/rxe_wq.c b/drivers/infiniband/sw/rxe/rxe_wq.c new file mode 100644 index 000000000000..d976fb413396 --- /dev/null +++ b/drivers/infiniband/sw/rxe/rxe_wq.c @@ -0,0 +1,160 @@ +// SPDX-License-Identifier: GPL-2.0 OR Linux-OpenIB +/* + * Copyright (c) 2016 Mellanox Technologies Ltd. All rights reserved. + * Copyright (c) 2015 System Fabric Works, Inc. All rights reserved. + */ + +#include +#include +#include + +#include "rxe.h" + +int __rxe_do_work(struct rxe_work *work) + +{ + int ret; + + while ((ret = work->func(work->arg)) == 0) + ; + + work->ret = ret; + + return ret; +} + +/* + * this locking is due to a potential race where + * a second caller finds the work already running + * but looks just after the last call to func + */ +void rxe_do_work(struct work_struct *w) +{ + int cont; + int ret; + + struct rxe_work *work = container_of(w, typeof(*work), work); + unsigned int iterations = RXE_MAX_ITERATIONS; + + spin_lock_bh(&work->state_lock); + switch (work->state) { + case WQ_STATE_START: + work->state = WQ_STATE_BUSY; + spin_unlock_bh(&work->state_lock); + break; + + case WQ_STATE_BUSY: + work->state = WQ_STATE_ARMED; + fallthrough; + case WQ_STATE_ARMED: + spin_unlock_bh(&work->state_lock); + return; + + default: + spin_unlock_bh(&work->state_lock); + pr_warn("%s failed with bad state %d\n", __func__, work->state); + return; + } + + do { + cont = 0; + ret = work->func(work->arg); + + spin_lock_bh(&work->state_lock); + switch (work->state) { + case WQ_STATE_BUSY: + if (ret) { + work->state = WQ_STATE_START; + } else if (iterations--) { + cont = 1; + } else { + /* reschedule the work and exit + * the loop to give up the cpu + */ + queue_work(work->worker, &work->work); + work->state = WQ_STATE_START; + } + break; + + /* someone tried to run the work since the last time we called + * func, so we will call one more time regardless of the + * return value + */ + case WQ_STATE_ARMED: + work->state = WQ_STATE_BUSY; + cont = 1; + break; + + default: + pr_warn("%s failed with bad state %d\n", __func__, + work->state); + } + spin_unlock_bh(&work->state_lock); + } while (cont); + + work->ret = ret; +} + +int rxe_init_work(struct rxe_work *work, + void *arg, int (*func)(void *), char *name) +{ + work->arg = arg; + work->func = func; + snprintf(work->name, sizeof(work->name), "%s", name); + work->destroyed = false; + atomic_set(&work->suspended, 0); + + work->worker = create_singlethread_workqueue(name); + INIT_WORK(&work->work, rxe_do_work); + + work->state = WQ_STATE_START; + spin_lock_init(&work->state_lock); + + return 0; +} + +void rxe_cleanup_work(struct rxe_work *work) +{ + bool idle; + + /* + * Mark the work, then wait for it to finish. It might be + * running in a non-workqueue (direct call) context. + */ + work->destroyed = true; + flush_workqueue(work->worker); + + do { + spin_lock_bh(&work->state_lock); + idle = (work->state == WQ_STATE_START); + spin_unlock_bh(&work->state_lock); + } while (!idle); + + destroy_workqueue(work->worker); +} + +void rxe_run_work(struct rxe_work *work, int sched) +{ + if (work->destroyed) + return; + + /* busy-loop while qp reset is in progress */ + while (atomic_read(&work->suspended)) + continue; + + if (sched) + queue_work(work->worker, &work->work); + else + rxe_do_work(&work->work); +} + +void rxe_disable_work(struct rxe_work *work) +{ + atomic_inc(&work->suspended); + flush_workqueue(work->worker); +} + +void rxe_enable_work(struct rxe_work *work) +{ + atomic_dec(&work->suspended); +} diff --git a/drivers/infiniband/sw/rxe/rxe_wq.h b/drivers/infiniband/sw/rxe/rxe_wq.h new file mode 100644 index 000000000000..03b5579cf2ca --- /dev/null +++ b/drivers/infiniband/sw/rxe/rxe_wq.h @@ -0,0 +1,70 @@ +/* SPDX-License-Identifier: GPL-2.0 OR Linux-OpenIB */ +/* + * Copyright (c) 2016 Mellanox Technologies Ltd. All rights reserved. + * Copyright (c) 2015 System Fabric Works, Inc. All rights reserved. + */ + +#ifndef RXE_WQ_H +#define RXE_WQ_H + +enum { + WQ_STATE_START = 0, + WQ_STATE_BUSY = 1, + WQ_STATE_ARMED = 2, +}; + +/* + * data structure to describe a 'work' which is a short + * function that returns 0 as long as it needs to be + * called again. + */ +struct rxe_work { + struct workqueue_struct *worker; + struct work_struct work; + int state; + spinlock_t state_lock; /* spinlock for work state */ + void *arg; + int (*func)(void *arg); + int ret; + char name[16]; + bool destroyed; + atomic_t suspended; /* used to {dis,en}able workqueue */ +}; + +/* + * init rxe_work structure + * arg => parameter to pass to fcn + * func => function to call until it returns != 0 + */ +int rxe_init_work(struct rxe_work *work, + void *arg, int (*func)(void *), char *name); + +/* cleanup work */ +void rxe_cleanup_work(struct rxe_work *work); + +/* + * raw call to func in loop without any checking + * can call when workqueues are suspended. + */ +int __rxe_do_work(struct rxe_work *work); + +/* + * common function called by any of the main workqueues + * If there is any chance that there is additional + * work to do someone must reschedule the work before + * leaving + */ +void rxe_do_work(struct work_struct *w); + +/* run a work, else schedule it to run as a workqueue, The decision + * to run or schedule workqueue is based on the parameter sched. + */ +void rxe_run_work(struct rxe_work *work, int sched); + +/* keep a work from scheduling */ +void rxe_disable_work(struct rxe_work *work); + +/* allow work to run */ +void rxe_enable_work(struct rxe_work *work); + +#endif /* RXE_WQ_H */ From patchwork Fri Nov 11 09:22:24 2022 Content-Type: text/plain; charset="utf-8" MIME-Version: 1.0 Content-Transfer-Encoding: 7bit X-Patchwork-Submitter: "Daisuke Matsuda (Fujitsu)" X-Patchwork-Id: 13039968 Received: from esa4.hc1455-7.c3s2.iphmx.com (esa4.hc1455-7.c3s2.iphmx.com [68.232.139.117]) (using TLSv1.2 with cipher ECDHE-RSA-AES256-GCM-SHA384 (256/256 bits)) (No client certificate requested) by smtp.subspace.kernel.org (Postfix) with ESMTPS id 66A7E2590 for ; Fri, 11 Nov 2022 09:25:09 +0000 (UTC) X-IronPort-AV: E=McAfee;i="6500,9779,10527"; a="95511337" X-IronPort-AV: E=Sophos;i="5.96,156,1665414000"; d="scan'208";a="95511337" Received: from unknown (HELO oym-r3.gw.nic.fujitsu.com) ([210.162.30.91]) by esa4.hc1455-7.c3s2.iphmx.com with ESMTP; 11 Nov 2022 18:23:59 +0900 Received: from oym-m2.gw.nic.fujitsu.com (oym-nat-oym-m2.gw.nic.fujitsu.com [192.168.87.59]) by oym-r3.gw.nic.fujitsu.com (Postfix) with ESMTP id CDF89D647B for ; Fri, 11 Nov 2022 18:23:56 +0900 (JST) Received: from m3004.s.css.fujitsu.com (m3004.s.css.fujitsu.com [10.128.233.124]) by oym-m2.gw.nic.fujitsu.com (Postfix) with ESMTP id 021A6BCB78 for ; Fri, 11 Nov 2022 18:23:56 +0900 (JST) Received: from localhost.localdomain (unknown [10.19.3.107]) by m3004.s.css.fujitsu.com (Postfix) with ESMTP id AFF8D20607A2; Fri, 11 Nov 2022 18:23:55 +0900 (JST) From: Daisuke Matsuda To: linux-rdma@vger.kernel.org, leonro@nvidia.com, jgg@nvidia.com, zyjzyj2000@gmail.com Cc: nvdimm@lists.linux.dev, linux-kernel@vger.kernel.org, rpearsonhpe@gmail.com, yangx.jy@fujitsu.com, lizhijian@fujitsu.com, y-goto@fujitsu.com, Daisuke Matsuda Subject: [RFC PATCH v2 3/7] RDMA/rxe: Cleanup code for responder Atomic operations Date: Fri, 11 Nov 2022 18:22:24 +0900 Message-Id: <7f0b8e11ed7dfcbfb8ce6198680745cc6d106783.1668157436.git.matsuda-daisuke@fujitsu.com> X-Mailer: git-send-email 2.31.1 In-Reply-To: References: Precedence: bulk X-Mailing-List: nvdimm@lists.linux.dev List-Id: List-Subscribe: List-Unsubscribe: MIME-Version: 1.0 X-TM-AS-GCONF: 00 Currently, rxe_responder() directly calls the function to execute Atomic operations. This need to be modified to insert some conditional branches for the new RDMA Write operation and the ODP feature. Signed-off-by: Daisuke Matsuda --- drivers/infiniband/sw/rxe/rxe_resp.c | 102 +++++++++++++++++---------- 1 file changed, 64 insertions(+), 38 deletions(-) diff --git a/drivers/infiniband/sw/rxe/rxe_resp.c b/drivers/infiniband/sw/rxe/rxe_resp.c index 91f935d60160..03e54cb37d44 100644 --- a/drivers/infiniband/sw/rxe/rxe_resp.c +++ b/drivers/infiniband/sw/rxe/rxe_resp.c @@ -595,60 +595,86 @@ static struct resp_res *rxe_prepare_res(struct rxe_qp *qp, /* Guarantee atomicity of atomic operations at the machine level. */ static DEFINE_SPINLOCK(atomic_ops_lock); -static enum resp_states atomic_reply(struct rxe_qp *qp, - struct rxe_pkt_info *pkt) +enum resp_states rxe_process_atomic(struct rxe_qp *qp, + struct rxe_pkt_info *pkt, u64 *vaddr) { - u64 *vaddr; enum resp_states ret; - struct rxe_mr *mr = qp->resp.mr; struct resp_res *res = qp->resp.res; u64 value; - if (!res) { - res = rxe_prepare_res(qp, pkt, RXE_ATOMIC_MASK); - qp->resp.res = res; + /* check vaddr is 8 bytes aligned. */ + if (!vaddr || (uintptr_t)vaddr & 7) { + ret = RESPST_ERR_MISALIGNED_ATOMIC; + goto out; } - if (!res->replay) { - if (mr->state != RXE_MR_STATE_VALID) { - ret = RESPST_ERR_RKEY_VIOLATION; - goto out; - } + spin_lock(&atomic_ops_lock); + res->atomic.orig_val = value = *vaddr; - vaddr = iova_to_vaddr(mr, qp->resp.va + qp->resp.offset, - sizeof(u64)); + if (pkt->opcode == IB_OPCODE_RC_COMPARE_SWAP) { + if (value == atmeth_comp(pkt)) + value = atmeth_swap_add(pkt); + } else { + value += atmeth_swap_add(pkt); + } - /* check vaddr is 8 bytes aligned. */ - if (!vaddr || (uintptr_t)vaddr & 7) { - ret = RESPST_ERR_MISALIGNED_ATOMIC; - goto out; - } + *vaddr = value; + spin_unlock(&atomic_ops_lock); - spin_lock_bh(&atomic_ops_lock); - res->atomic.orig_val = value = *vaddr; + qp->resp.msn++; - if (pkt->opcode == IB_OPCODE_RC_COMPARE_SWAP) { - if (value == atmeth_comp(pkt)) - value = atmeth_swap_add(pkt); - } else { - value += atmeth_swap_add(pkt); - } + /* next expected psn, read handles this separately */ + qp->resp.psn = (pkt->psn + 1) & BTH_PSN_MASK; + qp->resp.ack_psn = qp->resp.psn; - *vaddr = value; - spin_unlock_bh(&atomic_ops_lock); + qp->resp.opcode = pkt->opcode; + qp->resp.status = IB_WC_SUCCESS; - qp->resp.msn++; + ret = RESPST_ACKNOWLEDGE; +out: + return ret; +} - /* next expected psn, read handles this separately */ - qp->resp.psn = (pkt->psn + 1) & BTH_PSN_MASK; - qp->resp.ack_psn = qp->resp.psn; +static enum resp_states rxe_atomic_ops(struct rxe_qp *qp, + struct rxe_pkt_info *pkt, + struct rxe_mr *mr) +{ + u64 *vaddr; + int ret; - qp->resp.opcode = pkt->opcode; - qp->resp.status = IB_WC_SUCCESS; + vaddr = iova_to_vaddr(mr, qp->resp.va + qp->resp.offset, + sizeof(u64)); + + if (pkt->mask & RXE_ATOMIC_MASK) { + ret = rxe_process_atomic(qp, pkt, vaddr); + } else { + /*ATOMIC WRITE operation will come here. */ + ret = RESPST_ERR_UNSUPPORTED_OPCODE; } - ret = RESPST_ACKNOWLEDGE; -out: + return ret; +} + +static enum resp_states rxe_atomic_reply(struct rxe_qp *qp, + struct rxe_pkt_info *pkt) +{ + struct rxe_mr *mr = qp->resp.mr; + struct resp_res *res = qp->resp.res; + int ret; + + if (!res) { + res = rxe_prepare_res(qp, pkt, RXE_ATOMIC_MASK); + qp->resp.res = res; + } + + if (!res->replay) { + if (mr->state != RXE_MR_STATE_VALID) + return RESPST_ERR_RKEY_VIOLATION; + + ret = rxe_atomic_ops(qp, pkt, mr); + } else + ret = RESPST_ACKNOWLEDGE; + return ret; } @@ -1321,7 +1347,7 @@ int rxe_responder(void *arg) state = read_reply(qp, pkt); break; case RESPST_ATOMIC_REPLY: - state = atomic_reply(qp, pkt); + state = rxe_atomic_reply(qp, pkt); break; case RESPST_ACKNOWLEDGE: state = acknowledge(qp, pkt); From patchwork Fri Nov 11 09:22:25 2022 Content-Type: text/plain; charset="utf-8" MIME-Version: 1.0 Content-Transfer-Encoding: 7bit X-Patchwork-Submitter: "Daisuke Matsuda (Fujitsu)" X-Patchwork-Id: 13039965 Received: from esa12.hc1455-7.c3s2.iphmx.com (esa12.hc1455-7.c3s2.iphmx.com [139.138.37.100]) (using TLSv1.2 with cipher ECDHE-RSA-AES256-GCM-SHA384 (256/256 bits)) (No client certificate requested) by smtp.subspace.kernel.org (Postfix) with ESMTPS id 887D52108 for ; Fri, 11 Nov 2022 09:25:05 +0000 (UTC) X-IronPort-AV: E=McAfee;i="6500,9779,10527"; a="75166324" X-IronPort-AV: E=Sophos;i="5.96,156,1665414000"; d="scan'208";a="75166324" Received: from unknown (HELO yto-r2.gw.nic.fujitsu.com) ([218.44.52.218]) by esa12.hc1455-7.c3s2.iphmx.com with ESMTP; 11 Nov 2022 18:23:58 +0900 Received: from yto-m2.gw.nic.fujitsu.com (yto-nat-yto-m2.gw.nic.fujitsu.com [192.168.83.65]) by yto-r2.gw.nic.fujitsu.com (Postfix) with ESMTP id 028B4DE50F for ; Fri, 11 Nov 2022 18:23:58 +0900 (JST) Received: from m3004.s.css.fujitsu.com (m3004.s.css.fujitsu.com [10.128.233.124]) by yto-m2.gw.nic.fujitsu.com (Postfix) with ESMTP id 50C03E535F for ; Fri, 11 Nov 2022 18:23:57 +0900 (JST) Received: from localhost.localdomain (unknown [10.19.3.107]) by m3004.s.css.fujitsu.com (Postfix) with ESMTP id 19F8C20607A2; Fri, 11 Nov 2022 18:23:57 +0900 (JST) From: Daisuke Matsuda To: linux-rdma@vger.kernel.org, leonro@nvidia.com, jgg@nvidia.com, zyjzyj2000@gmail.com Cc: nvdimm@lists.linux.dev, linux-kernel@vger.kernel.org, rpearsonhpe@gmail.com, yangx.jy@fujitsu.com, lizhijian@fujitsu.com, y-goto@fujitsu.com, Daisuke Matsuda Subject: [RFC PATCH v2 4/7] RDMA/rxe: Add page invalidation support Date: Fri, 11 Nov 2022 18:22:25 +0900 Message-Id: X-Mailer: git-send-email 2.31.1 In-Reply-To: References: Precedence: bulk X-Mailing-List: nvdimm@lists.linux.dev List-Id: List-Subscribe: List-Unsubscribe: MIME-Version: 1.0 X-TM-AS-GCONF: 00 On page invalidation, an MMU notifier callback is invoked to unmap DMA addresses and update umem_odp->dma_list. The callback is registered when an ODP-enabled MR is created. Signed-off-by: Daisuke Matsuda --- drivers/infiniband/sw/rxe/Makefile | 3 ++- drivers/infiniband/sw/rxe/rxe_odp.c | 34 +++++++++++++++++++++++++++++ 2 files changed, 36 insertions(+), 1 deletion(-) create mode 100644 drivers/infiniband/sw/rxe/rxe_odp.c diff --git a/drivers/infiniband/sw/rxe/Makefile b/drivers/infiniband/sw/rxe/Makefile index 358f6b06aa64..924f4acb2816 100644 --- a/drivers/infiniband/sw/rxe/Makefile +++ b/drivers/infiniband/sw/rxe/Makefile @@ -22,4 +22,5 @@ rdma_rxe-y := \ rxe_mcast.o \ rxe_wq.o \ rxe_net.o \ - rxe_hw_counters.o + rxe_hw_counters.o \ + rxe_odp.o diff --git a/drivers/infiniband/sw/rxe/rxe_odp.c b/drivers/infiniband/sw/rxe/rxe_odp.c new file mode 100644 index 000000000000..0787a9b19646 --- /dev/null +++ b/drivers/infiniband/sw/rxe/rxe_odp.c @@ -0,0 +1,34 @@ +// SPDX-License-Identifier: GPL-2.0 OR Linux-OpenIB +/* + * Copyright (c) 2022 Fujitsu Ltd. All rights reserved. + */ + +#include + +static bool rxe_ib_invalidate_range(struct mmu_interval_notifier *mni, + const struct mmu_notifier_range *range, + unsigned long cur_seq) +{ + struct ib_umem_odp *umem_odp = + container_of(mni, struct ib_umem_odp, notifier); + unsigned long start; + unsigned long end; + + if (!mmu_notifier_range_blockable(range)) + return false; + + mutex_lock(&umem_odp->umem_mutex); + mmu_interval_set_seq(mni, cur_seq); + + start = max_t(u64, ib_umem_start(umem_odp), range->start); + end = min_t(u64, ib_umem_end(umem_odp), range->end); + + ib_umem_odp_unmap_dma_pages(umem_odp, start, end); + + mutex_unlock(&umem_odp->umem_mutex); + return true; +} + +const struct mmu_interval_notifier_ops rxe_mn_ops = { + .invalidate = rxe_ib_invalidate_range, +}; From patchwork Fri Nov 11 09:22:26 2022 Content-Type: text/plain; charset="utf-8" MIME-Version: 1.0 Content-Transfer-Encoding: 7bit X-Patchwork-Submitter: "Daisuke Matsuda (Fujitsu)" X-Patchwork-Id: 13039969 Received: from esa2.hc1455-7.c3s2.iphmx.com (esa2.hc1455-7.c3s2.iphmx.com [207.54.90.48]) (using TLSv1.2 with cipher ECDHE-RSA-AES256-GCM-SHA384 (256/256 bits)) (No client certificate requested) by smtp.subspace.kernel.org (Postfix) with ESMTPS id 83E162C80 for ; Fri, 11 Nov 2022 09:25:12 +0000 (UTC) X-IronPort-AV: E=McAfee;i="6500,9779,10527"; a="95532463" X-IronPort-AV: E=Sophos;i="5.96,156,1665414000"; d="scan'208";a="95532463" Received: from unknown (HELO yto-r4.gw.nic.fujitsu.com) ([218.44.52.220]) by esa2.hc1455-7.c3s2.iphmx.com with ESMTP; 11 Nov 2022 18:24:00 +0900 Received: from yto-m2.gw.nic.fujitsu.com (yto-nat-yto-m2.gw.nic.fujitsu.com [192.168.83.65]) by yto-r4.gw.nic.fujitsu.com (Postfix) with ESMTP id 588A3D3EA3 for ; Fri, 11 Nov 2022 18:23:59 +0900 (JST) Received: from m3004.s.css.fujitsu.com (m3004.s.css.fujitsu.com [10.128.233.124]) by yto-m2.gw.nic.fujitsu.com (Postfix) with ESMTP id 986949FDD4 for ; Fri, 11 Nov 2022 18:23:58 +0900 (JST) Received: from localhost.localdomain (unknown [10.19.3.107]) by m3004.s.css.fujitsu.com (Postfix) with ESMTP id 61DD920607A2; Fri, 11 Nov 2022 18:23:58 +0900 (JST) From: Daisuke Matsuda To: linux-rdma@vger.kernel.org, leonro@nvidia.com, jgg@nvidia.com, zyjzyj2000@gmail.com Cc: nvdimm@lists.linux.dev, linux-kernel@vger.kernel.org, rpearsonhpe@gmail.com, yangx.jy@fujitsu.com, lizhijian@fujitsu.com, y-goto@fujitsu.com, Daisuke Matsuda Subject: [RFC PATCH v2 5/7] RDMA/rxe: Allow registering MRs for On-Demand Paging Date: Fri, 11 Nov 2022 18:22:26 +0900 Message-Id: X-Mailer: git-send-email 2.31.1 In-Reply-To: References: Precedence: bulk X-Mailing-List: nvdimm@lists.linux.dev List-Id: List-Subscribe: List-Unsubscribe: MIME-Version: 1.0 X-TM-AS-GCONF: 00 Allow applications to register an ODP-enabled MR, in which case the flag IB_ACCESS_ON_DEMAND is passed to rxe_reg_user_mr(). However, there is no RDMA operation supported right now. They will be enabled later in the subsequent two patches. rxe_odp_do_pagefault() is called to initialize an ODP-enabled MR. It syncs process address space from the CPU page table to the driver page table (dma_list/pfn_list in umem_odp) when called with RXE_PAGEFAULT_SNAPSHOT flag. Additionally, It can be used to trigger page fault when pages being accessed are not present or do not have proper read/write permissions, and possibly to prefetch pages in the future. Signed-off-by: Daisuke Matsuda --- drivers/infiniband/sw/rxe/rxe.c | 7 +++ drivers/infiniband/sw/rxe/rxe_loc.h | 5 ++ drivers/infiniband/sw/rxe/rxe_mr.c | 7 ++- drivers/infiniband/sw/rxe/rxe_odp.c | 81 +++++++++++++++++++++++++++ drivers/infiniband/sw/rxe/rxe_resp.c | 25 +++++++-- drivers/infiniband/sw/rxe/rxe_verbs.c | 8 ++- drivers/infiniband/sw/rxe/rxe_verbs.h | 2 + 7 files changed, 126 insertions(+), 9 deletions(-) diff --git a/drivers/infiniband/sw/rxe/rxe.c b/drivers/infiniband/sw/rxe/rxe.c index 51daac5c4feb..0719f451253c 100644 --- a/drivers/infiniband/sw/rxe/rxe.c +++ b/drivers/infiniband/sw/rxe/rxe.c @@ -73,6 +73,13 @@ static void rxe_init_device_param(struct rxe_dev *rxe) rxe->ndev->dev_addr); rxe->max_ucontext = RXE_MAX_UCONTEXT; + + if (IS_ENABLED(CONFIG_INFINIBAND_ON_DEMAND_PAGING)) { + rxe->attr.kernel_cap_flags |= IBK_ON_DEMAND_PAGING; + + /* IB_ODP_SUPPORT_IMPLICIT is not supported right now. */ + rxe->attr.odp_caps.general_caps |= IB_ODP_SUPPORT; + } } /* initialize port attributes */ diff --git a/drivers/infiniband/sw/rxe/rxe_loc.h b/drivers/infiniband/sw/rxe/rxe_loc.h index 993aa6a8003d..3cf830ee2081 100644 --- a/drivers/infiniband/sw/rxe/rxe_loc.h +++ b/drivers/infiniband/sw/rxe/rxe_loc.h @@ -64,6 +64,7 @@ int rxe_mmap(struct ib_ucontext *context, struct vm_area_struct *vma); /* rxe_mr.c */ u8 rxe_get_next_key(u32 last_key); +void rxe_mr_init(int access, struct rxe_mr *mr); void rxe_mr_init_dma(int access, struct rxe_mr *mr); int rxe_mr_init_user(struct rxe_dev *rxe, u64 start, u64 length, u64 iova, int access, struct rxe_mr *mr); @@ -188,4 +189,8 @@ static inline unsigned int wr_opcode_mask(int opcode, struct rxe_qp *qp) return rxe_wr_opcode_info[opcode].mask[qp->ibqp.qp_type]; } +/* rxe_odp.c */ +int rxe_create_user_odp_mr(struct ib_pd *pd, u64 start, u64 length, u64 iova, + int access_flags, struct rxe_mr *mr); + #endif /* RXE_LOC_H */ diff --git a/drivers/infiniband/sw/rxe/rxe_mr.c b/drivers/infiniband/sw/rxe/rxe_mr.c index d4f10c2d1aa7..dd0d68d61bc4 100644 --- a/drivers/infiniband/sw/rxe/rxe_mr.c +++ b/drivers/infiniband/sw/rxe/rxe_mr.c @@ -48,7 +48,7 @@ int mr_check_range(struct rxe_mr *mr, u64 iova, size_t length) | IB_ACCESS_REMOTE_WRITE \ | IB_ACCESS_REMOTE_ATOMIC) -static void rxe_mr_init(int access, struct rxe_mr *mr) +void rxe_mr_init(int access, struct rxe_mr *mr) { u32 lkey = mr->elem.index << 8 | rxe_get_next_key(-1); u32 rkey = (access & IB_ACCESS_REMOTE) ? lkey : 0; @@ -433,7 +433,10 @@ int copy_data( if (bytes > 0) { iova = sge->addr + offset; - err = rxe_mr_copy(mr, iova, addr, bytes, dir); + if (mr->odp_enabled) + err = -EOPNOTSUPP; + else + err = rxe_mr_copy(mr, iova, addr, bytes, dir); if (err) goto err2; diff --git a/drivers/infiniband/sw/rxe/rxe_odp.c b/drivers/infiniband/sw/rxe/rxe_odp.c index 0787a9b19646..50766889f61a 100644 --- a/drivers/infiniband/sw/rxe/rxe_odp.c +++ b/drivers/infiniband/sw/rxe/rxe_odp.c @@ -5,6 +5,8 @@ #include +#include "rxe.h" + static bool rxe_ib_invalidate_range(struct mmu_interval_notifier *mni, const struct mmu_notifier_range *range, unsigned long cur_seq) @@ -32,3 +34,82 @@ static bool rxe_ib_invalidate_range(struct mmu_interval_notifier *mni, const struct mmu_interval_notifier_ops rxe_mn_ops = { .invalidate = rxe_ib_invalidate_range, }; + +#define RXE_PAGEFAULT_RDONLY BIT(1) +#define RXE_PAGEFAULT_SNAPSHOT BIT(2) +static int rxe_odp_do_pagefault(struct rxe_mr *mr, u64 user_va, int bcnt, u32 flags) +{ + int np; + u64 access_mask; + bool fault = !(flags & RXE_PAGEFAULT_SNAPSHOT); + struct ib_umem_odp *umem_odp = to_ib_umem_odp(mr->umem); + + access_mask = ODP_READ_ALLOWED_BIT; + if (umem_odp->umem.writable && !(flags & RXE_PAGEFAULT_RDONLY)) + access_mask |= ODP_WRITE_ALLOWED_BIT; + + /* + * umem mutex is always locked in ib_umem_odp_map_dma_and_lock(). + * Callers must release the lock later to let invalidation handler + * do its work again. + */ + np = ib_umem_odp_map_dma_and_lock(umem_odp, user_va, bcnt, + access_mask, fault); + + return np; +} + +static int rxe_init_odp_mr(struct rxe_mr *mr) +{ + int ret; + struct ib_umem_odp *umem_odp = to_ib_umem_odp(mr->umem); + + ret = rxe_odp_do_pagefault(mr, mr->umem->address, mr->umem->length, + RXE_PAGEFAULT_SNAPSHOT); + mutex_unlock(&umem_odp->umem_mutex); + + return ret >= 0 ? 0 : ret; +} + +int rxe_create_user_odp_mr(struct ib_pd *pd, u64 start, u64 length, u64 iova, + int access_flags, struct rxe_mr *mr) +{ + int err; + struct ib_umem_odp *umem_odp; + struct rxe_dev *dev = container_of(pd->device, struct rxe_dev, ib_dev); + + if (!IS_ENABLED(CONFIG_INFINIBAND_ON_DEMAND_PAGING)) + return -EOPNOTSUPP; + + rxe_mr_init(access_flags, mr); + + if (!start && length == U64_MAX) { + if (iova != 0) + return -EINVAL; + if (!(dev->attr.odp_caps.general_caps & IB_ODP_SUPPORT_IMPLICIT)) + return -EINVAL; + + /* Never reach here, for implicit ODP is not implemented. */ + } + + umem_odp = ib_umem_odp_get(pd->device, start, length, access_flags, + &rxe_mn_ops); + if (IS_ERR(umem_odp)) + return PTR_ERR(umem_odp); + + umem_odp->private = mr; + + mr->odp_enabled = true; + mr->ibmr.pd = pd; + mr->umem = &umem_odp->umem; + mr->access = access_flags; + mr->ibmr.length = length; + mr->ibmr.iova = iova; + mr->offset = ib_umem_offset(&umem_odp->umem); + mr->state = RXE_MR_STATE_VALID; + mr->ibmr.type = IB_MR_TYPE_USER; + + err = rxe_init_odp_mr(mr); + + return err; +} diff --git a/drivers/infiniband/sw/rxe/rxe_resp.c b/drivers/infiniband/sw/rxe/rxe_resp.c index 03e54cb37d44..4beb18f8bea8 100644 --- a/drivers/infiniband/sw/rxe/rxe_resp.c +++ b/drivers/infiniband/sw/rxe/rxe_resp.c @@ -539,8 +539,16 @@ static enum resp_states write_data_in(struct rxe_qp *qp, int err; int data_len = payload_size(pkt); - err = rxe_mr_copy(qp->resp.mr, qp->resp.va + qp->resp.offset, - payload_addr(pkt), data_len, RXE_TO_MR_OBJ); + /* resp.mr is not set in check_rkey() for zero byte operations */ + if (data_len == 0) + goto out; + + if (qp->resp.mr->odp_enabled) + err = -EOPNOTSUPP; + else + err = rxe_mr_copy(qp->resp.mr, qp->resp.va + qp->resp.offset, + payload_addr(pkt), data_len, RXE_TO_MR_OBJ); + if (err) { rc = RESPST_ERR_RKEY_VIOLATION; goto out; @@ -671,7 +679,10 @@ static enum resp_states rxe_atomic_reply(struct rxe_qp *qp, if (mr->state != RXE_MR_STATE_VALID) return RESPST_ERR_RKEY_VIOLATION; - ret = rxe_atomic_ops(qp, pkt, mr); + if (mr->odp_enabled) + ret = RESPST_ERR_UNSUPPORTED_OPCODE; + else + ret = rxe_atomic_ops(qp, pkt, mr); } else ret = RESPST_ACKNOWLEDGE; @@ -835,8 +846,12 @@ static enum resp_states read_reply(struct rxe_qp *qp, if (!skb) return RESPST_ERR_RNR; - err = rxe_mr_copy(mr, res->read.va, payload_addr(&ack_pkt), - payload, RXE_FROM_MR_OBJ); + /* mr is NULL for a zero byte operation. */ + if ((res->read.resid != 0) && mr->odp_enabled) + err = -EOPNOTSUPP; + else + err = rxe_mr_copy(mr, res->read.va, payload_addr(&ack_pkt), + payload, RXE_FROM_MR_OBJ); rxe_put(mr); if (err) { kfree_skb(skb); diff --git a/drivers/infiniband/sw/rxe/rxe_verbs.c b/drivers/infiniband/sw/rxe/rxe_verbs.c index 786a3583ac21..0876b17c83c1 100644 --- a/drivers/infiniband/sw/rxe/rxe_verbs.c +++ b/drivers/infiniband/sw/rxe/rxe_verbs.c @@ -926,11 +926,15 @@ static struct ib_mr *rxe_reg_user_mr(struct ib_pd *ibpd, goto err2; } - rxe_get(pd); mr->ibmr.pd = ibpd; - err = rxe_mr_init_user(rxe, start, length, iova, access, mr); + if (access & IB_ACCESS_ON_DEMAND) + err = rxe_create_user_odp_mr(&pd->ibpd, start, length, iova, + access, mr); + else + err = rxe_mr_init_user(rxe, start, length, iova, access, mr); + if (err) goto err3; diff --git a/drivers/infiniband/sw/rxe/rxe_verbs.h b/drivers/infiniband/sw/rxe/rxe_verbs.h index 60d0cdb5465a..02d079d9dc54 100644 --- a/drivers/infiniband/sw/rxe/rxe_verbs.h +++ b/drivers/infiniband/sw/rxe/rxe_verbs.h @@ -321,6 +321,8 @@ struct rxe_mr { atomic_t num_mw; struct rxe_map **map; + + bool odp_enabled; }; enum rxe_mw_state { From patchwork Fri Nov 11 09:22:27 2022 Content-Type: text/plain; charset="utf-8" MIME-Version: 1.0 Content-Transfer-Encoding: 7bit X-Patchwork-Submitter: "Daisuke Matsuda (Fujitsu)" X-Patchwork-Id: 13039966 Received: from esa12.hc1455-7.c3s2.iphmx.com (esa12.hc1455-7.c3s2.iphmx.com [139.138.37.100]) (using TLSv1.2 with cipher ECDHE-RSA-AES256-GCM-SHA384 (256/256 bits)) (No client certificate requested) by smtp.subspace.kernel.org (Postfix) with ESMTPS id 8C52E210C for ; Fri, 11 Nov 2022 09:25:05 +0000 (UTC) X-IronPort-AV: E=McAfee;i="6500,9779,10527"; a="75166335" X-IronPort-AV: E=Sophos;i="5.96,156,1665414000"; d="scan'208";a="75166335" Received: from unknown (HELO oym-r3.gw.nic.fujitsu.com) ([210.162.30.91]) by esa12.hc1455-7.c3s2.iphmx.com with ESMTP; 11 Nov 2022 18:24:01 +0900 Received: from oym-m3.gw.nic.fujitsu.com (oym-nat-oym-m3.gw.nic.fujitsu.com [192.168.87.60]) by oym-r3.gw.nic.fujitsu.com (Postfix) with ESMTP id 6B738D6478 for ; Fri, 11 Nov 2022 18:24:01 +0900 (JST) Received: from m3004.s.css.fujitsu.com (m3004.s.css.fujitsu.com [10.128.233.124]) by oym-m3.gw.nic.fujitsu.com (Postfix) with ESMTP id 9CA8ED562F for ; Fri, 11 Nov 2022 18:24:00 +0900 (JST) Received: from localhost.localdomain (unknown [10.19.3.107]) by m3004.s.css.fujitsu.com (Postfix) with ESMTP id 523E120607A2; Fri, 11 Nov 2022 18:24:00 +0900 (JST) From: Daisuke Matsuda To: linux-rdma@vger.kernel.org, leonro@nvidia.com, jgg@nvidia.com, zyjzyj2000@gmail.com Cc: nvdimm@lists.linux.dev, linux-kernel@vger.kernel.org, rpearsonhpe@gmail.com, yangx.jy@fujitsu.com, lizhijian@fujitsu.com, y-goto@fujitsu.com, Daisuke Matsuda Subject: [RFC PATCH v2 6/7] RDMA/rxe: Add support for Send/Recv/Write/Read operations with ODP Date: Fri, 11 Nov 2022 18:22:27 +0900 Message-Id: <8ef75905105adca00151fd1371a44c5a80e68532.1668157436.git.matsuda-daisuke@fujitsu.com> X-Mailer: git-send-email 2.31.1 In-Reply-To: References: Precedence: bulk X-Mailing-List: nvdimm@lists.linux.dev List-Id: List-Subscribe: List-Unsubscribe: MIME-Version: 1.0 X-TM-AS-GCONF: 00 rxe_mr_copy() is used widely to copy data to/from a user MR. requester uses it to load payloads of requesting packets; responder uses it to process Send, Write, and Read operaetions; completer uses it to copy data from response packets of Read and Atomic operations to a user MR. Allow these operations to be used with ODP by adding a counterpart function rxe_odp_mr_copy(). It is comprised of the following steps: 1. Check the driver page table(umem_odp->dma_list) to see if pages being accessed are present with appropriate permission. 2. If necessary, trigger page fault to map the pages. 3. Convert their user space addresses to kernel logical addresses using PFNs in the driver page table(umem_odp->pfn_list). 4. Execute data copy fo/from the pages. umem_mutex is used to ensure that dma_list (an array of addresses of an MR) is not changed while it is checked and that mapped pages are not invalidated before data copy completes. Signed-off-by: Daisuke Matsuda --- drivers/infiniband/sw/rxe/rxe.c | 10 ++ drivers/infiniband/sw/rxe/rxe_loc.h | 2 + drivers/infiniband/sw/rxe/rxe_mr.c | 2 +- drivers/infiniband/sw/rxe/rxe_odp.c | 176 +++++++++++++++++++++++++++ drivers/infiniband/sw/rxe/rxe_resp.c | 42 +------ drivers/infiniband/sw/rxe/rxe_resp.h | 41 +++++++ 6 files changed, 235 insertions(+), 38 deletions(-) create mode 100644 drivers/infiniband/sw/rxe/rxe_resp.h diff --git a/drivers/infiniband/sw/rxe/rxe.c b/drivers/infiniband/sw/rxe/rxe.c index 0719f451253c..dd287fc60e9d 100644 --- a/drivers/infiniband/sw/rxe/rxe.c +++ b/drivers/infiniband/sw/rxe/rxe.c @@ -79,6 +79,16 @@ static void rxe_init_device_param(struct rxe_dev *rxe) /* IB_ODP_SUPPORT_IMPLICIT is not supported right now. */ rxe->attr.odp_caps.general_caps |= IB_ODP_SUPPORT; + + rxe->attr.odp_caps.per_transport_caps.ud_odp_caps |= IB_ODP_SUPPORT_SEND; + rxe->attr.odp_caps.per_transport_caps.ud_odp_caps |= IB_ODP_SUPPORT_RECV; + rxe->attr.odp_caps.per_transport_caps.ud_odp_caps |= IB_ODP_SUPPORT_SRQ_RECV; + + rxe->attr.odp_caps.per_transport_caps.rc_odp_caps |= IB_ODP_SUPPORT_SEND; + rxe->attr.odp_caps.per_transport_caps.rc_odp_caps |= IB_ODP_SUPPORT_RECV; + rxe->attr.odp_caps.per_transport_caps.rc_odp_caps |= IB_ODP_SUPPORT_WRITE; + rxe->attr.odp_caps.per_transport_caps.rc_odp_caps |= IB_ODP_SUPPORT_READ; + rxe->attr.odp_caps.per_transport_caps.rc_odp_caps |= IB_ODP_SUPPORT_SRQ_RECV; } } diff --git a/drivers/infiniband/sw/rxe/rxe_loc.h b/drivers/infiniband/sw/rxe/rxe_loc.h index 3cf830ee2081..8b19b6fdc497 100644 --- a/drivers/infiniband/sw/rxe/rxe_loc.h +++ b/drivers/infiniband/sw/rxe/rxe_loc.h @@ -192,5 +192,7 @@ static inline unsigned int wr_opcode_mask(int opcode, struct rxe_qp *qp) /* rxe_odp.c */ int rxe_create_user_odp_mr(struct ib_pd *pd, u64 start, u64 length, u64 iova, int access_flags, struct rxe_mr *mr); +int rxe_odp_mr_copy(struct rxe_mr *mr, u64 iova, void *addr, int length, + enum rxe_mr_copy_dir dir); #endif /* RXE_LOC_H */ diff --git a/drivers/infiniband/sw/rxe/rxe_mr.c b/drivers/infiniband/sw/rxe/rxe_mr.c index dd0d68d61bc4..03ba66a38162 100644 --- a/drivers/infiniband/sw/rxe/rxe_mr.c +++ b/drivers/infiniband/sw/rxe/rxe_mr.c @@ -434,7 +434,7 @@ int copy_data( iova = sge->addr + offset; if (mr->odp_enabled) - err = -EOPNOTSUPP; + err = rxe_odp_mr_copy(mr, iova, addr, bytes, dir); else err = rxe_mr_copy(mr, iova, addr, bytes, dir); if (err) diff --git a/drivers/infiniband/sw/rxe/rxe_odp.c b/drivers/infiniband/sw/rxe/rxe_odp.c index 50766889f61a..ba4723818ee7 100644 --- a/drivers/infiniband/sw/rxe/rxe_odp.c +++ b/drivers/infiniband/sw/rxe/rxe_odp.c @@ -3,9 +3,12 @@ * Copyright (c) 2022 Fujitsu Ltd. All rights reserved. */ +#include + #include #include "rxe.h" +#include "rxe_resp.h" static bool rxe_ib_invalidate_range(struct mmu_interval_notifier *mni, const struct mmu_notifier_range *range, @@ -113,3 +116,176 @@ int rxe_create_user_odp_mr(struct ib_pd *pd, u64 start, u64 length, u64 iova, return err; } + +static inline bool rxe_is_pagefault_neccesary(struct ib_umem_odp *umem_odp, + u64 iova, int length, u32 perm) +{ + int idx; + u64 addr; + bool need_fault = false; + + addr = iova & (~(BIT(umem_odp->page_shift) - 1)); + + /* Skim through all pages that are to be accessed. */ + while (addr < iova + length) { + idx = (addr - ib_umem_start(umem_odp)) >> umem_odp->page_shift; + + if (!(umem_odp->dma_list[idx] & perm)) { + need_fault = true; + break; + } + + addr += BIT(umem_odp->page_shift); + } + return need_fault; +} + +/* umem mutex must be locked when entering/exiting this function. */ +static int rxe_odp_map_range(struct rxe_mr *mr, u64 iova, int length, u32 flags) +{ + struct ib_umem_odp *umem_odp = to_ib_umem_odp(mr->umem); + const int max_tries = 3; + int cnt = 0; + + int err; + u64 perm; + bool need_fault; + + if (unlikely(length < 1)) + return -EINVAL; + + perm = ODP_READ_ALLOWED_BIT; + if (!(flags & RXE_PAGEFAULT_RDONLY)) + perm |= ODP_WRITE_ALLOWED_BIT; + + /* + * A successful return from rxe_odp_do_pagefault() does not guarantee + * that all pages in the range became present. Recheck the DMA address + * array, allowing max 3 tries for pagefault. + */ + while ((need_fault = rxe_is_pagefault_neccesary(umem_odp, + iova, length, perm))) { + + if (cnt >= max_tries) + break; + + mutex_unlock(&umem_odp->umem_mutex); + + /* rxe_odp_do_pagefault() locks the umem mutex. */ + err = rxe_odp_do_pagefault(mr, iova, length, flags); + if (err < 0) + return err; + + cnt++; + } + + if (need_fault) + return -EFAULT; + + return 0; +} + +static inline void *rxe_odp_get_virt(struct ib_umem_odp *umem_odp, int umem_idx, + size_t offset) +{ + struct page *page; + void *virt; + + /* + * Step 1. Get page struct from the pfn array. + * Step 2. Convert page struct to kernel logical address. + * Step 3. Add offset in the page to the address. + */ + page = hmm_pfn_to_page(umem_odp->pfn_list[umem_idx]); + virt = page_address(page); + + if (!virt) + return NULL; + + virt += offset; + + return virt; +} + +static int __rxe_odp_mr_copy(struct rxe_mr *mr, u64 iova, void *addr, + int length, enum rxe_mr_copy_dir dir) +{ + struct ib_umem_odp *umem_odp = to_ib_umem_odp(mr->umem); + + int idx, bytes; + u8 *user_va; + size_t offset; + + idx = (iova - ib_umem_start(umem_odp)) >> umem_odp->page_shift; + offset = iova & (BIT(umem_odp->page_shift) - 1); + + while (length > 0) { + u8 *src, *dest; + + user_va = (u8 *)rxe_odp_get_virt(umem_odp, idx, offset); + if (!user_va) + return -EFAULT; + + src = (dir == RXE_TO_MR_OBJ) ? addr : user_va; + dest = (dir == RXE_TO_MR_OBJ) ? user_va : addr; + + bytes = BIT(umem_odp->page_shift) - offset; + + if (bytes > length) + bytes = length; + + memcpy(dest, src, bytes); + + length -= bytes; + idx++; + offset = 0; + } + + return 0; +} + +int rxe_odp_mr_copy(struct rxe_mr *mr, u64 iova, void *addr, int length, + enum rxe_mr_copy_dir dir) +{ + struct ib_umem_odp *umem_odp = to_ib_umem_odp(mr->umem); + u32 flags = 0; + + int err; + + if (length == 0) + return 0; + + if (unlikely(!mr->odp_enabled)) + return RESPST_ERR_RKEY_VIOLATION; + + switch (dir) { + case RXE_TO_MR_OBJ: + break; + + case RXE_FROM_MR_OBJ: + flags = RXE_PAGEFAULT_RDONLY; + break; + + default: + return -EINVAL; + } + + /* If pagefault is not required, umem mutex will be held until data + * copy to the MR completes. Otherwise, it is released and locked + * again in rxe_odp_map_range() to let invalidation handler do its + * work meanwhile. + */ + mutex_lock(&umem_odp->umem_mutex); + + err = rxe_odp_map_range(mr, iova, length, flags); + if (err) { + mutex_unlock(&umem_odp->umem_mutex); + return err; + } + + err = __rxe_odp_mr_copy(mr, iova, addr, length, dir); + + mutex_unlock(&umem_odp->umem_mutex); + + return err; +} diff --git a/drivers/infiniband/sw/rxe/rxe_resp.c b/drivers/infiniband/sw/rxe/rxe_resp.c index 4beb18f8bea8..296b9ccee330 100644 --- a/drivers/infiniband/sw/rxe/rxe_resp.c +++ b/drivers/infiniband/sw/rxe/rxe_resp.c @@ -9,41 +9,7 @@ #include "rxe.h" #include "rxe_loc.h" #include "rxe_queue.h" - -enum resp_states { - RESPST_NONE, - RESPST_GET_REQ, - RESPST_CHK_PSN, - RESPST_CHK_OP_SEQ, - RESPST_CHK_OP_VALID, - RESPST_CHK_RESOURCE, - RESPST_CHK_LENGTH, - RESPST_CHK_RKEY, - RESPST_EXECUTE, - RESPST_READ_REPLY, - RESPST_ATOMIC_REPLY, - RESPST_COMPLETE, - RESPST_ACKNOWLEDGE, - RESPST_CLEANUP, - RESPST_DUPLICATE_REQUEST, - RESPST_ERR_MALFORMED_WQE, - RESPST_ERR_UNSUPPORTED_OPCODE, - RESPST_ERR_MISALIGNED_ATOMIC, - RESPST_ERR_PSN_OUT_OF_SEQ, - RESPST_ERR_MISSING_OPCODE_FIRST, - RESPST_ERR_MISSING_OPCODE_LAST_C, - RESPST_ERR_MISSING_OPCODE_LAST_D1E, - RESPST_ERR_TOO_MANY_RDMA_ATM_REQ, - RESPST_ERR_RNR, - RESPST_ERR_RKEY_VIOLATION, - RESPST_ERR_INVALIDATE_RKEY, - RESPST_ERR_LENGTH, - RESPST_ERR_CQ_OVERFLOW, - RESPST_ERROR, - RESPST_RESET, - RESPST_DONE, - RESPST_EXIT, -}; +#include "rxe_resp.h" static char *resp_state_name[] = { [RESPST_NONE] = "NONE", @@ -544,7 +510,8 @@ static enum resp_states write_data_in(struct rxe_qp *qp, goto out; if (qp->resp.mr->odp_enabled) - err = -EOPNOTSUPP; + err = rxe_odp_mr_copy(qp->resp.mr, qp->resp.va + qp->resp.offset, + payload_addr(pkt), data_len, RXE_TO_MR_OBJ); else err = rxe_mr_copy(qp->resp.mr, qp->resp.va + qp->resp.offset, payload_addr(pkt), data_len, RXE_TO_MR_OBJ); @@ -848,7 +815,8 @@ static enum resp_states read_reply(struct rxe_qp *qp, /* mr is NULL for a zero byte operation. */ if ((res->read.resid != 0) && mr->odp_enabled) - err = -EOPNOTSUPP; + err = rxe_odp_mr_copy(mr, res->read.va, payload_addr(&ack_pkt), + payload, RXE_FROM_MR_OBJ); else err = rxe_mr_copy(mr, res->read.va, payload_addr(&ack_pkt), payload, RXE_FROM_MR_OBJ); diff --git a/drivers/infiniband/sw/rxe/rxe_resp.h b/drivers/infiniband/sw/rxe/rxe_resp.h new file mode 100644 index 000000000000..121f0b998196 --- /dev/null +++ b/drivers/infiniband/sw/rxe/rxe_resp.h @@ -0,0 +1,41 @@ +/* SPDX-License-Identifier: GPL-2.0 OR Linux-OpenIB */ + +#ifndef RXE_RESP_H +#define RXE_RESP_H + +enum resp_states { + RESPST_NONE, + RESPST_GET_REQ, + RESPST_CHK_PSN, + RESPST_CHK_OP_SEQ, + RESPST_CHK_OP_VALID, + RESPST_CHK_RESOURCE, + RESPST_CHK_LENGTH, + RESPST_CHK_RKEY, + RESPST_EXECUTE, + RESPST_READ_REPLY, + RESPST_ATOMIC_REPLY, + RESPST_COMPLETE, + RESPST_ACKNOWLEDGE, + RESPST_CLEANUP, + RESPST_DUPLICATE_REQUEST, + RESPST_ERR_MALFORMED_WQE, + RESPST_ERR_UNSUPPORTED_OPCODE, + RESPST_ERR_MISALIGNED_ATOMIC, + RESPST_ERR_PSN_OUT_OF_SEQ, + RESPST_ERR_MISSING_OPCODE_FIRST, + RESPST_ERR_MISSING_OPCODE_LAST_C, + RESPST_ERR_MISSING_OPCODE_LAST_D1E, + RESPST_ERR_TOO_MANY_RDMA_ATM_REQ, + RESPST_ERR_RNR, + RESPST_ERR_RKEY_VIOLATION, + RESPST_ERR_INVALIDATE_RKEY, + RESPST_ERR_LENGTH, + RESPST_ERR_CQ_OVERFLOW, + RESPST_ERROR, + RESPST_RESET, + RESPST_DONE, + RESPST_EXIT, +}; + +#endif /* RXE_RESP_H */ From patchwork Fri Nov 11 09:22:28 2022 Content-Type: text/plain; charset="utf-8" MIME-Version: 1.0 Content-Transfer-Encoding: 7bit X-Patchwork-Submitter: "Daisuke Matsuda (Fujitsu)" X-Patchwork-Id: 13039970 Received: from esa9.hc1455-7.c3s2.iphmx.com (esa9.hc1455-7.c3s2.iphmx.com [139.138.36.223]) (using TLSv1.2 with cipher ECDHE-RSA-AES256-GCM-SHA384 (256/256 bits)) (No client certificate requested) by smtp.subspace.kernel.org (Postfix) with ESMTPS id 8A07E2F50 for ; Fri, 11 Nov 2022 09:25:15 +0000 (UTC) X-IronPort-AV: E=McAfee;i="6500,9779,10527"; a="83748143" X-IronPort-AV: E=Sophos;i="5.96,156,1665414000"; d="scan'208";a="83748143" Received: from unknown (HELO yto-r2.gw.nic.fujitsu.com) ([218.44.52.218]) by esa9.hc1455-7.c3s2.iphmx.com with ESMTP; 11 Nov 2022 18:24:03 +0900 Received: from yto-m4.gw.nic.fujitsu.com (yto-nat-yto-m4.gw.nic.fujitsu.com [192.168.83.67]) by yto-r2.gw.nic.fujitsu.com (Postfix) with ESMTP id A9FDADE50E for ; Fri, 11 Nov 2022 18:24:02 +0900 (JST) Received: from m3004.s.css.fujitsu.com (m3004.s.css.fujitsu.com [10.128.233.124]) by yto-m4.gw.nic.fujitsu.com (Postfix) with ESMTP id E9F5FF0FD9 for ; Fri, 11 Nov 2022 18:24:01 +0900 (JST) Received: from localhost.localdomain (unknown [10.19.3.107]) by m3004.s.css.fujitsu.com (Postfix) with ESMTP id B3F0A20607A2; Fri, 11 Nov 2022 18:24:01 +0900 (JST) From: Daisuke Matsuda To: linux-rdma@vger.kernel.org, leonro@nvidia.com, jgg@nvidia.com, zyjzyj2000@gmail.com Cc: nvdimm@lists.linux.dev, linux-kernel@vger.kernel.org, rpearsonhpe@gmail.com, yangx.jy@fujitsu.com, lizhijian@fujitsu.com, y-goto@fujitsu.com, Daisuke Matsuda Subject: [RFC PATCH v2 7/7] RDMA/rxe: Add support for the traditional Atomic operations with ODP Date: Fri, 11 Nov 2022 18:22:28 +0900 Message-Id: <07a8870f4c7aea3aa876727f8264ec4fb33ed774.1668157436.git.matsuda-daisuke@fujitsu.com> X-Mailer: git-send-email 2.31.1 In-Reply-To: References: Precedence: bulk X-Mailing-List: nvdimm@lists.linux.dev List-Id: List-Subscribe: List-Unsubscribe: MIME-Version: 1.0 X-TM-AS-GCONF: 00 Enable 'fetch and add' and 'compare and swap' operations to manipulate data in an ODP-enabled MR. This is comprised of the following steps: 1. Check the driver page table(umem_odp->dma_list) to see if the target page is both readable and writable. 2. If not, then trigger page fault to map the page. 3. Convert its user space address to a kernel logical address using PFNs in the driver page table(umem_odp->pfn_list). 4. Execute the operation. umem_mutex is used to ensure that dma_list (an array of addresses of an MR) is not changed while it is checked and that the target page is not invalidated before data access completes. Signed-off-by: Daisuke Matsuda --- drivers/infiniband/sw/rxe/rxe.c | 1 + drivers/infiniband/sw/rxe/rxe_loc.h | 2 ++ drivers/infiniband/sw/rxe/rxe_odp.c | 45 ++++++++++++++++++++++++++++ drivers/infiniband/sw/rxe/rxe_resp.c | 2 +- drivers/infiniband/sw/rxe/rxe_resp.h | 3 ++ 5 files changed, 52 insertions(+), 1 deletion(-) diff --git a/drivers/infiniband/sw/rxe/rxe.c b/drivers/infiniband/sw/rxe/rxe.c index dd287fc60e9d..8190af3e9afe 100644 --- a/drivers/infiniband/sw/rxe/rxe.c +++ b/drivers/infiniband/sw/rxe/rxe.c @@ -88,6 +88,7 @@ static void rxe_init_device_param(struct rxe_dev *rxe) rxe->attr.odp_caps.per_transport_caps.rc_odp_caps |= IB_ODP_SUPPORT_RECV; rxe->attr.odp_caps.per_transport_caps.rc_odp_caps |= IB_ODP_SUPPORT_WRITE; rxe->attr.odp_caps.per_transport_caps.rc_odp_caps |= IB_ODP_SUPPORT_READ; + rxe->attr.odp_caps.per_transport_caps.rc_odp_caps |= IB_ODP_SUPPORT_ATOMIC; rxe->attr.odp_caps.per_transport_caps.rc_odp_caps |= IB_ODP_SUPPORT_SRQ_RECV; } } diff --git a/drivers/infiniband/sw/rxe/rxe_loc.h b/drivers/infiniband/sw/rxe/rxe_loc.h index 8b19b6fdc497..6370dc31c83a 100644 --- a/drivers/infiniband/sw/rxe/rxe_loc.h +++ b/drivers/infiniband/sw/rxe/rxe_loc.h @@ -194,5 +194,7 @@ int rxe_create_user_odp_mr(struct ib_pd *pd, u64 start, u64 length, u64 iova, int access_flags, struct rxe_mr *mr); int rxe_odp_mr_copy(struct rxe_mr *mr, u64 iova, void *addr, int length, enum rxe_mr_copy_dir dir); +enum resp_states rxe_odp_atomic_ops(struct rxe_qp *qp, struct rxe_pkt_info *pkt, + struct rxe_mr *mr); #endif /* RXE_LOC_H */ diff --git a/drivers/infiniband/sw/rxe/rxe_odp.c b/drivers/infiniband/sw/rxe/rxe_odp.c index ba4723818ee7..00aab9071737 100644 --- a/drivers/infiniband/sw/rxe/rxe_odp.c +++ b/drivers/infiniband/sw/rxe/rxe_odp.c @@ -289,3 +289,48 @@ int rxe_odp_mr_copy(struct rxe_mr *mr, u64 iova, void *addr, int length, return err; } + +static inline void *rxe_odp_get_virt_atomic(struct rxe_qp *qp, struct rxe_mr *mr) +{ + struct ib_umem_odp *umem_odp = to_ib_umem_odp(mr->umem); + u64 iova = qp->resp.va + qp->resp.offset; + int idx; + size_t offset; + + if (rxe_odp_map_range(mr, iova, sizeof(char), 0)) + return NULL; + + idx = (iova - ib_umem_start(umem_odp)) >> umem_odp->page_shift; + offset = iova & (BIT(umem_odp->page_shift) - 1); + + return rxe_odp_get_virt(umem_odp, idx, offset); +} + +enum resp_states rxe_odp_atomic_ops(struct rxe_qp *qp, struct rxe_pkt_info *pkt, + struct rxe_mr *mr) +{ + struct ib_umem_odp *umem_odp = to_ib_umem_odp(mr->umem); + u64 *vaddr; + int ret; + + if (unlikely(!mr->odp_enabled)) + return RESPST_ERR_RKEY_VIOLATION; + + /* If pagefault is not required, umem mutex will be held until an + * atomic operation completes. Otherwise, it is released and locked + * again in rxe_odp_map_range() to let invalidation handler do its + * work meanwhile. + */ + mutex_lock(&umem_odp->umem_mutex); + + vaddr = (u64 *)rxe_odp_get_virt_atomic(qp, mr); + + if (pkt->mask & RXE_ATOMIC_MASK) + ret = rxe_process_atomic(qp, pkt, vaddr); + else + /* ATOMIC WRITE operation will come here. */ + ret = RESPST_ERR_RKEY_VIOLATION; + + mutex_unlock(&umem_odp->umem_mutex); + return ret; +} diff --git a/drivers/infiniband/sw/rxe/rxe_resp.c b/drivers/infiniband/sw/rxe/rxe_resp.c index 296b9ccee330..8e6a32c6c9e7 100644 --- a/drivers/infiniband/sw/rxe/rxe_resp.c +++ b/drivers/infiniband/sw/rxe/rxe_resp.c @@ -647,7 +647,7 @@ static enum resp_states rxe_atomic_reply(struct rxe_qp *qp, return RESPST_ERR_RKEY_VIOLATION; if (mr->odp_enabled) - ret = RESPST_ERR_UNSUPPORTED_OPCODE; + ret = rxe_odp_atomic_ops(qp, pkt, mr); else ret = rxe_atomic_ops(qp, pkt, mr); } else diff --git a/drivers/infiniband/sw/rxe/rxe_resp.h b/drivers/infiniband/sw/rxe/rxe_resp.h index 121f0b998196..cb907b49175f 100644 --- a/drivers/infiniband/sw/rxe/rxe_resp.h +++ b/drivers/infiniband/sw/rxe/rxe_resp.h @@ -38,4 +38,7 @@ enum resp_states { RESPST_EXIT, }; +enum resp_states rxe_process_atomic(struct rxe_qp *qp, + struct rxe_pkt_info *pkt, u64 *vaddr); + #endif /* RXE_RESP_H */