From patchwork Mon Dec 18 22:31:48 2023
X-Patchwork-Submitter: Chuck Lever
X-Patchwork-Id: 13497653
Subject: [PATCH v1 1/4] svcrdma: Add back svc_rdma_recv_ctxt::rc_pages
From: Chuck Lever
To: linux-nfs@vger.kernel.org, linux-rdma@vger.kernel.org
Cc: tom@talpey.com
Date: Mon, 18 Dec 2023 17:31:48 -0500
Message-ID: <170293870808.4604.205731294518425415.stgit@bazille.1015granger.net>
In-Reply-To: <170293795877.4604.12721267378032407419.stgit@bazille.1015granger.net>
References: <170293795877.4604.12721267378032407419.stgit@bazille.1015granger.net>

From: Chuck Lever

Having an nfsd thread wait for an RDMA Read completion is problematic if the Read responder (the client) stops responding. We need to go back to handling RDMA Reads by allowing the nfsd thread to return to the svc scheduler, then waking a second thread to finish the RPC message once the Read completion fires.

To start with, restore the rc_pages field so that RDMA Read pages can be managed across calls to svc_rdma_recvfrom().
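As a rough illustration of the ownership rule this patch restores, here is a minimal user-space model (not kernel code; the model_* names are hypothetical stand-ins for rc_pages, rc_page_count, release_pages(), and rq_pages): the receive context owns whatever Read pages it has gathered until they are explicitly handed to the request, and the put path frees only what was never handed off.

/*
 * Minimal user-space model of the page-ownership rule described above.
 * Not kernel code: malloc()/free() stand in for page references and
 * release_pages(), and the structure names are hypothetical.
 */
#include <stdio.h>
#include <stdlib.h>

#define MODEL_MAXPAGES 8

struct model_recv_ctxt {
	unsigned int page_count;	/* models rc_page_count */
	void *pages[MODEL_MAXPAGES];	/* models rc_pages[] */
};

/* Error and teardown paths call this; it must tolerate leftover pages. */
static void model_ctxt_put(struct model_recv_ctxt *ctxt)
{
	unsigned int i;

	for (i = 0; i < ctxt->page_count; i++)
		free(ctxt->pages[i]);
	ctxt->page_count = 0;
}

/* The normal path transfers ownership to the request and zeroes the
 * count so that a later model_ctxt_put() cannot free the pages again. */
static void model_transfer_pages(struct model_recv_ctxt *ctxt, void **rq_pages)
{
	unsigned int i;

	for (i = 0; i < ctxt->page_count; i++)
		rq_pages[i] = ctxt->pages[i];
	ctxt->page_count = 0;
}

int main(void)
{
	struct model_recv_ctxt ctxt = { 0 };
	void *rq_pages[MODEL_MAXPAGES] = { 0 };
	int i;

	for (i = 0; i < 3; i++)
		ctxt.pages[ctxt.page_count++] = malloc(4096);

	model_transfer_pages(&ctxt, rq_pages);	/* normal completion */
	model_ctxt_put(&ctxt);			/* frees nothing: count is 0 */

	for (i = 0; i < 3; i++)
		free(rq_pages[i]);		/* now owned by the request */
	printf("pages handed off without a double free\n");
	return 0;
}

The release_pages() call added to svc_rdma_recv_ctxt_put() below plays the role of model_ctxt_put(), and zeroing rc_page_count after the hand-off is what keeps the two paths from colliding.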
Signed-off-by: Chuck Lever --- include/linux/sunrpc/svc_rdma.h | 4 +++- net/sunrpc/xprtrdma/svc_rdma_recvfrom.c | 5 +++++ net/sunrpc/xprtrdma/svc_rdma_rw.c | 4 +++- 3 files changed, 11 insertions(+), 2 deletions(-) diff --git a/include/linux/sunrpc/svc_rdma.h b/include/linux/sunrpc/svc_rdma.h index 46f2ce9f810b..0f2d7f68ef5d 100644 --- a/include/linux/sunrpc/svc_rdma.h +++ b/include/linux/sunrpc/svc_rdma.h @@ -183,7 +183,6 @@ struct svc_rdma_recv_ctxt { void *rc_recv_buf; struct xdr_stream rc_stream; u32 rc_byte_len; - unsigned int rc_page_count; u32 rc_inv_rkey; __be32 rc_msgtype; @@ -199,6 +198,9 @@ struct svc_rdma_recv_ctxt { struct svc_rdma_chunk *rc_cur_result_payload; struct svc_rdma_pcl rc_write_pcl; struct svc_rdma_pcl rc_reply_pcl; + + unsigned int rc_page_count; + struct page *rc_pages[RPCSVC_MAXPAGES]; }; struct svc_rdma_send_ctxt { diff --git a/net/sunrpc/xprtrdma/svc_rdma_recvfrom.c b/net/sunrpc/xprtrdma/svc_rdma_recvfrom.c index 38f01652dc6d..e363cb1bdbc4 100644 --- a/net/sunrpc/xprtrdma/svc_rdma_recvfrom.c +++ b/net/sunrpc/xprtrdma/svc_rdma_recvfrom.c @@ -214,6 +214,11 @@ struct svc_rdma_recv_ctxt *svc_rdma_recv_ctxt_get(struct svcxprt_rdma *rdma) void svc_rdma_recv_ctxt_put(struct svcxprt_rdma *rdma, struct svc_rdma_recv_ctxt *ctxt) { + /* @rc_page_count is normally zero here, but error flows + * can leave pages in @rc_pages. + */ + release_pages(ctxt->rc_pages, ctxt->rc_page_count); + pcl_free(&ctxt->rc_call_pcl); pcl_free(&ctxt->rc_read_pcl); pcl_free(&ctxt->rc_write_pcl); diff --git a/net/sunrpc/xprtrdma/svc_rdma_rw.c b/net/sunrpc/xprtrdma/svc_rdma_rw.c index ff54bb268b7d..28a34718dee5 100644 --- a/net/sunrpc/xprtrdma/svc_rdma_rw.c +++ b/net/sunrpc/xprtrdma/svc_rdma_rw.c @@ -1131,7 +1131,9 @@ int svc_rdma_process_read_list(struct svcxprt_rdma *rdma, rqstp->rq_respages = &rqstp->rq_pages[head->rc_page_count]; rqstp->rq_next_page = rqstp->rq_respages + 1; - /* Ensure svc_rdma_recv_ctxt_put() does not try to release pages */ + /* Ensure svc_rdma_recv_ctxt_put() does not release pages + * left in @rc_pages while I/O proceeds. 
+ */ head->rc_page_count = 0; out_err:

From patchwork Mon Dec 18 22:31:54 2023
X-Patchwork-Submitter: Chuck Lever
X-Patchwork-Id: 13497654
Subject: [PATCH v1 2/4] svcrdma: Add back svcxprt_rdma::sc_read_complete_q
From: Chuck Lever
To: linux-nfs@vger.kernel.org, linux-rdma@vger.kernel.org
Cc: tom@talpey.com
Date: Mon, 18 Dec 2023 17:31:54 -0500
Message-ID: <170293871455.4604.10661587001070835855.stgit@bazille.1015granger.net>
In-Reply-To: <170293795877.4604.12721267378032407419.stgit@bazille.1015granger.net>
References: <170293795877.4604.12721267378032407419.stgit@bazille.1015granger.net>

From: Chuck Lever

Having an nfsd thread wait for an RDMA Read completion is problematic if the Read responder (i.e., the client) stops responding. We need to go back to handling RDMA Reads by allowing the nfsd thread to return to the svc scheduler, then waking a second thread to finish the RPC message once the Read completion fires.

As a next step, add a list_head upon which completed Reads are queued. A subsequent patch will make use of this queue.
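To sketch how this queue is meant to be consumed (the dispatch added to svc_rdma_recvfrom() in the diff below), here is a small user-space model, not kernel code; model_read_done() and model_next_work() are hypothetical names, and the real code uses sc_read_complete_q, sc_rq_dto_q, and sc_rq_dto_lock. The Read completion handler acts as the producer, and the receive path as the consumer that always prefers a completed Read over a newly received message.

/*
 * User-space model of the two receive queues. Not kernel code: a plain
 * singly linked list and a pthread mutex stand in for list_head and the
 * sc_rq_dto_lock spinlock, and all names here are hypothetical.
 */
#include <pthread.h>
#include <stdio.h>

struct model_ctxt {
	int id;
	struct model_ctxt *next;
};

static struct model_ctxt *read_complete_q;	/* models sc_read_complete_q */
static struct model_ctxt *rq_dto_q;		/* models sc_rq_dto_q */
static pthread_mutex_t dto_lock = PTHREAD_MUTEX_INITIALIZER;

/* Producer: the Read completion handler parks the finished ctxt.
 * (The real code adds at the tail and also wakes the transport.) */
static void model_read_done(struct model_ctxt *ctxt)
{
	pthread_mutex_lock(&dto_lock);
	ctxt->next = read_complete_q;
	read_complete_q = ctxt;
	pthread_mutex_unlock(&dto_lock);
}

/* Consumer: the receive path prefers completed Reads over new receives. */
static struct model_ctxt *model_next_work(void)
{
	struct model_ctxt *ctxt;

	pthread_mutex_lock(&dto_lock);
	if (read_complete_q) {
		ctxt = read_complete_q;
		read_complete_q = ctxt->next;
	} else {
		ctxt = rq_dto_q;
		if (ctxt)
			rq_dto_q = ctxt->next;
	}
	pthread_mutex_unlock(&dto_lock);
	return ctxt;
}

int main(void)
{
	struct model_ctxt fresh = { .id = 1 }, finished = { .id = 2 };

	rq_dto_q = &fresh;
	model_read_done(&finished);
	printf("dispatched ctxt %d first\n", model_next_work()->id);	/* 2 */
	printf("then ctxt %d\n", model_next_work()->id);		/* 1 */
	return 0;
}

The producer side of the real queue is wired up by the final patch in this series, where svc_rdma_wc_read_done() adds the receive context to sc_read_complete_q, sets XPT_DATA, and enqueues the transport.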
Signed-off-by: Chuck Lever --- include/linux/sunrpc/svc_rdma.h | 1 + net/sunrpc/xprtrdma/svc_rdma_recvfrom.c | 37 +++++++++++++++++++++++++++++- net/sunrpc/xprtrdma/svc_rdma_transport.c | 1 + 3 files changed, 38 insertions(+), 1 deletion(-) diff --git a/include/linux/sunrpc/svc_rdma.h b/include/linux/sunrpc/svc_rdma.h index 0f2d7f68ef5d..c98d29e51b9c 100644 --- a/include/linux/sunrpc/svc_rdma.h +++ b/include/linux/sunrpc/svc_rdma.h @@ -98,6 +98,7 @@ struct svcxprt_rdma { u32 sc_pending_recvs; u32 sc_recv_batch; struct list_head sc_rq_dto_q; + struct list_head sc_read_complete_q; spinlock_t sc_rq_dto_lock; struct ib_qp *sc_qp; struct ib_cq *sc_rq_cq; diff --git a/net/sunrpc/xprtrdma/svc_rdma_recvfrom.c b/net/sunrpc/xprtrdma/svc_rdma_recvfrom.c index e363cb1bdbc4..2de947183a7a 100644 --- a/net/sunrpc/xprtrdma/svc_rdma_recvfrom.c +++ b/net/sunrpc/xprtrdma/svc_rdma_recvfrom.c @@ -382,6 +382,10 @@ void svc_rdma_flush_recv_queues(struct svcxprt_rdma *rdma) { struct svc_rdma_recv_ctxt *ctxt; + while ((ctxt = svc_rdma_next_recv_ctxt(&rdma->sc_read_complete_q))) { + list_del(&ctxt->rc_list); + svc_rdma_recv_ctxt_put(rdma, ctxt); + } while ((ctxt = svc_rdma_next_recv_ctxt(&rdma->sc_rq_dto_q))) { list_del(&ctxt->rc_list); svc_rdma_recv_ctxt_put(rdma, ctxt); @@ -763,6 +767,30 @@ static bool svc_rdma_is_reverse_direction_reply(struct svc_xprt *xprt, return true; } +static noinline void svc_rdma_read_complete(struct svc_rqst *rqstp, + struct svc_rdma_recv_ctxt *ctxt) +{ + int i; + + /* Transfer the Read chunk pages into @rqstp.rq_pages, replacing + * the rq_pages that were already allocated for this rqstp. + */ + release_pages(rqstp->rq_respages, ctxt->rc_page_count); + for (i = 0; i < ctxt->rc_page_count; i++) + rqstp->rq_pages[i] = ctxt->rc_pages[i]; + + /* Update @rqstp's result send buffer to start after the + * last page in the RDMA Read payload. + */ + rqstp->rq_respages = &rqstp->rq_pages[ctxt->rc_page_count]; + rqstp->rq_next_page = rqstp->rq_respages + 1; + + /* Prevent svc_rdma_recv_ctxt_put() from releasing the + * pages in ctxt::rc_pages a second time. 
+ */ + ctxt->rc_page_count = 0; +} + /** * svc_rdma_recvfrom - Receive an RPC call * @rqstp: request structure into which to receive an RPC Call * @@ -807,8 +835,14 @@ int svc_rdma_recvfrom(struct svc_rqst *rqstp) rqstp->rq_xprt_ctxt = NULL; - ctxt = NULL; spin_lock(&rdma_xprt->sc_rq_dto_lock); + ctxt = svc_rdma_next_recv_ctxt(&rdma_xprt->sc_read_complete_q); + if (ctxt) { + list_del(&ctxt->rc_list); + spin_unlock_bh(&rdma_xprt->sc_rq_dto_lock); + svc_rdma_read_complete(rqstp, ctxt); + goto complete; + } ctxt = svc_rdma_next_recv_ctxt(&rdma_xprt->sc_rq_dto_q); if (ctxt) list_del(&ctxt->rc_list); @@ -846,6 +880,7 @@ int svc_rdma_recvfrom(struct svc_rqst *rqstp) goto out_readfail; } +complete: rqstp->rq_xprt_ctxt = ctxt; rqstp->rq_prot = IPPROTO_MAX; svc_xprt_copy_addrs(rqstp, xprt); diff --git a/net/sunrpc/xprtrdma/svc_rdma_transport.c b/net/sunrpc/xprtrdma/svc_rdma_transport.c index 0ceb2817ca4d..10f01f4bdac6 100644 --- a/net/sunrpc/xprtrdma/svc_rdma_transport.c +++ b/net/sunrpc/xprtrdma/svc_rdma_transport.c @@ -137,6 +137,7 @@ static struct svcxprt_rdma *svc_rdma_create_xprt(struct svc_serv *serv, svc_xprt_init(net, &svc_rdma_class, &cma_xprt->sc_xprt, serv); INIT_LIST_HEAD(&cma_xprt->sc_accept_q); INIT_LIST_HEAD(&cma_xprt->sc_rq_dto_q); + INIT_LIST_HEAD(&cma_xprt->sc_read_complete_q); init_llist_head(&cma_xprt->sc_send_ctxts); init_llist_head(&cma_xprt->sc_recv_ctxts); init_llist_head(&cma_xprt->sc_rw_ctxts);

From patchwork Mon Dec 18 22:32:01 2023
X-Patchwork-Submitter: Chuck Lever
X-Patchwork-Id: 13497655
Subject: [PATCH v1 3/4] svcrdma: Copy construction of svc_rqst::rq_arg to rdma_read_complete()
From: Chuck Lever
To: linux-nfs@vger.kernel.org, linux-rdma@vger.kernel.org
Cc: tom@talpey.com
Date: Mon, 18 Dec 2023 17:32:01 -0500
Message-ID: <170293872099.4604.16258519407111601722.stgit@bazille.1015granger.net>
In-Reply-To: <170293795877.4604.12721267378032407419.stgit@bazille.1015granger.net>
References: <170293795877.4604.12721267378032407419.stgit@bazille.1015granger.net>

From: Chuck Lever

Once a set of RDMA Reads is complete, the Read completion handler will poke the transport to trigger a second call to svc_rdma_recvfrom().
recvfrom() will then merge the RDMA Read payloads with the previously received RPC header to form a completed RPC Call message. The new code is copied from the svc_rdma_process_read_list() path. A subsequent patch will make use of this code and remove the code that this was copied from (svc_rdma_rw.c). Signed-off-by: Chuck Lever --- include/trace/events/rpcrdma.h | 1 net/sunrpc/xprtrdma/svc_rdma_recvfrom.c | 93 +++++++++++++++++++++++++++++++ 2 files changed, 93 insertions(+), 1 deletion(-) diff --git a/include/trace/events/rpcrdma.h b/include/trace/events/rpcrdma.h index 9a3fc6eb09a8..110c1475c527 100644 --- a/include/trace/events/rpcrdma.h +++ b/include/trace/events/rpcrdma.h @@ -2112,6 +2112,7 @@ TRACE_EVENT(svcrdma_wc_read, DEFINE_SEND_FLUSH_EVENT(svcrdma_wc_read_flush); DEFINE_SEND_FLUSH_EVENT(svcrdma_wc_read_err); +DEFINE_SIMPLE_CID_EVENT(svcrdma_read_finished); DEFINE_SIMPLE_CID_EVENT(svcrdma_wc_write); DEFINE_SEND_FLUSH_EVENT(svcrdma_wc_write_flush); diff --git a/net/sunrpc/xprtrdma/svc_rdma_recvfrom.c b/net/sunrpc/xprtrdma/svc_rdma_recvfrom.c index 2de947183a7a..034bdd02f925 100644 --- a/net/sunrpc/xprtrdma/svc_rdma_recvfrom.c +++ b/net/sunrpc/xprtrdma/svc_rdma_recvfrom.c @@ -767,10 +767,86 @@ static bool svc_rdma_is_reverse_direction_reply(struct svc_xprt *xprt, return true; } +/* Finish constructing the RPC Call message in rqstp::rq_arg. + * + * The incoming RPC/RDMA message is an RDMA_MSG type message + * with a single Read chunk (only the upper layer data payload + * was conveyed via RDMA Read). + */ +static void svc_rdma_read_complete_one(struct svc_rqst *rqstp, + struct svc_rdma_recv_ctxt *ctxt) +{ + struct svc_rdma_chunk *chunk = pcl_first_chunk(&ctxt->rc_read_pcl); + struct xdr_buf *buf = &rqstp->rq_arg; + unsigned int length; + + /* Split the Receive buffer between the head and tail + * buffers at Read chunk's position. XDR roundup of the + * chunk is not included in either the pagelist or in + * the tail. + */ + buf->tail[0].iov_base = buf->head[0].iov_base + chunk->ch_position; + buf->tail[0].iov_len = buf->head[0].iov_len - chunk->ch_position; + buf->head[0].iov_len = chunk->ch_position; + + /* Read chunk may need XDR roundup (see RFC 8166, s. 3.4.5.2). + * + * If the client already rounded up the chunk length, the + * length does not change. Otherwise, the length of the page + * list is increased to include XDR round-up. + * + * Currently these chunks always start at page offset 0, + * thus the rounded-up length never crosses a page boundary. + */ + buf->pages = &rqstp->rq_pages[0]; + length = xdr_align_size(chunk->ch_length); + buf->page_len = length; + buf->len += length; + buf->buflen += length; +} + +/* Finish constructing the RPC Call message in rqstp::rq_arg. + * + * The incoming RPC/RDMA message is an RDMA_MSG type message + * with payload in multiple Read chunks and no PZRC. + */ +static void svc_rdma_read_complete_multiple(struct svc_rqst *rqstp, + struct svc_rdma_recv_ctxt *ctxt) +{ + struct xdr_buf *buf = &rqstp->rq_arg; + + buf->len += ctxt->rc_readbytes; + buf->buflen += ctxt->rc_readbytes; + + buf->head[0].iov_base = page_address(rqstp->rq_pages[0]); + buf->head[0].iov_len = min_t(size_t, PAGE_SIZE, ctxt->rc_readbytes); + buf->pages = &rqstp->rq_pages[1]; + buf->page_len = ctxt->rc_readbytes - buf->head[0].iov_len; +} + +/* Finish constructing the RPC Call message in rqstp::rq_arg. + * + * The incoming RPC/RDMA message is an RDMA_NOMSG type message + * (the RPC message body was conveyed via RDMA Read). 
+ */ +static void svc_rdma_read_complete_pzrc(struct svc_rqst *rqstp, + struct svc_rdma_recv_ctxt *ctxt) +{ + struct xdr_buf *buf = &rqstp->rq_arg; + + buf->len += ctxt->rc_readbytes; + buf->buflen += ctxt->rc_readbytes; + + buf->head[0].iov_base = page_address(rqstp->rq_pages[0]); + buf->head[0].iov_len = min_t(size_t, PAGE_SIZE, ctxt->rc_readbytes); + buf->pages = &rqstp->rq_pages[1]; + buf->page_len = ctxt->rc_readbytes - buf->head[0].iov_len; +} + static noinline void svc_rdma_read_complete(struct svc_rqst *rqstp, struct svc_rdma_recv_ctxt *ctxt) { - int i; + unsigned int i; /* Transfer the Read chunk pages into @rqstp.rq_pages, replacing * the rq_pages that were already allocated for this rqstp. @@ -789,6 +865,21 @@ static noinline void svc_rdma_read_complete(struct svc_rqst *rqstp, * pages in ctxt::rc_pages a second time. */ ctxt->rc_page_count = 0; + + /* Finish constructing the RPC Call message. The exact + * procedure for that depends on what kind of RPC/RDMA + * chunks were provided by the client. + */ + if (pcl_is_empty(&ctxt->rc_call_pcl)) { + if (ctxt->rc_read_pcl.cl_count == 1) + svc_rdma_read_complete_one(rqstp, ctxt); + else + svc_rdma_read_complete_multiple(rqstp, ctxt); + } else { + svc_rdma_read_complete_pzrc(rqstp, ctxt); + } + + trace_svcrdma_read_finished(&ctxt->rc_cid); } /**

From patchwork Mon Dec 18 22:32:07 2023
X-Patchwork-Submitter: Chuck Lever
X-Patchwork-Id: 13497656
Subject: [PATCH v1 4/4] svcrdma: Implement multi-stage Read completion again
From: Chuck Lever
To: linux-nfs@vger.kernel.org, linux-rdma@vger.kernel.org
Cc: tom@talpey.com
Date: Mon, 18 Dec 2023 17:32:07 -0500
Message-ID: <170293872745.4604.13501370140544523227.stgit@bazille.1015granger.net>
In-Reply-To: <170293795877.4604.12721267378032407419.stgit@bazille.1015granger.net>
References: <170293795877.4604.12721267378032407419.stgit@bazille.1015granger.net>

From: Chuck Lever

Having an nfsd thread wait for an RDMA Read completion is problematic if the Read responder (i.e., the client) stops responding. We need to go back to handling RDMA Reads by getting the svc scheduler to call svc_rdma_recvfrom() a second time to finish building an RPC message after a Read completion.
This is the final patch, and makes several changes that have to happen concurrently: 1. svc_rdma_process_read_list no longer waits for a completion, but simply builds and posts the Read WRs. 2. svc_rdma_read_done() now queues a completed Read on sc_read_complete_q for later processing rather than calling complete(). 3. The completed RPC message is no longer built in the svc_rdma_process_read_list() path. Finishing the message is now done in svc_rdma_recvfrom() when it notices work on the sc_read_complete_q. The "finish building this RPC message" code is removed from the svc_rdma_process_read_list() path. This arrangement avoids the need for an nfsd thread to wait for an RDMA Read non-interruptibly without a timeout. It's basically the same code structure that Tom Tucker used for Read chunks along with some clean-up and modernization. Signed-off-by: Chuck Lever --- include/linux/sunrpc/svc_rdma.h | 6 + net/sunrpc/xprtrdma/svc_rdma_recvfrom.c | 36 +++++-- net/sunrpc/xprtrdma/svc_rdma_rw.c | 151 +++++++++++-------------------- 3 files changed, 80 insertions(+), 113 deletions(-) diff --git a/include/linux/sunrpc/svc_rdma.h b/include/linux/sunrpc/svc_rdma.h index c98d29e51b9c..e7595ae62fe2 100644 --- a/include/linux/sunrpc/svc_rdma.h +++ b/include/linux/sunrpc/svc_rdma.h @@ -170,8 +170,6 @@ struct svc_rdma_chunk_ctxt { struct list_head cc_rwctxts; ktime_t cc_posttime; int cc_sqecount; - enum ib_wc_status cc_status; - struct completion cc_done; }; struct svc_rdma_recv_ctxt { @@ -191,6 +189,7 @@ struct svc_rdma_recv_ctxt { unsigned int rc_pageoff; unsigned int rc_curpage; unsigned int rc_readbytes; + struct xdr_buf rc_saved_arg; struct svc_rdma_chunk_ctxt rc_cc; struct svc_rdma_pcl rc_call_pcl; @@ -240,6 +239,9 @@ extern int svc_rdma_recvfrom(struct svc_rqst *); extern void svc_rdma_destroy_rw_ctxts(struct svcxprt_rdma *rdma); extern void svc_rdma_cc_init(struct svcxprt_rdma *rdma, struct svc_rdma_chunk_ctxt *cc); +extern void svc_rdma_cc_release(struct svcxprt_rdma *rdma, + struct svc_rdma_chunk_ctxt *cc, + enum dma_data_direction dir); extern int svc_rdma_send_write_chunk(struct svcxprt_rdma *rdma, const struct svc_rdma_chunk *chunk, const struct xdr_buf *xdr); diff --git a/net/sunrpc/xprtrdma/svc_rdma_recvfrom.c b/net/sunrpc/xprtrdma/svc_rdma_recvfrom.c index 034bdd02f925..d72953f29258 100644 --- a/net/sunrpc/xprtrdma/svc_rdma_recvfrom.c +++ b/net/sunrpc/xprtrdma/svc_rdma_recvfrom.c @@ -214,6 +214,8 @@ struct svc_rdma_recv_ctxt *svc_rdma_recv_ctxt_get(struct svcxprt_rdma *rdma) void svc_rdma_recv_ctxt_put(struct svcxprt_rdma *rdma, struct svc_rdma_recv_ctxt *ctxt) { + svc_rdma_cc_release(rdma, &ctxt->rc_cc, DMA_FROM_DEVICE); + /* @rc_page_count is normally zero here, but error flows * can leave pages in @rc_pages. */ @@ -870,6 +872,7 @@ static noinline void svc_rdma_read_complete(struct svc_rqst *rqstp, * procedure for that depends on what kind of RPC/RDMA * chunks were provided by the client. 
*/ + rqstp->rq_arg = ctxt->rc_saved_arg; if (pcl_is_empty(&ctxt->rc_call_pcl)) { if (ctxt->rc_read_pcl.cl_count == 1) svc_rdma_read_complete_one(rqstp, ctxt); @@ -930,7 +933,8 @@ int svc_rdma_recvfrom(struct svc_rqst *rqstp) ctxt = svc_rdma_next_recv_ctxt(&rdma_xprt->sc_read_complete_q); if (ctxt) { list_del(&ctxt->rc_list); - spin_unlock_bh(&rdma_xprt->sc_rq_dto_lock); + spin_unlock(&rdma_xprt->sc_rq_dto_lock); + svc_xprt_received(xprt); svc_rdma_read_complete(rqstp, ctxt); goto complete; } @@ -965,11 +969,8 @@ int svc_rdma_recvfrom(struct svc_rqst *rqstp) svc_rdma_get_inv_rkey(rdma_xprt, ctxt); if (!pcl_is_empty(&ctxt->rc_read_pcl) || - !pcl_is_empty(&ctxt->rc_call_pcl)) { - ret = svc_rdma_process_read_list(rdma_xprt, rqstp, ctxt); - if (ret < 0) - goto out_readfail; - } + !pcl_is_empty(&ctxt->rc_call_pcl)) + goto out_readlist; complete: rqstp->rq_xprt_ctxt = ctxt; @@ -983,12 +984,23 @@ int svc_rdma_recvfrom(struct svc_rqst *rqstp) svc_rdma_recv_ctxt_put(rdma_xprt, ctxt); return 0; -out_readfail: - if (ret == -EINVAL) - svc_rdma_send_error(rdma_xprt, ctxt, ret); - svc_rdma_recv_ctxt_put(rdma_xprt, ctxt); - svc_xprt_deferred_close(xprt); - return -ENOTCONN; +out_readlist: + /* This @rqstp is about to be recycled. Save the work + * already done constructing the Call message in rq_arg + * so it can be restored when the RDMA Reads have + * completed. + */ + ctxt->rc_saved_arg = rqstp->rq_arg; + + ret = svc_rdma_process_read_list(rdma_xprt, rqstp, ctxt); + if (ret < 0) { + if (ret == -EINVAL) + svc_rdma_send_error(rdma_xprt, ctxt, ret); + svc_rdma_recv_ctxt_put(rdma_xprt, ctxt); + svc_xprt_deferred_close(xprt); + return ret; + } + return 0; out_backchannel: svc_rdma_handle_bc_reply(rqstp, ctxt); diff --git a/net/sunrpc/xprtrdma/svc_rdma_rw.c b/net/sunrpc/xprtrdma/svc_rdma_rw.c index 28a34718dee5..c00fcce61d1e 100644 --- a/net/sunrpc/xprtrdma/svc_rdma_rw.c +++ b/net/sunrpc/xprtrdma/svc_rdma_rw.c @@ -163,14 +163,15 @@ void svc_rdma_cc_init(struct svcxprt_rdma *rdma, cc->cc_sqecount = 0; } -/* - * The consumed rw_ctx's are cleaned and placed on a local llist so - * that only one atomic llist operation is needed to put them all - * back on the free list. 
+/** + * svc_rdma_cc_release - Release resources held by a svc_rdma_chunk_ctxt + * @rdma: controlling transport instance + * @cc: svc_rdma_chunk_ctxt to be released + * @dir: DMA direction */ -static void svc_rdma_cc_release(struct svcxprt_rdma *rdma, - struct svc_rdma_chunk_ctxt *cc, - enum dma_data_direction dir) +void svc_rdma_cc_release(struct svcxprt_rdma *rdma, + struct svc_rdma_chunk_ctxt *cc, + enum dma_data_direction dir) { struct llist_node *first, *last; struct svc_rdma_rw_ctxt *ctxt; @@ -300,12 +301,21 @@ static void svc_rdma_wc_read_done(struct ib_cq *cq, struct ib_wc *wc) container_of(cqe, struct svc_rdma_chunk_ctxt, cc_cqe); struct svc_rdma_recv_ctxt *ctxt; + svc_rdma_wake_send_waiters(rdma, cc->cc_sqecount); + + ctxt = container_of(cc, struct svc_rdma_recv_ctxt, rc_cc); switch (wc->status) { case IB_WC_SUCCESS: - ctxt = container_of(cc, struct svc_rdma_recv_ctxt, rc_cc); trace_svcrdma_wc_read(wc, &cc->cc_cid, ctxt->rc_readbytes, cc->cc_posttime); - break; + + spin_lock(&rdma->sc_rq_dto_lock); + list_add_tail(&ctxt->rc_list, &rdma->sc_read_complete_q); + /* the unlock pairs with the smp_rmb in svc_xprt_ready */ + set_bit(XPT_DATA, &rdma->sc_xprt.xpt_flags); + spin_unlock(&rdma->sc_rq_dto_lock); + svc_xprt_enqueue(&rdma->sc_xprt); + return; case IB_WC_WR_FLUSH_ERR: trace_svcrdma_wc_read_flush(wc, &cc->cc_cid); break; @@ -313,10 +323,13 @@ static void svc_rdma_wc_read_done(struct ib_cq *cq, struct ib_wc *wc) trace_svcrdma_wc_read_err(wc, &cc->cc_cid); } - svc_rdma_wake_send_waiters(rdma, cc->cc_sqecount); - cc->cc_status = wc->status; - complete(&cc->cc_done); - return; + /* The RDMA Read has flushed, so the incoming RPC message + * cannot be constructed and must be dropped. Signal the + * loss to the client by closing the connection. + */ + svc_rdma_cc_release(rdma, cc, DMA_FROM_DEVICE); + svc_rdma_recv_ctxt_put(rdma, ctxt); + svc_xprt_deferred_close(&rdma->sc_xprt); } /* @@ -823,7 +836,6 @@ svc_rdma_read_multiple_chunks(struct svc_rqst *rqstp, struct svc_rdma_recv_ctxt *head) { const struct svc_rdma_pcl *pcl = &head->rc_read_pcl; - struct xdr_buf *buf = &rqstp->rq_arg; struct svc_rdma_chunk *chunk, *next; unsigned int start, length; int ret; @@ -853,18 +865,7 @@ svc_rdma_read_multiple_chunks(struct svc_rqst *rqstp, start += length; length = head->rc_byte_len - start; - ret = svc_rdma_copy_inline_range(rqstp, head, start, length); - if (ret < 0) - return ret; - - buf->len += head->rc_readbytes; - buf->buflen += head->rc_readbytes; - - buf->head[0].iov_base = page_address(rqstp->rq_pages[0]); - buf->head[0].iov_len = min_t(size_t, PAGE_SIZE, head->rc_readbytes); - buf->pages = &rqstp->rq_pages[1]; - buf->page_len = head->rc_readbytes - buf->head[0].iov_len; - return 0; + return svc_rdma_copy_inline_range(rqstp, head, start, length); } /** @@ -888,42 +889,8 @@ svc_rdma_read_multiple_chunks(struct svc_rqst *rqstp, static int svc_rdma_read_data_item(struct svc_rqst *rqstp, struct svc_rdma_recv_ctxt *head) { - struct xdr_buf *buf = &rqstp->rq_arg; - struct svc_rdma_chunk *chunk; - unsigned int length; - int ret; - - chunk = pcl_first_chunk(&head->rc_read_pcl); - ret = svc_rdma_build_read_chunk(rqstp, head, chunk); - if (ret < 0) - goto out; - - /* Split the Receive buffer between the head and tail - * buffers at Read chunk's position. XDR roundup of the - * chunk is not included in either the pagelist or in - * the tail. 
- */ - buf->tail[0].iov_base = buf->head[0].iov_base + chunk->ch_position; - buf->tail[0].iov_len = buf->head[0].iov_len - chunk->ch_position; - buf->head[0].iov_len = chunk->ch_position; - - /* Read chunk may need XDR roundup (see RFC 8166, s. 3.4.5.2). - * - * If the client already rounded up the chunk length, the - * length does not change. Otherwise, the length of the page - * list is increased to include XDR round-up. - * - * Currently these chunks always start at page offset 0, - * thus the rounded-up length never crosses a page boundary. - */ - buf->pages = &rqstp->rq_pages[0]; - length = xdr_align_size(chunk->ch_length); - buf->page_len = length; - buf->len += length; - buf->buflen += length; - -out: - return ret; + return svc_rdma_build_read_chunk(rqstp, head, + pcl_first_chunk(&head->rc_read_pcl)); } /** @@ -1051,23 +1018,28 @@ static int svc_rdma_read_call_chunk(struct svc_rqst *rqstp, static noinline int svc_rdma_read_special(struct svc_rqst *rqstp, struct svc_rdma_recv_ctxt *head) { - struct xdr_buf *buf = &rqstp->rq_arg; - int ret; - - ret = svc_rdma_read_call_chunk(rqstp, head); - if (ret < 0) - goto out; - - buf->len += head->rc_readbytes; - buf->buflen += head->rc_readbytes; + return svc_rdma_read_call_chunk(rqstp, head); +} - buf->head[0].iov_base = page_address(rqstp->rq_pages[0]); - buf->head[0].iov_len = min_t(size_t, PAGE_SIZE, head->rc_readbytes); - buf->pages = &rqstp->rq_pages[1]; - buf->page_len = head->rc_readbytes - buf->head[0].iov_len; +/* Pages under I/O have been copied to head->rc_pages. Ensure that + * svc_xprt_release() does not put them when svc_rdma_recvfrom() + * returns. This has to be done after all Read WRs are constructed + * to properly handle a page that happens to be part of I/O on behalf + * of two different RDMA segments. + * + * Note: if the subsequent post_send fails, these pages have already + * been moved to head->rc_pages and thus will be cleaned up by + * svc_rdma_recv_ctxt_put(). + */ +static void svc_rdma_clear_rqst_pages(struct svc_rqst *rqstp, + struct svc_rdma_recv_ctxt *head) +{ + unsigned int i; -out: - return ret; + for (i = 0; i < head->rc_page_count; i++) { + head->rc_pages[i] = rqstp->rq_pages[i]; + rqstp->rq_pages[i] = NULL; + } } /** @@ -1113,30 +1085,11 @@ int svc_rdma_process_read_list(struct svcxprt_rdma *rdma, ret = svc_rdma_read_multiple_chunks(rqstp, head); } else ret = svc_rdma_read_special(rqstp, head); + svc_rdma_clear_rqst_pages(rqstp, head); if (ret < 0) - goto out_err; + return ret; trace_svcrdma_post_read_chunk(&cc->cc_cid, cc->cc_sqecount); - init_completion(&cc->cc_done); ret = svc_rdma_post_chunk_ctxt(rdma, cc); - if (ret < 0) - goto out_err; - - ret = 1; - wait_for_completion(&cc->cc_done); - if (cc->cc_status != IB_WC_SUCCESS) - ret = -EIO; - - /* rq_respages starts after the last arg page */ - rqstp->rq_respages = &rqstp->rq_pages[head->rc_page_count]; - rqstp->rq_next_page = rqstp->rq_respages + 1; - - /* Ensure svc_rdma_recv_ctxt_put() does not release pages - * left in @rc_pages while I/O proceeds. - */ - head->rc_page_count = 0; - -out_err: - svc_rdma_cc_release(rdma, cc, DMA_FROM_DEVICE); - return ret; + return ret < 0 ? ret : 1; }
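Taken together, the series replaces the synchronous wait_for_completion() in the Read path with a two-stage receive: the first pass through svc_rdma_recvfrom() posts the Read WRs, saves the partially built rq_arg in rc_saved_arg, and returns; the Read completion handler queues the receive context on sc_read_complete_q and enqueues the transport; a second pass then restores the saved state and finishes building the Call message. What follows is a condensed user-space model of that flow, not the kernel code above; every model_* name is a hypothetical stand-in.

/*
 * Condensed user-space model of the two-stage receive flow implemented
 * by this series. Not kernel code: the fields and helpers below are
 * hypothetical stand-ins for rc_saved_arg, svc_rdma_recvfrom(), and
 * svc_rdma_wc_read_done().
 */
#include <stdbool.h>
#include <stdio.h>

struct model_recv_ctxt {
	bool needs_read;	/* the Call message carries Read chunks */
	bool read_done;		/* set by the completion handler */
	int saved_arg;		/* models rc_saved_arg */
};

/* One pass through the receive path. */
static int model_recvfrom(struct model_recv_ctxt *ctxt)
{
	if (ctxt->read_done) {
		/* Second pass: restore saved state, finish the message. */
		printf("finishing RPC Call, saved_arg=%d\n", ctxt->saved_arg);
		return 1;		/* message ready for the upper layer */
	}
	if (ctxt->needs_read) {
		/* First pass: save partial work, post Reads, defer. */
		ctxt->saved_arg = 42;
		printf("posted Read WRs, returning to the scheduler\n");
		return 0;		/* nothing for the upper layer yet */
	}
	return 1;			/* inline-only message, done at once */
}

/* Models the Read completion: mark the work done and re-arm it.
 * (The real handler queues the ctxt on sc_read_complete_q, sets
 * XPT_DATA, and enqueues the transport so recvfrom() runs again.) */
static void model_read_completion(struct model_recv_ctxt *ctxt)
{
	ctxt->read_done = true;
}

int main(void)
{
	struct model_recv_ctxt ctxt = { .needs_read = true };

	model_recvfrom(&ctxt);		/* pass 1: posts Reads, returns 0 */
	model_read_completion(&ctxt);	/* asynchronous Read completion */
	model_recvfrom(&ctxt);		/* pass 2: builds the Call message */
	return 0;
}

A flushed Read completion takes the other branch in svc_rdma_wc_read_done(): the chunk context is released, the receive context is put, and the connection is closed, since the Call message can no longer be assembled.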