[v15,06/20] nvme-tcp: Add DDP data-path

Message ID 20230912095949.5474-7-aaptel@nvidia.com (mailing list archive)
State Not Applicable
Delegated to: Netdev Maintainers
Series nvme-tcp receive offloads

Checks

Context Check Description
netdev/series_format fail Series longer than 15 patches (and no cover letter)
netdev/tree_selection success Guessed tree name to be net-next, async
netdev/fixes_present success Fixes tag not required for -next series
netdev/header_inline success No static functions without inline keyword in header files
netdev/build_32bit success Errors and warnings before: 1340 this patch: 1340
netdev/cc_maintainers success CCed 5 of 5 maintainers
netdev/build_clang success Errors and warnings before: 1363 this patch: 1363
netdev/verify_signedoff success Signed-off-by tag matches author and committer
netdev/deprecated_api success None detected
netdev/check_selftest success No net selftest shell script
netdev/verify_fixes success No Fixes tag
netdev/build_allmodconfig_warn success Errors and warnings before: 1363 this patch: 1363
netdev/checkpatch success total: 0 errors, 0 warnings, 0 checks, 166 lines checked
netdev/kdoc success Errors and warnings before: 0 this patch: 0
netdev/source_inline success Was 0 now: 0

Commit Message

Aurelien Aptel Sept. 12, 2023, 9:59 a.m. UTC
From: Boris Pismenny <borisp@nvidia.com>

Introduce the NVMe-TCP DDP data-path offload.
Using this interface, the NIC hardware will scatter the TCP payload directly
into the BIO pages according to the command_id in the PDU.
To maintain the correctness of the network stack, the driver is expected
to construct SKBs that point to the BIO pages.
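As a rough illustration of the last point (a hypothetical driver-side sketch,
not part of this patch): "constructing SKBs that point to the BIO pages" means
the NIC driver attaches the destination pages as SKB fragments instead of
copying the payload, e.g. with skb_add_rx_frag():

#include <linux/skbuff.h>

/*
 * Hypothetical sketch: after the HW has placed the PDU payload into a
 * destination BIO page, the driver points an SKB fragment at that same
 * page, so the TCP stack still sees the payload bytes it expects.
 * Using PAGE_SIZE as truesize is an assumption made for this sketch.
 */
static void nic_add_ddp_frag(struct sk_buff *skb, struct page *bio_page,
                             unsigned int off, unsigned int len)
{
        skb_add_rx_frag(skb, skb_shinfo(skb)->nr_frags, bio_page, off, len,
                        PAGE_SIZE);
}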

The data-path interface contains two routines: setup/teardown.
The setup provides the mapping from command_id to the request buffers,
while the teardown removes this mapping.
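Condensed from the patch below (error handling elided), the setup side builds
the mapping from the command_id to the request's SG list of BIO pages, and the
teardown side removes it:

        /* setup: map command_id -> the request's buffers */
        req->ddp.command_id = nvme_cid(rq);
        req->ddp.sg_table.sgl = req->ddp.first_sgl;
        sg_alloc_table_chained(&req->ddp.sg_table, blk_rq_nr_phys_segments(rq),
                               req->ddp.sg_table.sgl, SG_CHUNK_SIZE);
        req->ddp.nents = blk_rq_map_sg(rq->q, rq, req->ddp.sg_table.sgl);
        ulp_ddp_setup(netdev, queue->sock->sk, &req->ddp);

        /* teardown: remove the mapping and free the SG table */
        ulp_ddp_teardown(netdev, queue->sock->sk, &req->ddp, rq);
        sg_free_table_chained(&req->ddp.sg_table, SG_CHUNK_SIZE);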

For efficiency, we introduce an asynchronous nvme completion, which is
split between NVMe-TCP and the NIC driver as follows:
NVMe-TCP performs the NVMe-specific completion, while the NIC driver
performs the generic blk-mq completion.
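Condensed from the patch below, the split looks like this: the NVMe-TCP side
only records the status/result and kicks the teardown, and the generic blk-mq
completion runs later from the NIC driver's ddp_teardown_done() callback:

static void nvme_tcp_complete_request(struct request *rq, __le16 status,
                                      union nvme_result result, __u16 command_id)
{
        struct nvme_tcp_request *req = blk_mq_rq_to_pdu(rq);

        if (req->offloaded) {
                req->ddp_status = status;       /* stash for the callback */
                req->result = result;
                nvme_tcp_teardown_ddp(req->queue, rq);
                return;                         /* completed asynchronously */
        }
        if (!nvme_try_complete_req(rq, status, result))
                nvme_complete_rq(rq);
}

/* called by the NIC driver once the HW offload context is released */
static void nvme_tcp_ddp_teardown_done(void *ddp_ctx)
{
        struct request *rq = ddp_ctx;
        struct nvme_tcp_request *req = blk_mq_rq_to_pdu(rq);

        if (!nvme_try_complete_req(rq, req->ddp_status, req->result))
                nvme_complete_rq(rq);
}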

Signed-off-by: Boris Pismenny <borisp@nvidia.com>
Signed-off-by: Ben Ben-Ishay <benishay@nvidia.com>
Signed-off-by: Or Gerlitz <ogerlitz@nvidia.com>
Signed-off-by: Yoray Zack <yorayz@nvidia.com>
Signed-off-by: Shai Malin <smalin@nvidia.com>
Signed-off-by: Aurelien Aptel <aaptel@nvidia.com>
Reviewed-by: Chaitanya Kulkarni <kch@nvidia.com>
---
 drivers/nvme/host/tcp.c | 111 ++++++++++++++++++++++++++++++++++++++--
 1 file changed, 106 insertions(+), 5 deletions(-)

Comments

Sagi Grimberg Sept. 13, 2023, 10:51 a.m. UTC | #1
> @@ -1235,6 +1330,9 @@ static int nvme_tcp_try_send_cmd_pdu(struct nvme_tcp_request *req)
>   	else
>   		msg.msg_flags |= MSG_EOR;
>   
> +	if (test_bit(NVME_TCP_Q_OFF_DDP, &queue->flags))
> +		nvme_tcp_setup_ddp(queue, blk_mq_rq_from_pdu(req));

Didn't we agree that setup_ddp should move to setup time and
not send time?
Aurelien Aptel Sept. 18, 2023, 6:26 p.m. UTC | #2
Sagi Grimberg <sagi@grimberg.me> writes:
>> @@ -1235,6 +1330,9 @@ static int nvme_tcp_try_send_cmd_pdu(struct nvme_tcp_request *req)
>>       else
>>               msg.msg_flags |= MSG_EOR;
>>
>> +     if (test_bit(NVME_TCP_Q_OFF_DDP, &queue->flags))
>> +             nvme_tcp_setup_ddp(queue, blk_mq_rq_from_pdu(req));
>
> Didn't we agree that setup_ddp should move to setup time and
> not send time?

We believe we didn't reach a conclusion last time [1].

Moving the setup_ddp() call earlier, to setup time, is less efficient (up
to ~15% fewer IOPS) when it does the work on a different CPU.

1: https://lore.kernel.org/all/253h6oxvlwd.fsf@nvidia.com/
Sagi Grimberg Sept. 19, 2023, 7:04 a.m. UTC | #3
>>> @@ -1235,6 +1330,9 @@ static int nvme_tcp_try_send_cmd_pdu(struct nvme_tcp_request *req)
>>>        else
>>>                msg.msg_flags |= MSG_EOR;
>>>
>>> +     if (test_bit(NVME_TCP_Q_OFF_DDP, &queue->flags))
>>> +             nvme_tcp_setup_ddp(queue, blk_mq_rq_from_pdu(req));
>>
>> Didn't we agree that setup_ddp should move to setup time and
>> not send time?
> 
> We believe we didn't reach a conclusion last time [1].
> 
> Moving the setup_ddp() call earlier, to setup time, is less efficient (up
> to ~15% fewer IOPS) when it does the work on a different CPU.
> 
> 1: https://lore.kernel.org/all/253h6oxvlwd.fsf@nvidia.com/

Can you please explain why? sk_incoming_cpu is updated from the network
recv path while you are arguing that the timing matters before you even
send the pdu. I don't understand why that should matter.
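
(For reference, a rough sketch of that update in the receive path; this mirrors
the mainline sk_incoming_cpu_update() helper and is not part of this patch:)

static inline void sk_incoming_cpu_update(struct sock *sk)
{
        int cpu = raw_smp_processor_id();

        /* remember which CPU is reading this socket */
        if (unlikely(READ_ONCE(sk->sk_incoming_cpu) != cpu))
                WRITE_ONCE(sk->sk_incoming_cpu, cpu);
}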
Aurelien Aptel Sept. 20, 2023, 8:39 a.m. UTC | #4
Sagi Grimberg <sagi@grimberg.me> writes:
> Can you please explain why? sk_incoming_cpu is updated from the network
> recv path while you are arguing that the timing matters before you even
> send the pdu. I don't understand why that should matter.

Sorry, the original answer was misleading.
The problem is not about the timing but only about which CPU the code is
running on.  If we move setup_ddp() earlier as you suggested, it can
result in it running on the wrong CPU.

Calling setup_ddp() in nvme_tcp_setup_cmd_pdu() will not guarantee we
are running on queue->io_cpu. It's only during
nvme_tcp_queue_request() that we either know we are running on
queue->io_cpu, or dispatch it to run on queue->io_cpu.
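
For reference, a simplified sketch of mainline nvme_tcp_queue_request() (not
part of this patch; exact conditions vary by kernel version), showing where we
either learn we are already on queue->io_cpu or hand the work off to it:

static void nvme_tcp_queue_request(struct nvme_tcp_request *req,
                                   bool sync, bool last)
{
        struct nvme_tcp_queue *queue = req->queue;
        bool empty;

        empty = llist_add(&req->lentry, &queue->req_list) &&
                list_empty(&queue->send_list) && !queue->request;

        if (queue->io_cpu == raw_smp_processor_id() &&
            sync && empty && mutex_trylock(&queue->send_mutex)) {
                /* caller already runs on queue->io_cpu: send inline */
                nvme_tcp_send_all(queue);
                mutex_unlock(&queue->send_mutex);
        } else if (last) {
                /* otherwise the send happens later from io_work on io_cpu */
                queue_work_on(queue->io_cpu, nvme_tcp_wq, &queue->io_work);
        }
}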

As it is only a performance optimization for an unlikely case, we can
move it to nvme_tcp_setup_cmd_pdu() as you suggested and reconsider it in
the future if it turns out to be needed.
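
A hypothetical sketch of that move (not what this patch does; the elided body
stands for the existing PDU initialization in nvme_tcp_setup_cmd_pdu()):

static blk_status_t nvme_tcp_setup_cmd_pdu(struct nvme_ns *ns,
                                           struct request *rq)
{
        struct nvme_tcp_request *req = blk_mq_rq_to_pdu(rq);
        struct nvme_tcp_queue *queue = req->queue;

        /* ... existing PDU initialization kept as-is ... */

        /* hypothetical: program the DDP mapping at setup time, on whatever
         * CPU issued the IO, instead of from nvme_tcp_try_send_cmd_pdu()
         */
        if (test_bit(NVME_TCP_Q_OFF_DDP, &queue->flags))
                nvme_tcp_setup_ddp(queue, rq);

        return BLK_STS_OK;
}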

Thanks
Sagi Grimberg Sept. 20, 2023, 10:11 a.m. UTC | #5
>> Can you please explain why? sk_incoming_cpu is updated from the network
>> recv path while you are arguing that the timing matters before you even
>> send the pdu. I don't understand why that should matter.
> 
> Sorry, the original answer was misleading.
> The problem is not about the timing but only about which CPU the code is
> running on.  If we move setup_ddp() earlier as you suggested, it can
> result in it running on the wrong CPU.

Please define wrong CPU.

> Calling setup_ddp() in nvme_tcp_setup_cmd_pdu() will not guarantee we
> are running on queue->io_cpu.
> It's only during nvme_tcp_queue_request() that we either know we are running on
> queue->io_cpu, or dispatch it to run on queue->io_cpu.

But sk_incoming_cpu is updated with the cpu that is reading the
socket, so in fact it should converge to the io_cpu - shouldn't it?

Can you please provide a concrete explanation of the performance
degradation?

> As it is only a performance optimization for an unlikely case, we can
> move it to nvme_tcp_setup_cmd_pdu() as you suggested and reconsider it in
> the future if it turns out to be needed.

Would still like to understand this case.
Aurelien Aptel Sept. 20, 2023, 4:04 p.m. UTC | #6
Sagi Grimberg <sagi@grimberg.me> writes:
>> Sorry, the original answer was misleading.
>> The problem is not about the timing but only about which CPU the code is
>> running on.  If we move setup_ddp() earlier as you suggested, it can
>> result in it running on the wrong CPU.
>
> Please define wrong CPU.

Let's say we connect with 1 IO queue on CPU 0.

We run our application, which issues IOs on multiple CPU cores (0 and 7,
for example).
Whenever an IO is issued on CPU 7, setup_cmd_pdu() and queue_request()
will run in the context of CPU 7.
We consider CPU 7 "wrong" because it isn't q->io_cpu (CPU 0).

It's only after queue_request() dispatches it that it will run on CPU 0.

> But sk_incoming_cpu is updated with the cpu that is reading the
> socket, so in fact it should converge to the io_cpu - shouldn't it?

Yes, that is true.

> Can you please provide a concrete explanation of the performance
> degradation?

We believe setup_ddp() should be called from the CPU core on which the
NVMe queue was created, so that all the IO-path SW-HW interaction runs on
the same CPU core.
The performance degradation is relevant only to the specific case in which
the application runs on a "wrong" CPU core, one on which the NVMe queue
was not created.

If you don’t see it as a problem, we can move setup_ddp() to
setup_cmd_pdu().

Thanks

Patch

diff --git a/drivers/nvme/host/tcp.c b/drivers/nvme/host/tcp.c
index f8322a07e27e..48bc8afe272d 100644
--- a/drivers/nvme/host/tcp.c
+++ b/drivers/nvme/host/tcp.c
@@ -112,6 +112,13 @@  struct nvme_tcp_request {
 	size_t			offset;
 	size_t			data_sent;
 	enum nvme_tcp_send_state state;
+
+#ifdef CONFIG_ULP_DDP
+	bool			offloaded;
+	struct ulp_ddp_io	ddp;
+	__le16			ddp_status;
+	union nvme_result	result;
+#endif
 };
 
 enum nvme_tcp_queue_flags {
@@ -319,10 +326,72 @@  static bool nvme_tcp_ddp_query_limits(struct nvme_tcp_ctrl *ctrl)
 }
 
 static bool nvme_tcp_resync_request(struct sock *sk, u32 seq, u32 flags);
+static void nvme_tcp_ddp_teardown_done(void *ddp_ctx);
 static const struct ulp_ddp_ulp_ops nvme_tcp_ddp_ulp_ops = {
 	.resync_request		= nvme_tcp_resync_request,
+	.ddp_teardown_done	= nvme_tcp_ddp_teardown_done,
 };
 
+static void nvme_tcp_teardown_ddp(struct nvme_tcp_queue *queue,
+				  struct request *rq)
+{
+	struct net_device *netdev = queue->ctrl->ddp_netdev;
+	struct nvme_tcp_request *req = blk_mq_rq_to_pdu(rq);
+
+	ulp_ddp_teardown(netdev, queue->sock->sk, &req->ddp, rq);
+	sg_free_table_chained(&req->ddp.sg_table, SG_CHUNK_SIZE);
+}
+
+static void nvme_tcp_ddp_teardown_done(void *ddp_ctx)
+{
+	struct request *rq = ddp_ctx;
+	struct nvme_tcp_request *req = blk_mq_rq_to_pdu(rq);
+
+	if (!nvme_try_complete_req(rq, req->ddp_status, req->result))
+		nvme_complete_rq(rq);
+}
+
+static void nvme_tcp_setup_ddp(struct nvme_tcp_queue *queue,
+			       struct request *rq)
+{
+	struct net_device *netdev = queue->ctrl->ddp_netdev;
+	struct nvme_tcp_request *req = blk_mq_rq_to_pdu(rq);
+	int ret;
+
+	if (rq_data_dir(rq) != READ ||
+	    queue->ctrl->ddp_threshold > blk_rq_payload_bytes(rq))
+		return;
+
+	/*
+	 * DDP offload is best-effort, errors are ignored.
+	 */
+
+	req->ddp.command_id = nvme_cid(rq);
+	req->ddp.sg_table.sgl = req->ddp.first_sgl;
+	ret = sg_alloc_table_chained(&req->ddp.sg_table,
+				     blk_rq_nr_phys_segments(rq),
+				     req->ddp.sg_table.sgl, SG_CHUNK_SIZE);
+	if (ret)
+		goto err;
+	req->ddp.nents = blk_rq_map_sg(rq->q, rq, req->ddp.sg_table.sgl);
+
+	ret = ulp_ddp_setup(netdev, queue->sock->sk, &req->ddp);
+	if (ret) {
+		sg_free_table_chained(&req->ddp.sg_table, SG_CHUNK_SIZE);
+		goto err;
+	}
+
+	/* if successful, sg table is freed in nvme_tcp_teardown_ddp() */
+	req->offloaded = true;
+
+	return;
+err:
+	WARN_ONCE(ret, "ddp setup failed (queue 0x%x, cid 0x%x, ret=%d)",
+		  nvme_tcp_queue_id(queue),
+		  nvme_cid(rq),
+		  ret);
+}
+
 static int nvme_tcp_offload_socket(struct nvme_tcp_queue *queue)
 {
 	struct ulp_ddp_config config = {.type = ULP_DDP_NVME};
@@ -427,6 +496,10 @@  static bool nvme_tcp_resync_request(struct sock *sk, u32 seq, u32 flags)
 
 #else
 
+static void nvme_tcp_setup_ddp(struct nvme_tcp_queue *queue,
+			       struct request *rq)
+{}
+
 static void nvme_tcp_unoffload_socket(struct nvme_tcp_queue *queue)
 {}
 
@@ -705,6 +778,26 @@  static void nvme_tcp_error_recovery(struct nvme_ctrl *ctrl)
 	queue_work(nvme_reset_wq, &to_tcp_ctrl(ctrl)->err_work);
 }
 
+static void nvme_tcp_complete_request(struct request *rq,
+				      __le16 status,
+				      union nvme_result result,
+				      __u16 command_id)
+{
+#ifdef CONFIG_ULP_DDP
+	struct nvme_tcp_request *req = blk_mq_rq_to_pdu(rq);
+
+	if (req->offloaded) {
+		req->ddp_status = status;
+		req->result = result;
+		nvme_tcp_teardown_ddp(req->queue, rq);
+		return;
+	}
+#endif
+
+	if (!nvme_try_complete_req(rq, status, result))
+		nvme_complete_rq(rq);
+}
+
 static int nvme_tcp_process_nvme_cqe(struct nvme_tcp_queue *queue,
 		struct nvme_completion *cqe)
 {
@@ -724,10 +817,9 @@  static int nvme_tcp_process_nvme_cqe(struct nvme_tcp_queue *queue,
 	if (req->status == cpu_to_le16(NVME_SC_SUCCESS))
 		req->status = cqe->status;
 
-	if (!nvme_try_complete_req(rq, req->status, cqe->result))
-		nvme_complete_rq(rq);
+	nvme_tcp_complete_request(rq, req->status, cqe->result,
+				  cqe->command_id);
 	queue->nr_cqe++;
-
 	return 0;
 }
 
@@ -925,10 +1017,13 @@  static int nvme_tcp_recv_pdu(struct nvme_tcp_queue *queue, struct sk_buff *skb,
 
 static inline void nvme_tcp_end_request(struct request *rq, u16 status)
 {
+	struct nvme_tcp_request *req = blk_mq_rq_to_pdu(rq);
+	struct nvme_tcp_queue *queue = req->queue;
+	struct nvme_tcp_data_pdu *pdu = (void *)queue->pdu;
 	union nvme_result res = {};
 
-	if (!nvme_try_complete_req(rq, cpu_to_le16(status << 1), res))
-		nvme_complete_rq(rq);
+	nvme_tcp_complete_request(rq, cpu_to_le16(status << 1), res,
+				  pdu->command_id);
 }
 
 static int nvme_tcp_recv_data(struct nvme_tcp_queue *queue, struct sk_buff *skb,
@@ -1235,6 +1330,9 @@  static int nvme_tcp_try_send_cmd_pdu(struct nvme_tcp_request *req)
 	else
 		msg.msg_flags |= MSG_EOR;
 
+	if (test_bit(NVME_TCP_Q_OFF_DDP, &queue->flags))
+		nvme_tcp_setup_ddp(queue, blk_mq_rq_from_pdu(req));
+
 	if (queue->hdr_digest && !req->offset)
 		nvme_tcp_hdgst(queue->snd_hash, pdu, sizeof(*pdu));
 
@@ -2541,6 +2639,9 @@  static blk_status_t nvme_tcp_setup_cmd_pdu(struct nvme_ns *ns,
 	if (ret)
 		return ret;
 
+#ifdef CONFIG_ULP_DDP
+	req->offloaded = false;
+#endif
 	req->state = NVME_TCP_SEND_CMD_PDU;
 	req->status = cpu_to_le16(NVME_SC_SUCCESS);
 	req->offset = 0;