Message ID | 20230912095949.5474-7-aaptel@nvidia.com (mailing list archive) |
---|---|
State | Not Applicable |
Delegated to: | Netdev Maintainers |
Series | nvme-tcp receive offloads |
> @@ -1235,6 +1330,9 @@ static int nvme_tcp_try_send_cmd_pdu(struct nvme_tcp_request *req)
>  	else
>  		msg.msg_flags |= MSG_EOR;
>
> +	if (test_bit(NVME_TCP_Q_OFF_DDP, &queue->flags))
> +		nvme_tcp_setup_ddp(queue, blk_mq_rq_from_pdu(req));

Didn't we agree that setup_ddp should move to setup time and
not send time?
Sagi Grimberg <sagi@grimberg.me> writes:
>> @@ -1235,6 +1330,9 @@ static int nvme_tcp_try_send_cmd_pdu(struct nvme_tcp_request *req)
>>  	else
>>  		msg.msg_flags |= MSG_EOR;
>>
>> +	if (test_bit(NVME_TCP_Q_OFF_DDP, &queue->flags))
>> +		nvme_tcp_setup_ddp(queue, blk_mq_rq_from_pdu(req));
>
> Didn't we agree that setup_ddp should move to setup time and
> not send time?

We don't believe we reached a conclusion last time [1].

Moving the setup_ddp() call earlier at setup time is less efficient (up
to ~15% fewer IOPS) when it does the work on a different CPU.

1: https://lore.kernel.org/all/253h6oxvlwd.fsf@nvidia.com/
>>> @@ -1235,6 +1330,9 @@ static int nvme_tcp_try_send_cmd_pdu(struct nvme_tcp_request *req)
>>>  	else
>>>  		msg.msg_flags |= MSG_EOR;
>>>
>>> +	if (test_bit(NVME_TCP_Q_OFF_DDP, &queue->flags))
>>> +		nvme_tcp_setup_ddp(queue, blk_mq_rq_from_pdu(req));
>>
>> Didn't we agree that setup_ddp should move to setup time and
>> not send time?
>
> We don't believe we reached a conclusion last time [1].
>
> Moving the setup_ddp() call earlier at setup time is less efficient (up
> to ~15% fewer IOPS) when it does the work on a different CPU.
>
> 1: https://lore.kernel.org/all/253h6oxvlwd.fsf@nvidia.com/

Can you please explain why? sk_incoming_cpu is updated from the network
recv path while you are arguing that the timing matters before you even
send the pdu. I don't understand why that should matter.
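For context on that point: sk_incoming_cpu is the socket field that the
TCP receive path keeps updated with the CPU currently processing the
socket. A minimal sketch of the helper involved, paraphrased from
upstream include/net/sock.h (not part of this series, and subject to
change between kernel versions):

/* Paraphrased from include/net/sock.h: called from the TCP receive
 * path, so sk->sk_incoming_cpu tracks whichever CPU is reading the
 * socket. For nvme-tcp the socket is read by io_work on queue->io_cpu,
 * so the value converges to queue->io_cpu once traffic flows.
 */
static inline void sk_incoming_cpu_update(struct sock *sk)
{
	int cpu = raw_smp_processor_id();

	if (unlikely(READ_ONCE(sk->sk_incoming_cpu) != cpu))
		WRITE_ONCE(sk->sk_incoming_cpu, cpu);
}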
Sagi Grimberg <sagi@grimberg.me> writes:
> Can you please explain why? sk_incoming_cpu is updated from the network
> recv path while you are arguing that the timing matters before you even
> send the pdu. I don't understand why that should matter.

Sorry, the original answer was misleading.
The problem is not about the timing but only about which CPU the code is
running on. If we move setup_ddp() earlier as you suggested, it can
end up running on the wrong CPU.

Calling setup_ddp() in nvme_tcp_setup_cmd_pdu() will not guarantee we
are running on queue->io_cpu.
It's only during nvme_tcp_queue_request() that we either know we are running on
queue->io_cpu, or dispatch it to run on queue->io_cpu.

As it is only a performance optimization for the unlikely case, we can
move it to nvme_tcp_setup_cmd_pdu() as you suggested and reconsider it
in the future if needed.

Thanks
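The dispatch described above is the decision made in
nvme_tcp_queue_request(). A simplified sketch, paraphrased from upstream
drivers/nvme/host/tcp.c (details trimmed, and subject to differences
between kernel versions):

/* Simplified sketch of nvme_tcp_queue_request(): a request is sent
 * inline only when the submitting CPU already is queue->io_cpu;
 * otherwise it is deferred to io_work, which is pinned to
 * queue->io_cpu. Calling nvme_tcp_setup_ddp() from the send path
 * therefore always runs it on queue->io_cpu.
 */
static inline void nvme_tcp_queue_request(struct nvme_tcp_request *req,
		bool sync, bool last)
{
	struct nvme_tcp_queue *queue = req->queue;
	bool empty;

	empty = llist_add(&req->lentry, &queue->req_list) &&
		list_empty(&queue->send_list) && !queue->request;

	if (queue->io_cpu == raw_smp_processor_id() &&
	    sync && empty && mutex_trylock(&queue->send_mutex)) {
		/* already on the queue's CPU: send directly */
		nvme_tcp_send_all(queue);
		mutex_unlock(&queue->send_mutex);
	} else {
		/* "wrong" CPU: hand off to io_work on queue->io_cpu */
		queue_work_on(queue->io_cpu, nvme_tcp_wq, &queue->io_work);
	}
}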
>> Can you please explain why? sk_incoming_cpu is updated from the network
>> recv path while you are arguing that the timing matters before you even
>> send the pdu. I don't understand why that should matter.
>
> Sorry, the original answer was misleading.
> The problem is not about the timing but only about which CPU the code is
> running on. If we move setup_ddp() earlier as you suggested, it can
> end up running on the wrong CPU.

Please define wrong CPU.

> Calling setup_ddp() in nvme_tcp_setup_cmd_pdu() will not guarantee we
> are running on queue->io_cpu.
> It's only during nvme_tcp_queue_request() that we either know we are running on
> queue->io_cpu, or dispatch it to run on queue->io_cpu.

But the sk_incoming_cpu is updated with the cpu that is reading the
socket, so in fact it should converge to the io_cpu - shouldn't it?

Can you please provide a concrete explanation for the performance
degradation?

> As it is only a performance optimization for the unlikely case, we can
> move it to nvme_tcp_setup_cmd_pdu() as you suggested and reconsider it
> in the future if needed.

Would still like to understand this case.
Sagi Grimberg <sagi@grimberg.me> writes:
>> Sorry, the original answer was misleading.
>> The problem is not about the timing but only about which CPU the code is
>> running on. If we move setup_ddp() earlier as you suggested, it can
>> end up running on the wrong CPU.
>
> Please define wrong CPU.

Let's say we connect with 1 IO queue on CPU 0. We run our application,
which runs IOs on multiple CPU cores (0 and 7 as an example).

Whenever an IO is issued on CPU 7, setup_cmd_pdu() and queue_request()
will run in the context of CPU 7. We consider CPU 7 "wrong", because it
isn't q->io_cpu (CPU 0). It's only after queue_request() dispatches it
that it will run on CPU 0.

> But the sk_incoming_cpu is updated with the cpu that is reading the
> socket, so in fact it should converge to the io_cpu - shouldn't it?

Yes, that is true.

> Can you please provide a concrete explanation for the performance
> degradation?

We believe setup_ddp() should be called from the CPU core on which the
NVMe queue was created, so that all the IO-path SW-HW interaction runs
on the same CPU core.

The performance degradation is relevant only to the specific case in
which the application runs on a "wrong" CPU core, i.e. one on which the
NVMe queue was not created.

If you don't see this as a problem, we can move setup_ddp() to
setup_cmd_pdu().

Thanks
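For reference, the alternative placement being agreed to here would look
roughly like the hypothetical diff below, which only moves the call that
the posted patch adds; it assumes nvme_tcp_setup_cmd_pdu()'s request
parameter is named rq, as in upstream, and is a sketch rather than part
of this series:

@@ nvme_tcp_try_send_cmd_pdu() @@ (hypothetical)
 	else
 		msg.msg_flags |= MSG_EOR;
 
-	if (test_bit(NVME_TCP_Q_OFF_DDP, &queue->flags))
-		nvme_tcp_setup_ddp(queue, blk_mq_rq_from_pdu(req));
-
 	if (queue->hdr_digest && !req->offset)
 		nvme_tcp_hdgst(queue->snd_hash, pdu, sizeof(*pdu));
@@ nvme_tcp_setup_cmd_pdu() @@ (hypothetical)
 #ifdef CONFIG_ULP_DDP
 	req->offloaded = false;
 #endif
+	/* hypothetical: set up DDP at setup time, possibly off queue->io_cpu */
+	if (test_bit(NVME_TCP_Q_OFF_DDP, &queue->flags))
+		nvme_tcp_setup_ddp(queue, rq);
+
 	req->state = NVME_TCP_SEND_CMD_PDU;
 	req->status = cpu_to_le16(NVME_SC_SUCCESS);

Functionally the offload path is unchanged; the trade-off discussed
above is only which CPU ends up calling ulp_ddp_setup() for requests
submitted from a core other than queue->io_cpu.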
diff --git a/drivers/nvme/host/tcp.c b/drivers/nvme/host/tcp.c
index f8322a07e27e..48bc8afe272d 100644
--- a/drivers/nvme/host/tcp.c
+++ b/drivers/nvme/host/tcp.c
@@ -112,6 +112,13 @@ struct nvme_tcp_request {
 	size_t			offset;
 	size_t			data_sent;
 	enum nvme_tcp_send_state state;
+
+#ifdef CONFIG_ULP_DDP
+	bool			offloaded;
+	struct ulp_ddp_io	ddp;
+	__le16			ddp_status;
+	union nvme_result	result;
+#endif
 };
 
 enum nvme_tcp_queue_flags {
@@ -319,10 +326,72 @@ static bool nvme_tcp_ddp_query_limits(struct nvme_tcp_ctrl *ctrl)
 }
 
 static bool nvme_tcp_resync_request(struct sock *sk, u32 seq, u32 flags);
+static void nvme_tcp_ddp_teardown_done(void *ddp_ctx);
 static const struct ulp_ddp_ulp_ops nvme_tcp_ddp_ulp_ops = {
 	.resync_request		= nvme_tcp_resync_request,
+	.ddp_teardown_done	= nvme_tcp_ddp_teardown_done,
 };
 
+static void nvme_tcp_teardown_ddp(struct nvme_tcp_queue *queue,
+				  struct request *rq)
+{
+	struct net_device *netdev = queue->ctrl->ddp_netdev;
+	struct nvme_tcp_request *req = blk_mq_rq_to_pdu(rq);
+
+	ulp_ddp_teardown(netdev, queue->sock->sk, &req->ddp, rq);
+	sg_free_table_chained(&req->ddp.sg_table, SG_CHUNK_SIZE);
+}
+
+static void nvme_tcp_ddp_teardown_done(void *ddp_ctx)
+{
+	struct request *rq = ddp_ctx;
+	struct nvme_tcp_request *req = blk_mq_rq_to_pdu(rq);
+
+	if (!nvme_try_complete_req(rq, req->ddp_status, req->result))
+		nvme_complete_rq(rq);
+}
+
+static void nvme_tcp_setup_ddp(struct nvme_tcp_queue *queue,
+			       struct request *rq)
+{
+	struct net_device *netdev = queue->ctrl->ddp_netdev;
+	struct nvme_tcp_request *req = blk_mq_rq_to_pdu(rq);
+	int ret;
+
+	if (rq_data_dir(rq) != READ ||
+	    queue->ctrl->ddp_threshold > blk_rq_payload_bytes(rq))
+		return;
+
+	/*
+	 * DDP offload is best-effort, errors are ignored.
+	 */
+
+	req->ddp.command_id = nvme_cid(rq);
+	req->ddp.sg_table.sgl = req->ddp.first_sgl;
+	ret = sg_alloc_table_chained(&req->ddp.sg_table,
+				     blk_rq_nr_phys_segments(rq),
+				     req->ddp.sg_table.sgl, SG_CHUNK_SIZE);
+	if (ret)
+		goto err;
+	req->ddp.nents = blk_rq_map_sg(rq->q, rq, req->ddp.sg_table.sgl);
+
+	ret = ulp_ddp_setup(netdev, queue->sock->sk, &req->ddp);
+	if (ret) {
+		sg_free_table_chained(&req->ddp.sg_table, SG_CHUNK_SIZE);
+		goto err;
+	}
+
+	/* if successful, sg table is freed in nvme_tcp_teardown_ddp() */
+	req->offloaded = true;
+
+	return;
+err:
+	WARN_ONCE(ret, "ddp setup failed (queue 0x%x, cid 0x%x, ret=%d)",
+		  nvme_tcp_queue_id(queue),
+		  nvme_cid(rq),
+		  ret);
+}
+
 static int nvme_tcp_offload_socket(struct nvme_tcp_queue *queue)
 {
 	struct ulp_ddp_config config = {.type = ULP_DDP_NVME};
@@ -427,6 +496,10 @@ static bool nvme_tcp_resync_request(struct sock *sk, u32 seq, u32 flags)
 
 #else
 
+static void nvme_tcp_setup_ddp(struct nvme_tcp_queue *queue,
+			       struct request *rq)
+{}
+
 static void nvme_tcp_unoffload_socket(struct nvme_tcp_queue *queue)
 {}
 
@@ -705,6 +778,26 @@ static void nvme_tcp_error_recovery(struct nvme_ctrl *ctrl)
 	queue_work(nvme_reset_wq, &to_tcp_ctrl(ctrl)->err_work);
 }
 
+static void nvme_tcp_complete_request(struct request *rq,
+				      __le16 status,
+				      union nvme_result result,
+				      __u16 command_id)
+{
+#ifdef CONFIG_ULP_DDP
+	struct nvme_tcp_request *req = blk_mq_rq_to_pdu(rq);
+
+	if (req->offloaded) {
+		req->ddp_status = status;
+		req->result = result;
+		nvme_tcp_teardown_ddp(req->queue, rq);
+		return;
+	}
+#endif
+
+	if (!nvme_try_complete_req(rq, status, result))
+		nvme_complete_rq(rq);
+}
+
 static int nvme_tcp_process_nvme_cqe(struct nvme_tcp_queue *queue,
 				     struct nvme_completion *cqe)
 {
@@ -724,10 +817,9 @@ static int nvme_tcp_process_nvme_cqe(struct nvme_tcp_queue *queue,
 	if (req->status == cpu_to_le16(NVME_SC_SUCCESS))
 		req->status = cqe->status;
 
-	if (!nvme_try_complete_req(rq, req->status, cqe->result))
-		nvme_complete_rq(rq);
+	nvme_tcp_complete_request(rq, req->status, cqe->result,
+				  cqe->command_id);
 	queue->nr_cqe++;
-
 	return 0;
 }
 
@@ -925,10 +1017,13 @@ static int nvme_tcp_recv_pdu(struct nvme_tcp_queue *queue, struct sk_buff *skb,
 
 static inline void nvme_tcp_end_request(struct request *rq, u16 status)
 {
+	struct nvme_tcp_request *req = blk_mq_rq_to_pdu(rq);
+	struct nvme_tcp_queue *queue = req->queue;
+	struct nvme_tcp_data_pdu *pdu = (void *)queue->pdu;
 	union nvme_result res = {};
 
-	if (!nvme_try_complete_req(rq, cpu_to_le16(status << 1), res))
-		nvme_complete_rq(rq);
+	nvme_tcp_complete_request(rq, cpu_to_le16(status << 1), res,
+				  pdu->command_id);
 }
 
 static int nvme_tcp_recv_data(struct nvme_tcp_queue *queue, struct sk_buff *skb,
@@ -1235,6 +1330,9 @@ static int nvme_tcp_try_send_cmd_pdu(struct nvme_tcp_request *req)
 	else
 		msg.msg_flags |= MSG_EOR;
 
+	if (test_bit(NVME_TCP_Q_OFF_DDP, &queue->flags))
+		nvme_tcp_setup_ddp(queue, blk_mq_rq_from_pdu(req));
+
 	if (queue->hdr_digest && !req->offset)
 		nvme_tcp_hdgst(queue->snd_hash, pdu, sizeof(*pdu));
 
@@ -2541,6 +2639,9 @@ static blk_status_t nvme_tcp_setup_cmd_pdu(struct nvme_ns *ns,
 	if (ret)
 		return ret;
 
+#ifdef CONFIG_ULP_DDP
+	req->offloaded = false;
+#endif
 	req->state = NVME_TCP_SEND_CMD_PDU;
 	req->status = cpu_to_le16(NVME_SC_SUCCESS);
 	req->offset = 0;