| Message ID | 20211012181742.672391-10-axboe@kernel.dk (mailing list archive) |
|---|---|
| State | New, archived |
| Series | Batched completions |
On Tue, Oct 12, 2021 at 12:17:42PM -0600, Jens Axboe wrote:
> Trivial to do now, just need our own io_batch on the stack and pass that
> in to the usual command completion handling.
>
[...]
> @@ -1076,8 +1076,10 @@ static inline void nvme_update_cq_head(struct nvme_queue *nvmeq)
>
>  static inline int nvme_process_cq(struct nvme_queue *nvmeq)
>  {
> +	struct io_batch ib;
>  	int found = 0;
>
> +	ib.req_list = NULL;

Is this really more efficient than

	struct io_batch ib = { };

?
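For illustration, the two initializations being compared differ in how much of the struct gets written: an empty initializer zero-fills every member, while the explicit assignment touches only `req_list`. The sketch below uses a hypothetical stand-in layout for `struct io_batch`; the real definition is introduced earlier in this series.

```c
/* Hypothetical stand-in for struct io_batch, for illustration only;
 * the real definition is introduced earlier in this series. */
struct request;

struct io_batch {
	struct request *req_list;
};

static int init_styles(void)
{
	/* Empty initializer: zero-initializes every member of ib_a. */
	struct io_batch ib_a = { };

	/* Explicit assignment: only req_list is set; any other members
	 * (if the struct grew some) would stay uninitialized. */
	struct io_batch ib_b;

	ib_b.req_list = NULL;

	return ib_a.req_list == ib_b.req_list;	/* both NULL here */
}
```

With a single pointer member, both forms typically compile down to the same single store, which lines up with the "probably not" in the reply below.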
On 10/13/21 1:12 AM, Christoph Hellwig wrote:
> On Tue, Oct 12, 2021 at 12:17:42PM -0600, Jens Axboe wrote:
>> Trivial to do now, just need our own io_batch on the stack and pass that
>> in to the usual command completion handling.
>>
[...]
>> +	struct io_batch ib;
>>  	int found = 0;
>>
>> +	ib.req_list = NULL;
>
> Is this really more efficient than
>
> 	struct io_batch ib = { };

Probably not. I could add a DEFINE_IO_BATCH() helper, which would make it
easier if other kinds of init are ever needed.
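The DEFINE_IO_BATCH() helper floated here does not exist in this series; a minimal sketch of what such a macro could look like, following the kernel's usual DEFINE_* convention and reusing the illustrative struct from the sketch above:

```c
/* Hypothetical helper, not part of this series: declare an io_batch
 * on the stack and zero-initialize it, so callers don't have to
 * remember the manual req_list = NULL assignment. */
#define DEFINE_IO_BATCH(name)	struct io_batch name = { }

static int define_io_batch_example(void)
{
	DEFINE_IO_BATCH(ib);	/* ib.req_list starts out NULL */

	return ib.req_list == NULL;
}
```

If the struct later gained members needing non-zero defaults, only the macro would have to change, which is the "other kinds of init" point in the reply above.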
```diff
diff --git a/drivers/nvme/host/pci.c b/drivers/nvme/host/pci.c
index 4713da708cd4..fb3de6f68eb1 100644
--- a/drivers/nvme/host/pci.c
+++ b/drivers/nvme/host/pci.c
@@ -1076,8 +1076,10 @@ static inline void nvme_update_cq_head(struct nvme_queue *nvmeq)
 
 static inline int nvme_process_cq(struct nvme_queue *nvmeq)
 {
+	struct io_batch ib;
 	int found = 0;
 
+	ib.req_list = NULL;
 	while (nvme_cqe_pending(nvmeq)) {
 		found++;
 		/*
@@ -1085,12 +1087,15 @@ static inline int nvme_process_cq(struct nvme_queue *nvmeq)
 		 * the cqe requires a full read memory barrier
 		 */
 		dma_rmb();
-		nvme_handle_cqe(nvmeq, NULL, nvmeq->cq_head);
+		nvme_handle_cqe(nvmeq, &ib, nvmeq->cq_head);
 		nvme_update_cq_head(nvmeq);
 	}
 
-	if (found)
+	if (found) {
+		if (ib.req_list)
+			nvme_pci_complete_batch(&ib);
 		nvme_ring_cq_doorbell(nvmeq);
+	}
 
 	return found;
 }
```
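For context, the hunk above relies on nvme_handle_cqe() linking completed requests onto ib.req_list rather than completing them one at a time, with nvme_pci_complete_batch() draining the list once the completion queue has been walked. Below is a minimal, self-contained sketch of that accumulate-then-flush idea; the linkage field (`rq_next`) and helper names are hypothetical stand-ins, and the real plumbing lives in the block-layer patches earlier in the series.

```c
/* Minimal sketch of the accumulate-then-flush pattern behind the
 * patch. Field and helper names are hypothetical; the real linkage
 * is added by the block-layer patches earlier in this series. */
struct request {
	struct request *rq_next;	/* assumed singly linked batch list */
	int tag;
};

struct io_batch {
	struct request *req_list;
};

/* Completion handling links the request into the batch instead of
 * completing it right away. */
static void batch_add(struct io_batch *ib, struct request *rq)
{
	rq->rq_next = ib->req_list;
	ib->req_list = rq;
}

/* Called once per interrupt/poll pass: complete everything collected
 * while walking the completion queue, leaving the list empty again. */
static void batch_flush(struct io_batch *ib,
			void (*complete_one)(struct request *))
{
	while (ib->req_list) {
		struct request *rq = ib->req_list;

		ib->req_list = rq->rq_next;
		complete_one(rq);
	}
}
```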
Trivial to do now, just need our own io_batch on the stack and pass that
in to the usual command completion handling.

I pondered making this dependent on how many entries we had to process,
but even for a single entry there's no discernible difference in
performance or latency. Running a sync workload over io_uring:

t/io_uring -b512 -d1 -s1 -c1 -p0 -F1 -B1 -n2 /dev/nvme1n1 /dev/nvme2n1

yields the below performance before the patch:

IOPS=254820, BW=124MiB/s, IOS/call=1/1, inflight=(1 1)
IOPS=251174, BW=122MiB/s, IOS/call=1/1, inflight=(1 1)
IOPS=250806, BW=122MiB/s, IOS/call=1/1, inflight=(1 1)

and the following after:

IOPS=255972, BW=124MiB/s, IOS/call=1/1, inflight=(1 1)
IOPS=251920, BW=123MiB/s, IOS/call=1/1, inflight=(1 1)
IOPS=251794, BW=122MiB/s, IOS/call=1/1, inflight=(1 1)

which definitely isn't slower, about the same if you factor in a bit of
variance. For peak performance workloads, benchmarking shows a 2%
improvement.

Signed-off-by: Jens Axboe <axboe@kernel.dk>
---
 drivers/nvme/host/pci.c | 9 +++++++--
 1 file changed, 7 insertions(+), 2 deletions(-)