| Message ID | 20211012181742.672391-10-axboe@kernel.dk (mailing list archive) |
|---|---|
| State | New, archived |
| Series | Batched completions |
On Tue, Oct 12, 2021 at 12:17:42PM -0600, Jens Axboe wrote:
> Trivial to do now, just need our own io_batch on the stack and pass that
> in to the usual command completion handling.
>
[...]
> @@ -1076,8 +1076,10 @@ static inline void nvme_update_cq_head(struct nvme_queue *nvmeq)
>
>  static inline int nvme_process_cq(struct nvme_queue *nvmeq)
>  {
> +	struct io_batch ib;
>  	int found = 0;
>
> +	ib.req_list = NULL;

Is this really more efficient than

	struct io_batch ib = { };

?
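For illustration, the two initializations being compared differ in how much of the struct gets written: an empty initializer zero-fills every member, while the explicit assignment touches only `req_list`. The sketch below uses a hypothetical stand-in layout for `struct io_batch`; the real definition is introduced earlier in this series.

```c
/* Hypothetical stand-in for struct io_batch, for illustration only;
 * the real definition is introduced earlier in this series. */
struct request;

struct io_batch {
	struct request *req_list;
};

static int init_styles(void)
{
	/* Empty initializer: zero-initializes every member of ib_a. */
	struct io_batch ib_a = { };

	/* Explicit assignment: only req_list is set; any other members
	 * (if the struct grew some) would stay uninitialized. */
	struct io_batch ib_b;

	ib_b.req_list = NULL;

	return ib_a.req_list == ib_b.req_list;	/* both NULL here */
}
```

With a single pointer member, both forms typically compile down to the same single store, which lines up with the "probably not" in the reply below.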
On 10/13/21 1:12 AM, Christoph Hellwig wrote:
> On Tue, Oct 12, 2021 at 12:17:42PM -0600, Jens Axboe wrote:
>> Trivial to do now, just need our own io_batch on the stack and pass that
>> in to the usual command completion handling.
>>
[...]
>> +	struct io_batch ib;
>>  	int found = 0;
>>
>> +	ib.req_list = NULL;
>
> Is this really more efficient than
>
> 	struct io_batch ib = { };

Probably not. I could add a DEFINE_IO_BATCH() helper, which would make it
easier if other kinds of init are ever needed.
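The DEFINE_IO_BATCH() helper floated here does not exist in this series; a minimal sketch of what such a macro could look like, following the kernel's usual DEFINE_* convention and reusing the illustrative struct from the sketch above:

```c
/* Hypothetical helper, not part of this series: declare an io_batch
 * on the stack and zero-initialize it, so callers don't have to
 * remember the manual req_list = NULL assignment. */
#define DEFINE_IO_BATCH(name)	struct io_batch name = { }

static int define_io_batch_example(void)
{
	DEFINE_IO_BATCH(ib);	/* ib.req_list starts out NULL */

	return ib.req_list == NULL;
}
```

If the struct later gained members needing non-zero defaults, only the macro would have to change, which is the "other kinds of init" point in the reply above.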
```diff
diff --git a/drivers/nvme/host/pci.c b/drivers/nvme/host/pci.c
index 4713da708cd4..fb3de6f68eb1 100644
--- a/drivers/nvme/host/pci.c
+++ b/drivers/nvme/host/pci.c
@@ -1076,8 +1076,10 @@ static inline void nvme_update_cq_head(struct nvme_queue *nvmeq)
 
 static inline int nvme_process_cq(struct nvme_queue *nvmeq)
 {
+	struct io_batch ib;
 	int found = 0;
 
+	ib.req_list = NULL;
 	while (nvme_cqe_pending(nvmeq)) {
 		found++;
 		/*
@@ -1085,12 +1087,15 @@ static inline int nvme_process_cq(struct nvme_queue *nvmeq)
 		 * the cqe requires a full read memory barrier
 		 */
 		dma_rmb();
-		nvme_handle_cqe(nvmeq, NULL, nvmeq->cq_head);
+		nvme_handle_cqe(nvmeq, &ib, nvmeq->cq_head);
 		nvme_update_cq_head(nvmeq);
 	}
 
-	if (found)
+	if (found) {
+		if (ib.req_list)
+			nvme_pci_complete_batch(&ib);
 		nvme_ring_cq_doorbell(nvmeq);
+	}
 
 	return found;
 }
```
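For context, the hunk above relies on nvme_handle_cqe() linking completed requests onto ib.req_list rather than completing them one at a time, with nvme_pci_complete_batch() draining the list once the completion queue has been walked. Below is a minimal, self-contained sketch of that accumulate-then-flush idea; the linkage field (`rq_next`) and helper names are hypothetical stand-ins, and the real plumbing lives in the block-layer patches earlier in the series.

```c
/* Minimal sketch of the accumulate-then-flush pattern behind the
 * patch. Field and helper names are hypothetical; the real linkage
 * is added by the block-layer patches earlier in this series. */
struct request {
	struct request *rq_next;	/* assumed singly linked batch list */
	int tag;
};

struct io_batch {
	struct request *req_list;
};

/* Completion handling links the request into the batch instead of
 * completing it right away. */
static void batch_add(struct io_batch *ib, struct request *rq)
{
	rq->rq_next = ib->req_list;
	ib->req_list = rq;
}

/* Called once per interrupt/poll pass: complete everything collected
 * while walking the completion queue, leaving the list empty again. */
static void batch_flush(struct io_batch *ib,
			void (*complete_one)(struct request *))
{
	while (ib->req_list) {
		struct request *rq = ib->req_list;

		ib->req_list = rq->rq_next;
		complete_one(rq);
	}
}
```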
Trivial to do now, just need our own io_batch on the stack and pass that
in to the usual command completion handling.

I pondered making this dependent on how many entries we had to process,
but even for a single entry there's no discernible difference in
performance or latency. Running a sync workload over io_uring:

t/io_uring -b512 -d1 -s1 -c1 -p0 -F1 -B1 -n2 /dev/nvme1n1 /dev/nvme2n1

yields the below performance before the patch:

IOPS=254820, BW=124MiB/s, IOS/call=1/1, inflight=(1 1)
IOPS=251174, BW=122MiB/s, IOS/call=1/1, inflight=(1 1)
IOPS=250806, BW=122MiB/s, IOS/call=1/1, inflight=(1 1)

and the following after:

IOPS=255972, BW=124MiB/s, IOS/call=1/1, inflight=(1 1)
IOPS=251920, BW=123MiB/s, IOS/call=1/1, inflight=(1 1)
IOPS=251794, BW=122MiB/s, IOS/call=1/1, inflight=(1 1)

which definitely isn't slower, about the same if you factor in a bit of
variance. For peak performance workloads, benchmarking shows a 2%
improvement.

Signed-off-by: Jens Axboe <axboe@kernel.dk>
---
 drivers/nvme/host/pci.c | 9 +++++++--
 1 file changed, 7 insertions(+), 2 deletions(-)