[PATCHv2,1/3] block: introduce rq_list_for_each_safe macro

Message ID 20211227164138.2488066-1-kbusch@kernel.org (mailing list archive)
State New, archived

Commit Message

Keith Busch Dec. 27, 2021, 4:41 p.m. UTC
While iterating a list, a particular request may need to be removed for
special handling. Provide an iterator that can safely handle that.

Signed-off-by: Keith Busch <kbusch@kernel.org>
---
 include/linux/blkdev.h | 4 ++++
 1 file changed, 4 insertions(+)

Comments

Christoph Hellwig Dec. 29, 2021, 5:39 p.m. UTC | #1
On Mon, Dec 27, 2021 at 08:41:36AM -0800, Keith Busch wrote:
> While iterating a list, a particular request may need to be removed for
> special handling. Provide an iterator that can safely handle that.

Looks good:

Reviewed-by: Christoph Hellwig <hch@lst.de>

(except for the fact that it, just like the other rq_list helpers
really should go into blk-mq.h, where all request related bits moved
early in this cycle)
Keith Busch Dec. 29, 2021, 8:57 p.m. UTC | #2
On Wed, Dec 29, 2021 at 06:39:02PM +0100, Christoph Hellwig wrote:
> (except for the fact that it, just like the other rq_list helpers
> really should go into blk-mq.h, where all request related bits moved
> early in this cycle)

Agreed, I just put it here because it's where the other macros live. But
'struct request' doesn't exist in this header, so it does seem out of
place.  For v3, I'll add a preceding patch to move them all to blk-mq.h.
Max Gurtovoy Dec. 30, 2021, 2:38 p.m. UTC | #3
On 12/29/2021 10:57 PM, Keith Busch wrote:
> On Wed, Dec 29, 2021 at 06:39:02PM +0100, Christoph Hellwig wrote:
>> (except for the fact that it, just like the other rq_list helpers
>> really should go into blk-mq.h, where all request related bits moved
>> early in this cycle)
> Agreed, I just put it here because it's where the other macros live. But
> 'struct request' doesn't exist in this header, so it does seem out of
> place.  For v3, I'll add a preceding patch to move them all to blk-mq.h.

Did you see the discussion I had with Jens ?

It seems to work only for non-shared tagsets, which means it will work
only for NVMe devices that have a single namespace and will not work
for NVMf drivers.

Unless we make some changes in the block layer and/or remove this
condition.
Keith Busch Dec. 30, 2021, 3:30 p.m. UTC | #4
On Thu, Dec 30, 2021 at 04:38:57PM +0200, Max Gurtovoy wrote:
> 
> On 12/29/2021 10:57 PM, Keith Busch wrote:
> > On Wed, Dec 29, 2021 at 06:39:02PM +0100, Christoph Hellwig wrote:
> > > (except for the fact that it, just like the other rq_list helpers
> > > really should go into blk-mq.h, where all request related bits moved
> > > early in this cycle)
> > Agreed, I just put it here because it's where the other macros live. But
> > 'struct request' doesn't exist in this header, so it does seem out of
> > place.  For v3, I'll add a preceding patch to move them all to blk-mq.h.
> 
> Did you see the discussion I had with Jens ?

I did, yes. That's when I noticed the error handling wasn't right, and
doing it correctly was a bit clunky. This series was supposed to make it
easier for drivers to use the new interface.
 
> Seems it works only for non-shared tagsets and it means that it will work
> only for NVMe devices that have 1 namespace and will not work for NVMf
> drivers.

I am aware shared tags don't get to use the batched queueing, but that's
orthogonal to this series.

Most PCIe NVMe SSDs only have 1 namespace, so it generally works out for
those. 

> Unless, we'll do some changes in the block layer and/or remove this
> condition.

I think it just may work if we export blk_mq_get_driver_tag().
Max Gurtovoy Jan. 3, 2022, 3:23 p.m. UTC | #5
On 12/30/2021 5:30 PM, Keith Busch wrote:
> On Thu, Dec 30, 2021 at 04:38:57PM +0200, Max Gurtovoy wrote:
>> On 12/29/2021 10:57 PM, Keith Busch wrote:
>>> On Wed, Dec 29, 2021 at 06:39:02PM +0100, Christoph Hellwig wrote:
>>>> (except for the fact that it, just like the other rq_list helpers
>>>> really should go into blk-mq.h, where all request related bits moved
>>>> early in this cycle)
>>> Agreed, I just put it here because it's where the other macros live. But
>>> 'struct request' doesn't exist in this header, so it does seem out of
>>> place.  For v3, I'll add a preceding patch to move them all to blk-mq.h.
>> Did you see the discussion I had with Jens ?
> I did, yes. That's when I noticed the error handling wasn't right, and
> doing it correctly was a bit clunky. This series was supposed to make it
> easier for drivers to use the new interface.
>   
>> Seems it works only for non-shared tagsets and it means that it will work
>> only for NVMe devices that have 1 namespace and will not work for NVMf
>> drivers.
> I am aware shared tags don't get to use the batched queueing, but that's
> orthogonal to this series.

Yes. This series should probably be squashed into Jens's series and merged together.

> Most PCIe NVMe SSDs only have 1 namespace, so it generally works out for
> those.

There are SSDs that support NS management and also devices that support
multiple namespaces (NVIDIA's NVMe SNAP device supports many NSs).

Also, all the fabrics controllers support it.

I don't think we should keep capable controllers from using the
batching feature from day 1.

>
>> Unless, we'll do some changes in the block layer and/or remove this
>> condition.
> I think it just may work if we export blk_mq_get_driver_tag().

do you have a suggestion for the NVMe/PCI driver ?

I can run it in my lab if you don't have an SSD with more than 1 NS.
Keith Busch Jan. 3, 2022, 6:15 p.m. UTC | #6
On Mon, Jan 03, 2022 at 05:23:08PM +0200, Max Gurtovoy wrote:
> On 12/30/2021 5:30 PM, Keith Busch wrote:
> > I think it just may work if we export blk_mq_get_driver_tag().
> 
> do you have a suggestion for the NVMe/PCI driver ?

The following tests fine with my multi-namespace setups. I have real
hardware with namespace management capabilities, but qemu can also
easily emulate it for anyone who doesn't have such a device.

---
diff --git a/block/blk-mq-tag.h b/block/blk-mq-tag.h
index 5668e28be0b7..84f2e73d0c7c 100644
--- a/block/blk-mq-tag.h
+++ b/block/blk-mq-tag.h
@@ -41,12 +41,6 @@ static inline struct sbq_wait_state *bt_wait_ptr(struct sbitmap_queue *bt,
 	return sbq_wait_ptr(bt, &hctx->wait_index);
 }
 
-enum {
-	BLK_MQ_NO_TAG		= -1U,
-	BLK_MQ_TAG_MIN		= 1,
-	BLK_MQ_TAG_MAX		= BLK_MQ_NO_TAG - 1,
-};
-
 extern bool __blk_mq_tag_busy(struct blk_mq_hw_ctx *);
 extern void __blk_mq_tag_idle(struct blk_mq_hw_ctx *);
 
diff --git a/block/blk-mq.c b/block/blk-mq.c
index 0d7c9d3e0329..b4540723077a 100644
--- a/block/blk-mq.c
+++ b/block/blk-mq.c
@@ -1589,6 +1589,7 @@ bool __blk_mq_get_driver_tag(struct blk_mq_hw_ctx *hctx, struct request *rq)
 	hctx->tags->rqs[rq->tag] = rq;
 	return true;
 }
+EXPORT_SYMBOL_GPL(__blk_mq_get_driver_tag);
 
 static int blk_mq_dispatch_wake(wait_queue_entry_t *wait, unsigned mode,
 				int flags, void *key)
@@ -2582,11 +2583,10 @@ void blk_mq_flush_plug_list(struct blk_plug *plug, bool from_schedule)
 		 * same queue, caller must ensure that's the case.
 		 *
 		 * Since we pass off the full list to the driver at this point,
-		 * we do not increment the active request count for the queue.
-		 * Bypass shared tags for now because of that.
+		 * we are counting on the driver to increment the active
+		 * request count for the queue.
 		 */
-		if (q->mq_ops->queue_rqs &&
-		    !(rq->mq_hctx->flags & BLK_MQ_F_TAG_QUEUE_SHARED)) {
+		if (q->mq_ops->queue_rqs) {
 			blk_mq_run_dispatch_ops(q,
 				__blk_mq_flush_plug_list(q, plug));
 			if (rq_list_empty(plug->mq_list))
diff --git a/block/blk-mq.h b/block/blk-mq.h
index 948791ea2a3e..0f37ae906901 100644
--- a/block/blk-mq.h
+++ b/block/blk-mq.h
@@ -268,21 +268,6 @@ static inline void blk_mq_put_driver_tag(struct request *rq)
 	__blk_mq_put_driver_tag(rq->mq_hctx, rq);
 }
 
-bool __blk_mq_get_driver_tag(struct blk_mq_hw_ctx *hctx, struct request *rq);
-
-static inline bool blk_mq_get_driver_tag(struct request *rq)
-{
-	struct blk_mq_hw_ctx *hctx = rq->mq_hctx;
-
-	if (rq->tag != BLK_MQ_NO_TAG &&
-	    !(hctx->flags & BLK_MQ_F_TAG_QUEUE_SHARED)) {
-		hctx->tags->rqs[rq->tag] = rq;
-		return true;
-	}
-
-	return __blk_mq_get_driver_tag(hctx, rq);
-}
-
 static inline void blk_mq_clear_mq_map(struct blk_mq_queue_map *qmap)
 {
 	int cpu;
diff --git a/drivers/nvme/host/pci.c b/drivers/nvme/host/pci.c
index 50deb8b69c40..f50483475c12 100644
--- a/drivers/nvme/host/pci.c
+++ b/drivers/nvme/host/pci.c
@@ -992,8 +992,9 @@ static bool nvme_prep_rq_batch(struct nvme_queue *nvmeq, struct request *req)
 		return false;
 	if (unlikely(!nvme_check_ready(&nvmeq->dev->ctrl, req, true)))
 		return false;
+	if (!blk_mq_get_driver_tag(req))
+		return false;
 
-	req->mq_hctx->tags->rqs[req->tag] = req;
 	return nvme_prep_rq(nvmeq->dev, req) == BLK_STS_OK;
 }
 
diff --git a/include/linux/blk-mq.h b/include/linux/blk-mq.h
index 550996cf419c..8fb544a35330 100644
--- a/include/linux/blk-mq.h
+++ b/include/linux/blk-mq.h
@@ -1072,6 +1072,27 @@ static inline int blk_rq_map_sg(struct request_queue *q, struct request *rq,
 }
 void blk_dump_rq_flags(struct request *, char *);
 
+enum {
+	BLK_MQ_NO_TAG		= -1U,
+	BLK_MQ_TAG_MIN		= 1,
+	BLK_MQ_TAG_MAX		= BLK_MQ_NO_TAG - 1,
+};
+
+bool __blk_mq_get_driver_tag(struct blk_mq_hw_ctx *hctx, struct request *rq);
+
+static inline bool blk_mq_get_driver_tag(struct request *rq)
+{
+	struct blk_mq_hw_ctx *hctx = rq->mq_hctx;
+
+	if (rq->tag != BLK_MQ_NO_TAG &&
+	    !(hctx->flags & BLK_MQ_F_TAG_QUEUE_SHARED)) {
+		hctx->tags->rqs[rq->tag] = rq;
+		return true;
+	}
+
+	return __blk_mq_get_driver_tag(hctx, rq);
+}
+
 #ifdef CONFIG_BLK_DEV_ZONED
 static inline unsigned int blk_rq_zone_no(struct request *rq)
 {
--
Max Gurtovoy Jan. 4, 2022, 12:15 p.m. UTC | #7
On 1/3/2022 8:15 PM, Keith Busch wrote:
> On Mon, Jan 03, 2022 at 05:23:08PM +0200, Max Gurtovoy wrote:
>> On 12/30/2021 5:30 PM, Keith Busch wrote:
>>> I think it just may work if we export blk_mq_get_driver_tag().
>> do you have a suggestion for the NVMe/PCI driver ?
> The following tests fine with my multi-namespace setups. I have real
> hardware with namespace management capabilities, but qemu can also
> easily emulate it too for anyone who doesn't have one.
>
> ---
> diff --git a/block/blk-mq-tag.h b/block/blk-mq-tag.h
> index 5668e28be0b7..84f2e73d0c7c 100644
> --- a/block/blk-mq-tag.h
> +++ b/block/blk-mq-tag.h
> @@ -41,12 +41,6 @@ static inline struct sbq_wait_state *bt_wait_ptr(struct sbitmap_queue *bt,
>   	return sbq_wait_ptr(bt, &hctx->wait_index);
>   }
>   
> -enum {
> -	BLK_MQ_NO_TAG		= -1U,
> -	BLK_MQ_TAG_MIN		= 1,
> -	BLK_MQ_TAG_MAX		= BLK_MQ_NO_TAG - 1,
> -};
> -
>   extern bool __blk_mq_tag_busy(struct blk_mq_hw_ctx *);
>   extern void __blk_mq_tag_idle(struct blk_mq_hw_ctx *);
>   
> diff --git a/block/blk-mq.c b/block/blk-mq.c
> index 0d7c9d3e0329..b4540723077a 100644
> --- a/block/blk-mq.c
> +++ b/block/blk-mq.c
> @@ -1589,6 +1589,7 @@ bool __blk_mq_get_driver_tag(struct blk_mq_hw_ctx *hctx, struct request *rq)
>   	hctx->tags->rqs[rq->tag] = rq;
>   	return true;
>   }
> +EXPORT_SYMBOL_GPL(__blk_mq_get_driver_tag);
>   
>   static int blk_mq_dispatch_wake(wait_queue_entry_t *wait, unsigned mode,
>   				int flags, void *key)
> @@ -2582,11 +2583,10 @@ void blk_mq_flush_plug_list(struct blk_plug *plug, bool from_schedule)
>   		 * same queue, caller must ensure that's the case.
>   		 *
>   		 * Since we pass off the full list to the driver at this point,
> -		 * we do not increment the active request count for the queue.
> -		 * Bypass shared tags for now because of that.
> +		 * we are counting on the driver to increment the active
> +		 * request count for the queue.
>   		 */
> -		if (q->mq_ops->queue_rqs &&
> -		    !(rq->mq_hctx->flags & BLK_MQ_F_TAG_QUEUE_SHARED)) {
> +		if (q->mq_ops->queue_rqs) {
>   			blk_mq_run_dispatch_ops(q,
>   				__blk_mq_flush_plug_list(q, plug));
>   			if (rq_list_empty(plug->mq_list))
> diff --git a/block/blk-mq.h b/block/blk-mq.h
> index 948791ea2a3e..0f37ae906901 100644
> --- a/block/blk-mq.h
> +++ b/block/blk-mq.h
> @@ -268,21 +268,6 @@ static inline void blk_mq_put_driver_tag(struct request *rq)
>   	__blk_mq_put_driver_tag(rq->mq_hctx, rq);
>   }
>   
> -bool __blk_mq_get_driver_tag(struct blk_mq_hw_ctx *hctx, struct request *rq);
> -
> -static inline bool blk_mq_get_driver_tag(struct request *rq)
> -{
> -	struct blk_mq_hw_ctx *hctx = rq->mq_hctx;
> -
> -	if (rq->tag != BLK_MQ_NO_TAG &&
> -	    !(hctx->flags & BLK_MQ_F_TAG_QUEUE_SHARED)) {
> -		hctx->tags->rqs[rq->tag] = rq;
> -		return true;
> -	}
> -
> -	return __blk_mq_get_driver_tag(hctx, rq);
> -}
> -
>   static inline void blk_mq_clear_mq_map(struct blk_mq_queue_map *qmap)
>   {
>   	int cpu;
> diff --git a/drivers/nvme/host/pci.c b/drivers/nvme/host/pci.c
> index 50deb8b69c40..f50483475c12 100644
> --- a/drivers/nvme/host/pci.c
> +++ b/drivers/nvme/host/pci.c
> @@ -992,8 +992,9 @@ static bool nvme_prep_rq_batch(struct nvme_queue *nvmeq, struct request *req)
>   		return false;
>   	if (unlikely(!nvme_check_ready(&nvmeq->dev->ctrl, req, true)))
>   		return false;
> +	if (!blk_mq_get_driver_tag(req))
> +		return false;
>   
> -	req->mq_hctx->tags->rqs[req->tag] = req;
>   	return nvme_prep_rq(nvmeq->dev, req) == BLK_STS_OK;
>   }
>   
> diff --git a/include/linux/blk-mq.h b/include/linux/blk-mq.h
> index 550996cf419c..8fb544a35330 100644
> --- a/include/linux/blk-mq.h
> +++ b/include/linux/blk-mq.h
> @@ -1072,6 +1072,27 @@ static inline int blk_rq_map_sg(struct request_queue *q, struct request *rq,
>   }
>   void blk_dump_rq_flags(struct request *, char *);
>   
> +enum {
> +	BLK_MQ_NO_TAG		= -1U,
> +	BLK_MQ_TAG_MIN		= 1,
> +	BLK_MQ_TAG_MAX		= BLK_MQ_NO_TAG - 1,
> +};
> +
> +bool __blk_mq_get_driver_tag(struct blk_mq_hw_ctx *hctx, struct request *rq);
> +
> +static inline bool blk_mq_get_driver_tag(struct request *rq)
> +{
> +	struct blk_mq_hw_ctx *hctx = rq->mq_hctx;
> +
> +	if (rq->tag != BLK_MQ_NO_TAG &&
> +	    !(hctx->flags & BLK_MQ_F_TAG_QUEUE_SHARED)) {
> +		hctx->tags->rqs[rq->tag] = rq;
> +		return true;
> +	}
> +
> +	return __blk_mq_get_driver_tag(hctx, rq);
> +}
> +
>   #ifdef CONFIG_BLK_DEV_ZONED
>   static inline unsigned int blk_rq_zone_no(struct request *rq)
>   {
> --

This patch worked for me with 2 namespaces for NVMe PCI.

I'll check it later on with my RDMA queue_rqs patches as well. There we
also have tagset sharing with the connect_q (and not only with
multiple namespaces).

But the connect_q uses reserved tags only (for the connect commands).

I saw some strange things that I couldn't understand:

1. running randread fio with libaio ioengine didn't call nvme_queue_rqs 
- expected

2. running randwrite fio with libaio ioengine did call nvme_queue_rqs - 
Not expected !!

3. running randread fio with io_uring ioengine (and --iodepth_batch=32) 
didn't call nvme_queue_rqs - Not expected !!

4. running randwrite fio with io_uring ioengine (and --iodepth_batch=32) 
did call nvme_queue_rqs - expected

5. running randread fio with io_uring ioengine (and --iodepth_batch=32 
--runtime=30) didn't finish after 30 seconds and was stuck for 300 
seconds (the fio jobs required "kill -9 fio" to remove refcounts from 
nvme_core) - Not expected !!

debug print: fio: job 'task_nvme0n1' (state=5) hasn't exited in 300 
seconds, it appears to be stuck. Doing forceful exit of this job.

6. running randwrite fio with io_uring ioengine (and --iodepth_batch=32 
--runtime=30) didn't finish after 30 seconds and was stuck for 300 
seconds (the fio jobs required "kill -9 fio" to remove refcounts from 
nvme_core) - Not expected !!

debug print: fio: job 'task_nvme0n1' (state=5) hasn't exited in 300 
seconds, it appears to be stuck. Doing forceful exit of this job.

Any idea what could cause these unexpected scenarios? At least 
unexpected for me :)
Keith Busch Jan. 5, 2022, 5:26 p.m. UTC | #8
On Tue, Jan 04, 2022 at 02:15:58PM +0200, Max Gurtovoy wrote:
> 
> This patch worked for me with 2 namespaces for NVMe PCI.
> 
> I'll check it later on with my RDMA queue_rqs patches as well. There we have
> also a tagset sharing with the connect_q (and not only with multiple
> namespaces).
> 
> But the connect_q is using a reserved tags only (for the connect commands).
> 
> I saw some strange things that I couldn't understand:
> 
> 1. running randread fio with libaio ioengine didn't call nvme_queue_rqs -
> expected
> 
> 2. running randwrite fio with libaio ioengine did call nvme_queue_rqs - Not
> expected !!
> 
> 3. running randread fio with io_uring ioengine (and --iodepth_batch=32)
> didn't call nvme_queue_rqs - Not expected !!
> 
> 4. running randwrite fio with io_uring ioengine (and --iodepth_batch=32) did
> call nvme_queue_rqs - expected
> 
> 5. running randread fio with io_uring ioengine (and --iodepth_batch=32
> --runtime=30) didn't finish after 30 seconds and stuck for 300 seconds (fio
> jobs required "kill -9 fio" to remove refcounts from nvme_core) - Not
> expected !!
> 
> debug print: fio: job 'task_nvme0n1' (state=5) hasn't exited in 300
> seconds, it appears to be stuck. Doing forceful exit of this job.
> 
> 6. running randwrite fio with io_uring ioengine (and --iodepth_batch=32
> --runtime=30) didn't finish after 30 seconds and stuck for 300 seconds (fio
> jobs required "kill -9 fio" to remove refcounts from nvme_core) - Not
> expected !!
> 
> debug print: fio: job 'task_nvme0n1' (state=5) hasn't exited in 300
> seconds, it appears to be stuck. Doing forceful exit of this job.
> 
> 
> any idea what could cause these unexpected scenarios ? at least unexpected
> for me :)

Not sure about all the scenarios. I believe it should call queue_rqs
anytime we finish a plugged list of requests as long as the requests
come from the same request_queue, and it's not being flushed from
io_schedule().

The stuck fio job might be a lost request, which is what this series
should address. It would be unusual to see such an error happen in
normal operation, though. I had to synthesize errors to verify the bug
and fix.

In any case, I'll run more multi-namespace tests to see if I can find
any other issues with shared tags.
Max Gurtovoy Jan. 6, 2022, 11:54 a.m. UTC | #9
On 1/5/2022 7:26 PM, Keith Busch wrote:
> On Tue, Jan 04, 2022 at 02:15:58PM +0200, Max Gurtovoy wrote:
>> This patch worked for me with 2 namespaces for NVMe PCI.
>>
>> I'll check it later on with my RDMA queue_rqs patches as well. There we have
>> also a tagset sharing with the connect_q (and not only with multiple
>> namespaces).
>>
>> But the connect_q is using a reserved tags only (for the connect commands).
>>
>> I saw some strange things that I couldn't understand:
>>
>> 1. running randread fio with libaio ioengine didn't call nvme_queue_rqs -
>> expected
>>
>> 2. running randwrite fio with libaio ioengine did call nvme_queue_rqs - Not
>> expected !!
>>
>> 3. running randread fio with io_uring ioengine (and --iodepth_batch=32)
>> didn't call nvme_queue_rqs - Not expected !!
>>
>> 4. running randwrite fio with io_uring ioengine (and --iodepth_batch=32) did
>> call nvme_queue_rqs - expected
>>
>> 5. running randread fio with io_uring ioengine (and --iodepth_batch=32
>> --runtime=30) didn't finish after 30 seconds and stuck for 300 seconds (fio
>> jobs required "kill -9 fio" to remove refcounts from nvme_core) - Not
>> expected !!
>>
>> debug print: fio: job 'task_nvme0n1' (state=5) hasn't exited in 300
>> seconds, it appears to be stuck. Doing forceful exit of this job.
>>
>> 6. running randwrite fio with io_uring ioengine (and --iodepth_batch=32
>> --runtime=30) didn't finish after 30 seconds and stuck for 300 seconds (fio
>> jobs required "kill -9 fio" to remove refcounts from nvme_core) - Not
>> expected !!
>>
>> debug print: fio: job 'task_nvme0n1' (state=5) hasn't exited in 300
>> seconds, it appears to be stuck. Doing forceful exit of this job.
>>
>>
>> any idea what could cause these unexpected scenarios ? at least unexpected
>> for me :)
> Not sure about all the scenarios. I believe it should call queue_rqs
> anytime we finish a plugged list of requests as long as the requests
> come from the same request_queue, and it's not being flushed from
> io_schedule().

I also see we have batch > 1 only at the start of the fio run.
After X IO operations the batch size == 1 until the end of the run.

>
> The stuck fio job might be a lost request, which is what this series
> should address. It would be unusual to see such an error happen in
> normal operation, though. I had to synthesize errors to verify the bug
> and fix.

But there are no timeout errors or prints in dmesg.

If there were timeout prints, I would expect the issue to be in the
local NVMe device, but there aren't any.

Also, this phenomenon doesn't happen with the NVMf/RDMA code I developed locally.

>
> In any case, I'll run more multi-namespace tests to see if I can find
> any other issues with shared tags.

I believe the above concerns are not related to shared tags but
to the entire mechanism.
Jens Axboe Jan. 6, 2022, 1:41 p.m. UTC | #10
On 1/6/22 3:54 AM, Max Gurtovoy wrote:
> 
> On 1/5/2022 7:26 PM, Keith Busch wrote:
>> On Tue, Jan 04, 2022 at 02:15:58PM +0200, Max Gurtovoy wrote:
>>> This patch worked for me with 2 namespaces for NVMe PCI.
>>>
>>> I'll check it later on with my RDMA queue_rqs patches as well. There we have
>>> also a tagset sharing with the connect_q (and not only with multiple
>>> namespaces).
>>>
>>> But the connect_q is using a reserved tags only (for the connect commands).
>>>
>>> I saw some strange things that I couldn't understand:
>>>
>>> 1. running randread fio with libaio ioengine didn't call nvme_queue_rqs -
>>> expected
>>>
>>> 2. running randwrite fio with libaio ioengine did call nvme_queue_rqs - Not
>>> expected !!
>>>
>>> 3. running randread fio with io_uring ioengine (and --iodepth_batch=32)
>>> didn't call nvme_queue_rqs - Not expected !!
>>>
>>> 4. running randwrite fio with io_uring ioengine (and --iodepth_batch=32) did
>>> call nvme_queue_rqs - expected
>>>
>>> 5. running randread fio with io_uring ioengine (and --iodepth_batch=32
>>> --runtime=30) didn't finish after 30 seconds and stuck for 300 seconds (fio
>>> jobs required "kill -9 fio" to remove refcounts from nvme_core) - Not
>>> expected !!
>>>
>>> debug print: fio: job 'task_nvme0n1' (state=5) hasn't exited in 300
>>> seconds, it appears to be stuck. Doing forceful exit of this job.
>>>
>>> 6. running randwrite fio with io_uring ioengine (and --iodepth_batch=32
>>> --runtime=30) didn't finish after 30 seconds and stuck for 300 seconds (fio
>>> jobs required "kill -9 fio" to remove refcounts from nvme_core) - Not
>>> expected !!
>>>
>>> debug print: fio: job 'task_nvme0n1' (state=5) hasn't exited in 300
>>> seconds, it appears to be stuck. Doing forceful exit of this job.
>>>
>>>
>>> any idea what could cause these unexpected scenarios ? at least unexpected
>>> for me :)
>> Not sure about all the scenarios. I believe it should call queue_rqs
>> anytime we finish a plugged list of requests as long as the requests
>> come from the same request_queue, and it's not being flushed from
>> io_schedule().
> 
> I also see we have batch > 1 only in the start of the fio operation. 
> After X IO operations batch size == 1 till the end of the fio.

There are two settings for the completion batch; you're likely not setting
them? That in turn will prevent the submit side from submitting more
than 1, as that's all that's left.

>> The stuck fio job might be a lost request, which is what this series
>> should address. It would be unusual to see such an error happen in
>> normal operation, though. I had to synthesize errors to verify the bug
>> and fix.
> 
> But there is no timeout error and prints in the dmesg.
> 
> If there was a timeout prints I would expect the issue might be in the
> local NVMe device, but there isn't.
> 
> Also this phenomena doesn't happen with NVMf/RDMA code I developed
> locally.

There would only be a timeout if it wasn't lost. Keith's patches fixed a
case where it was simply dropped from the list. As it was never started,
it won't get timed out.

Patch

diff --git a/include/linux/blkdev.h b/include/linux/blkdev.h
index 22746b2d6825..c4597ccdaf26 100644
--- a/include/linux/blkdev.h
+++ b/include/linux/blkdev.h
@@ -1365,6 +1365,10 @@  struct io_comp_batch {
 #define rq_list_for_each(listptr, pos)			\
 	for (pos = rq_list_peek((listptr)); pos; pos = rq_list_next(pos))
 
+#define rq_list_for_each_safe(listptr, pos, nxt)			\
+	for (pos = rq_list_peek((listptr)), nxt = rq_list_next(pos);	\
+		pos; pos = nxt, nxt = pos ? rq_list_next(pos) : NULL)
+
 #define rq_list_next(rq)	(rq)->rq_next
 #define rq_list_empty(list)	((list) == (struct request *) NULL)
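
Usage sketch (illustrative)

To show how the new iterator is meant to be used, here is a minimal sketch,
not part of this series: a hypothetical my_queue_rqs() walks a plugged list
and unlinks any request that fails a hypothetical my_prep_rq() step, moving
it onto a separate requeue list. rq_list_add() and the rq_next field are
existing blkdev.h helpers; the sketch assumes the list is non-empty, as a
driver's queue_rqs() callers provide.

/* Hypothetical per-request preparation step; returns false on failure. */
static bool my_prep_rq(struct request *req);

static void my_queue_rqs(struct request **rqlist)
{
	struct request *req, *next, *prev = NULL;
	struct request *requeue_list = NULL;

	rq_list_for_each_safe(rqlist, req, next) {
		if (my_prep_rq(req)) {
			prev = req;
			continue;
		}

		/* Unlink 'req' from the plugged list... */
		if (prev)
			prev->rq_next = next;
		else
			*rqlist = next;

		/*
		 * ...and park it for later requeue.  The iterator already
		 * cached 'next', so the walk continues safely.
		 */
		rq_list_add(&requeue_list, req);
	}

	/* Issue what remains on *rqlist and requeue requeue_list (not shown). */
}

This removal-during-iteration is exactly what the plain rq_list_for_each
cannot do safely, since advancing from a request that has just been moved
to another list would follow the wrong rq_next pointer.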