diff mbox

Device or HBA level QD throttling creates randomness in sequential workload

Message ID e1e827ba633f780b00d070e087204d5c@mail.gmail.com (mailing list archive)
State New, archived
Headers show

Commit Message

Kashyap Desai Jan. 30, 2017, 1:52 p.m. UTC
Hi Jens/Omar,

I used the git.kernel.dk/linux-block branch blk-mq-sched (commit
0efe27068ecf37ece2728a99b863763286049ab5) and can confirm that the issue
reported in this thread is resolved.

Now I see that both MQ and SQ mode result in a sequential IO pattern
while IO is being re-queued in the block layer.

To get similar performance without the blk-mq-sched feature, is it
reasonable to pause IO for a few usec in the LLD?
I mean, I want to avoid the driver asking the SML/block layer to re-queue
the IO (if it is sequential on rotational media).

Explaining w.r.t. the megaraid_sas driver: the driver exposes can_queue,
but it internally consumes commands for the RAID 1 fast path.
In the worst case, can_queue/2 will consume all firmware resources and the
driver will re-queue further IOs to the SML as below -

   if (atomic_inc_return(&instance->fw_outstanding) >
           instance->host->can_queue) {
       atomic_dec(&instance->fw_outstanding);
       return SCSI_MLQUEUE_HOST_BUSY;
   }
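In userspace terms, the throttle above amounts to the following C11 sketch
(names and the queue depth are illustrative, not the driver's API): bump an
outstanding-command counter per IO; if it would exceed the advertised depth,
undo the bump and report host-busy, which makes the midlayer requeue the IO,
possibly out of order.

```c
#include <stdatomic.h>

enum { QUEUED = 0, HOST_BUSY = 1 };

atomic_int fw_outstanding;          /* commands currently held by firmware */
int can_queue = 4;                  /* stand-in for host->can_queue */

/* Mirrors atomic_inc_return() followed by the depth check. */
int queue_one_cmd(void)
{
    if (atomic_fetch_add(&fw_outstanding, 1) + 1 > can_queue) {
        atomic_fetch_sub(&fw_outstanding, 1);
        return HOST_BUSY;           /* SCSI_MLQUEUE_HOST_BUSY analogue */
    }
    return QUEUED;
}

/* Completion path: release the slot. */
void complete_one_cmd(void)
{
    atomic_fetch_sub(&fw_outstanding, 1);
}
```

Once the counter saturates, every further submission bounces back to the
midlayer until a completion frees a slot, which is exactly the requeue storm
described above.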

I want to avoid the above SCSI_MLQUEUE_HOST_BUSY.

Need your suggestion on the changes below -


@@ -2584,11 +2593,15 @@ void megasas_prepare_secondRaid1_IO(struct megasas_instance *instance,
        return SCSI_MLQUEUE_DEVICE_BUSY;
    }

-   if (atomic_inc_return(&instance->fw_outstanding) >
-           instance->host->can_queue) {
-       atomic_dec(&instance->fw_outstanding);
-       return SCSI_MLQUEUE_HOST_BUSY;
-   }
+   if (atomic_inc_return(&instance->fw_outstanding) > safe_can_queue) {
+       is_nonrot = blk_queue_nonrot(scmd->device->request_queue);
+       /* For rotational device wait for sometime to get fusion command
+        * from pool. This is just to reduce proactive re-queue at mid
+        * layer which is not sending sorted IO in SCSI.MQ mode.
+        */
+       if (!is_nonrot)
+           udelay(100);
+   }

    cmd = megasas_get_cmd_fusion(instance, scmd->request->tag);

Kashyap

> -----Original Message-----
> From: Kashyap Desai [mailto:kashyap.desai@broadcom.com]
> Sent: Tuesday, November 01, 2016 11:11 AM
> To: 'Jens Axboe'; 'Omar Sandoval'
> Cc: 'linux-scsi@vger.kernel.org'; 'linux-kernel@vger.kernel.org'; 'linux-
> block@vger.kernel.org'; 'Christoph Hellwig'; 'paolo.valente@linaro.org'
> Subject: RE: Device or HBA level QD throttling creates randomness in
> sequential workload
>
> Jens- Replied inline.
>
>
> Omar - I tested your WIP repo and figured out that the system hangs only
> if I pass "scsi_mod.use_blk_mq=Y". Without this, your WIP branch works
> fine, but I am looking for scsi_mod.use_blk_mq=Y.
>
> Also below is a snippet of blktrace. In case of higher per-device QD, I
> see Requeue requests in blktrace.
>
> 65,128 10     6268     2.432404509 18594  P   N [fio]
>  65,128 10     6269     2.432405013 18594  U   N [fio] 1
>  65,128 10     6270     2.432405143 18594  I  WS 148800 + 8 [fio]
>  65,128 10     6271     2.432405740 18594  R  WS 148800 + 8 [0]
>  65,128 10     6272     2.432409794 18594  Q  WS 148808 + 8 [fio]
>  65,128 10     6273     2.432410234 18594  G  WS 148808 + 8 [fio]
>  65,128 10     6274     2.432410424 18594  S  WS 148808 + 8 [fio]
>  65,128 23     3626     2.432432595 16232  D  WS 148800 + 8
> [kworker/23:1H]
>  65,128 22     3279     2.432973482     0  C  WS 147432 + 8 [0]
>  65,128  7     6126     2.433032637 18594  P   N [fio]
>  65,128  7     6127     2.433033204 18594  U   N [fio] 1
>  65,128  7     6128     2.433033346 18594  I  WS 148808 + 8 [fio]
>  65,128  7     6129     2.433033871 18594  D  WS 148808 + 8 [fio]
>  65,128  7     6130     2.433034559 18594  R  WS 148808 + 8 [0]
>  65,128  7     6131     2.433039796 18594  Q  WS 148816 + 8 [fio]
>  65,128  7     6132     2.433040206 18594  G  WS 148816 + 8 [fio]
>  65,128  7     6133     2.433040351 18594  S  WS 148816 + 8 [fio]
>  65,128  9     6392     2.433133729     0  C  WS 147240 + 8 [0]
>  65,128  9     6393     2.433138166   905  D  WS 148808 + 8 [kworker/9:1H]
>  65,128  7     6134     2.433167450 18594  P   N [fio]
>  65,128  7     6135     2.433167911 18594  U   N [fio] 1
>  65,128  7     6136     2.433168074 18594  I  WS 148816 + 8 [fio]
>  65,128  7     6137     2.433168492 18594  D  WS 148816 + 8 [fio]
>  65,128  7     6138     2.433174016 18594  Q  WS 148824 + 8 [fio]
>  65,128  7     6139     2.433174282 18594  G  WS 148824 + 8 [fio]
>  65,128  7     6140     2.433174613 18594  S  WS 148824 + 8 [fio]
> CPU0 (sdy):
>  Reads Queued:           0,        0KiB  Writes Queued:          79,      316KiB
>  Read Dispatches:        0,        0KiB  Write Dispatches:       67, 18,446,744,073PiB
>  Reads Requeued:         0               Writes Requeued:        86
>  Reads Completed:        0,        0KiB  Writes Completed:       98,      392KiB
>  Read Merges:            0,        0KiB  Write Merges:            0,        0KiB
>  Read depth:             0               Write depth:             5
>  IO unplugs:            79               Timer unplugs:           0
>
>
>
> Kashyap
>
> > -----Original Message-----
> > From: Jens Axboe [mailto:axboe@kernel.dk]
> > Sent: Monday, October 31, 2016 10:54 PM
> > To: Kashyap Desai; Omar Sandoval
> > Cc: linux-scsi@vger.kernel.org; linux-kernel@vger.kernel.org; linux-
> > block@vger.kernel.org; Christoph Hellwig; paolo.valente@linaro.org
> > Subject: Re: Device or HBA level QD throttling creates randomness in
> > sequential workload
> >
> > Hi,
> >
> > One guess would be that this isn't around a requeue condition, but
> > rather the fact that we don't really guarantee any sort of hard FIFO
> > behavior between the software queues. Can you try this test patch to
> > see if it changes the behavior for you? Warning: untested...
>
> Jens - I tested the patch, but I still see a random IO pattern for an
> expected sequential run. I am intentionally exercising the re-queue case
> and seeing the issue at the time of re-queue.
> If there is no requeue, I see no issue at the LLD.
>
>
> >
> > diff --git a/block/blk-mq.c b/block/blk-mq.c
> > index f3d27a6dee09..5404ca9c71b2 100644
> > --- a/block/blk-mq.c
> > +++ b/block/blk-mq.c
> > @@ -772,6 +772,14 @@ static inline unsigned int queued_to_index(unsigned int queued)
> >   	return min(BLK_MQ_MAX_DISPATCH_ORDER - 1, ilog2(queued) + 1);
> >   }
> >
> > +static int rq_pos_cmp(void *priv, struct list_head *a, struct list_head *b)
> > +{
> > +	struct request *rqa = container_of(a, struct request, queuelist);
> > +	struct request *rqb = container_of(b, struct request, queuelist);
> > +
> > +	return blk_rq_pos(rqa) < blk_rq_pos(rqb);
> > +}
> > +
> >   /*
> >    * Run this hardware queue, pulling any software queues mapped to it in.
> >    * Note that this function currently has various problems around ordering
> > @@ -812,6 +820,14 @@ static void __blk_mq_run_hw_queue(struct blk_mq_hw_ctx *hctx)
> >   	}
> >
> >   	/*
> > +	 * If the device is rotational, sort the list sanely to avoid
> > +	 * unnecessary seeks. The software queues are roughly FIFO, but
> > +	 * only roughly, there are no hard guarantees.
> > +	 */
> > +	if (!blk_queue_nonrot(q))
> > +		list_sort(NULL, &rq_list, rq_pos_cmp);
> > +
> > +	/*
> >   	 * Start off with dptr being NULL, so we start the first request
> >   	 * immediately, even if we have more pending.
> >   	 */
> >
> > --
> > Jens Axboe
--
To unsubscribe from this list: send the line "unsubscribe linux-block" in
the body of a message to majordomo@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html
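Jens's test patch in the quoted mail above sorts the pulled software-queue
request list by starting sector before dispatch. A userspace sketch of the
same idea, using qsort() over an array instead of the kernel's list_sort()
(struct fake_rq and friends are illustrative names, not kernel types):

```c
#include <stdlib.h>

/* Stand-in for struct request; pos plays the role of blk_rq_pos(). */
struct fake_rq {
    unsigned long long pos;
};

/* Compare two requests by starting sector, ascending. */
int rq_pos_cmp(const void *a, const void *b)
{
    const struct fake_rq *ra = a;
    const struct fake_rq *rb = b;
    return (ra->pos > rb->pos) - (ra->pos < rb->pos);
}

/* Sort a pulled batch of requests so a rotational disk sees
 * mostly-sequential IO even after requeues shuffled the order. */
void sort_rq_list(struct fake_rq *rqs, size_t n)
{
    qsort(rqs, n, sizeof(*rqs), rq_pos_cmp);
}
```

Note that qsort() expects a three-way comparator, hence the
`(a > b) - (a < b)` form here rather than the boolean comparison used in the
kernel patch.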

Comments

Bart Van Assche Jan. 30, 2017, 4:30 p.m. UTC | #1
On Mon, 2017-01-30 at 19:22 +0530, Kashyap Desai wrote:
> -   if (atomic_inc_return(&instance->fw_outstanding) >
> -           instance->host->can_queue) {
> -       atomic_dec(&instance->fw_outstanding);
> -       return SCSI_MLQUEUE_HOST_BUSY;
> -   }
> +   if (atomic_inc_return(&instance->fw_outstanding) > safe_can_queue) {
> +       is_nonrot = blk_queue_nonrot(scmd->device->request_queue);
> +       /* For rotational device wait for sometime to get fusion command
> from pool.
> +        * This is just to reduce proactive re-queue at mid layer which is
> not
> +        * sending sorted IO in SCSI.MQ mode.
> +        */
> +       if (!is_nonrot)
> +           udelay(100);
> +   }

The SCSI core does not allow sleeping inside the queuecommand() callback
function.

Bart.
Jens Axboe Jan. 30, 2017, 4:32 p.m. UTC | #2
On 01/30/2017 09:30 AM, Bart Van Assche wrote:
> On Mon, 2017-01-30 at 19:22 +0530, Kashyap Desai wrote:
>> -   if (atomic_inc_return(&instance->fw_outstanding) >
>> -           instance->host->can_queue) {
>> -       atomic_dec(&instance->fw_outstanding);
>> -       return SCSI_MLQUEUE_HOST_BUSY;
>> -   }
>> +   if (atomic_inc_return(&instance->fw_outstanding) > safe_can_queue) {
>> +       is_nonrot = blk_queue_nonrot(scmd->device->request_queue);
>> +       /* For rotational device wait for sometime to get fusion command
>> from pool.
>> +        * This is just to reduce proactive re-queue at mid layer which is
>> not
>> +        * sending sorted IO in SCSI.MQ mode.
>> +        */
>> +       if (!is_nonrot)
>> +           udelay(100);
>> +   }
> 
> The SCSI core does not allow to sleep inside the queuecommand() callback
> function.

udelay() is a busy loop, so it's not sleeping. That said, it's obviously
NOT a great idea. We want to fix the reordering due to requeues, not
introduce random busy delays to work around it.
Kashyap Desai Jan. 30, 2017, 6:28 p.m. UTC | #3
> -----Original Message-----
> From: Jens Axboe [mailto:axboe@kernel.dk]
> Sent: Monday, January 30, 2017 10:03 PM
> To: Bart Van Assche; osandov@osandov.com; kashyap.desai@broadcom.com
> Cc: linux-scsi@vger.kernel.org; linux-kernel@vger.kernel.org;
> hch@infradead.org; linux-block@vger.kernel.org; paolo.valente@linaro.org
> Subject: Re: Device or HBA level QD throttling creates randomness in
> sequential workload
>
> On 01/30/2017 09:30 AM, Bart Van Assche wrote:
> > On Mon, 2017-01-30 at 19:22 +0530, Kashyap Desai wrote:
> >> -   if (atomic_inc_return(&instance->fw_outstanding) >
> >> -           instance->host->can_queue) {
> >> -       atomic_dec(&instance->fw_outstanding);
> >> -       return SCSI_MLQUEUE_HOST_BUSY;
> >> -   }
> >> +   if (atomic_inc_return(&instance->fw_outstanding) >
safe_can_queue) {
> >> +       is_nonrot = blk_queue_nonrot(scmd->device->request_queue);
> >> +       /* For rotational device wait for sometime to get fusion
> >> + command
> >> from pool.
> >> +        * This is just to reduce proactive re-queue at mid layer
> >> + which is
> >> not
> >> +        * sending sorted IO in SCSI.MQ mode.
> >> +        */
> >> +       if (!is_nonrot)
> >> +           udelay(100);
> >> +   }
> >
> > The SCSI core does not allow to sleep inside the queuecommand()
> > callback function.
>
> udelay() is a busy loop, so it's not sleeping. That said, it's obviously
NOT a
> great idea. We want to fix the reordering due to requeues, not introduce
> random busy delays to work around it.

Thanks for the feedback. I do realize that udelay() is going to be very odd
in the queuecommand callback. I will keep this in mind. The preferred
solution is the blk-mq scheduler patches.
>
> --
> Jens Axboe
Jens Axboe Jan. 30, 2017, 6:29 p.m. UTC | #4
On 01/30/2017 11:28 AM, Kashyap Desai wrote:
>> -----Original Message-----
>> From: Jens Axboe [mailto:axboe@kernel.dk]
>> Sent: Monday, January 30, 2017 10:03 PM
>> To: Bart Van Assche; osandov@osandov.com; kashyap.desai@broadcom.com
>> Cc: linux-scsi@vger.kernel.org; linux-kernel@vger.kernel.org;
>> hch@infradead.org; linux-block@vger.kernel.org; paolo.valente@linaro.org
>> Subject: Re: Device or HBA level QD throttling creates randomness in
>> sequential workload
>>
>> On 01/30/2017 09:30 AM, Bart Van Assche wrote:
>>> On Mon, 2017-01-30 at 19:22 +0530, Kashyap Desai wrote:
>>>> -   if (atomic_inc_return(&instance->fw_outstanding) >
>>>> -           instance->host->can_queue) {
>>>> -       atomic_dec(&instance->fw_outstanding);
>>>> -       return SCSI_MLQUEUE_HOST_BUSY;
>>>> -   }
>>>> +   if (atomic_inc_return(&instance->fw_outstanding) >
> safe_can_queue) {
>>>> +       is_nonrot = blk_queue_nonrot(scmd->device->request_queue);
>>>> +       /* For rotational device wait for sometime to get fusion
>>>> + command
>>>> from pool.
>>>> +        * This is just to reduce proactive re-queue at mid layer
>>>> + which is
>>>> not
>>>> +        * sending sorted IO in SCSI.MQ mode.
>>>> +        */
>>>> +       if (!is_nonrot)
>>>> +           udelay(100);
>>>> +   }
>>>
>>> The SCSI core does not allow to sleep inside the queuecommand()
>>> callback function.
>>
>> udelay() is a busy loop, so it's not sleeping. That said, it's obviously
> NOT a
>> great idea. We want to fix the reordering due to requeues, not introduce
>> random busy delays to work around it.
> 
> Thanks for feedback. I do realize that udelay() is going to be very odd
> in queue_command call back.   I will keep this note. Preferred solution is
> blk mq scheduler patches.

It's coming in 4.11, so you don't have to wait long.

Patch

diff --git a/drivers/scsi/megaraid/megaraid_sas_fusion.c b/drivers/scsi/megaraid/megaraid_sas_fusion.c
index 9a9c84f..a683eb0 100644
--- a/drivers/scsi/megaraid/megaraid_sas_fusion.c
+++ b/drivers/scsi/megaraid/megaraid_sas_fusion.c
@@ -54,6 +54,7 @@ 
 #include <scsi/scsi_host.h>
 #include <scsi/scsi_dbg.h>
 #include <linux/dmi.h>
+#include <linux/cpumask.h>

 #include "megaraid_sas_fusion.h"
 #include "megaraid_sas.h"
@@ -2572,7 +2573,15 @@ void megasas_prepare_secondRaid1_IO(struct megasas_instance *instance,
    struct megasas_cmd_fusion *cmd, *r1_cmd = NULL;
    union MEGASAS_REQUEST_DESCRIPTOR_UNION *req_desc;
    u32 index;
-   struct fusion_context *fusion;
+   bool    is_nonrot;
+   u32 safe_can_queue;
+   u32 num_cpus;
+   struct fusion_context *fusion;
+
+   fusion = instance->ctrl_context;
+
+   num_cpus = num_online_cpus();
+   safe_can_queue = instance->cur_can_queue - num_cpus;

    fusion = instance->ctrl_context;
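The hunk above computes safe_can_queue as cur_can_queue minus the number of
online CPUs, i.e. it reserves one command slot of headroom per CPU before
the udelay() throttle kicks in. As a plain function it looks like this; the
minimum clamp is an added safety guard against unsigned underflow, not
present in the patch itself:

```c
/* safe_can_queue as in the patch above: reserve one command slot per
 * online CPU below the controller's current queue depth. */
unsigned int safe_depth(unsigned int cur_can_queue, unsigned int num_cpus)
{
    if (cur_can_queue <= num_cpus)
        return 1;                   /* guard: avoid unsigned wrap-around */
    return cur_can_queue - num_cpus;
}
```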