Message ID: 20190919094547.67194-1-hare@suse.de (mailing list archive)
Series: blk-mq I/O scheduling fixes
Hello Sir,
I have a question about the I/O scheduler in kernel 5.2.9.
In the new kernel, which I/O scheduler should be used for a legacy rotating drive, such as a SATA HDD?
During FIO testing with libaio I created multiple threads, and found that 512k and larger sequential writes had poor performance, even after enabling and using the BFQ scheduler.
There are no single-queue (sq) schedulers anymore; only none, mq-deadline, kyber and BFQ are available.
Mq-deadline and kyber are aimed at fast block devices. Only BFQ shows better performance, but it cannot keep that good behaviour during 512k or larger 100% sequential writes.
Could you give me some advice on which parameters I should change for multi-threaded sequential writes of larger files?
Thanks to all of you.
Best Regards,
Sunny Liu (刘萍)
LenovoNetApp
L3-E1-01, Building No.2, Lenovo HQ West No.10 XiBeiWang East Rd.,
Haidian District, Beijing 100094, PRC
Tel: +86 15910622368
-----Original Message-----
From: linux-block-owner@vger.kernel.org <linux-block-owner@vger.kernel.org> On Behalf Of Hannes Reinecke
Sent: 19 September 2019 17:46
To: Jens Axboe <axboe@kernel.dk>
Cc: linux-scsi@vger.kernel.org; Martin K. Petersen <martin.petersen@oracle.com>; James Bottomley <james.bottomley@hansenpartnership.com>; Christoph Hellwig <hch@lst.de>; linux-block@vger.kernel.org; Hans Holmberg <hans.holmberg@wdc.com>; Damien Le Moal <damien.lemoal@wdc.com>; Hannes Reinecke <hare@suse.de>
Subject: [RFC PATCH 0/2] blk-mq I/O scheduling fixes
Hi all,
Damien pointed out that there are some areas in the blk-mq I/O scheduling algorithm which have a distinct legacy feel to them, and prevent multiqueue I/O schedulers from working properly.
These two patches should clear up this situation, but as it's not quite clear what the original intention of the code was I'll be posting them as an RFC.
So as usual, comments and reviews are welcome.
Hannes Reinecke (2):
blk-mq: fixup request re-insert in blk_mq_try_issue_list_directly()
blk-mq: always call into the scheduler in blk_mq_make_request()
block/blk-mq.c | 9 ++-------
1 file changed, 2 insertions(+), 7 deletions(-)
--
2.16.4
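For reference, the fio workload described in Sunny's question above corresponds roughly to an invocation like the following. This is only a sketch; the target device path, run time and reporting options are assumptions rather than details taken from this thread:

  fio --name=seqwrite --filename=/dev/sdX --ioengine=libaio --direct=1 \
      --rw=write --bs=512k --iodepth=128 --numjobs=2 \
      --time_based --runtime=60 --group_reporting

With two jobs writing the same device, the drive sees two interleaved sequential streams rather than one, which is where the choice of elevator starts to matter.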
On 2019/09/19 11:57, Liu, Sunny wrote:
> Hello Sir,
>
> I have a question about the I/O scheduler in kernel 5.2.9.
>
> In the new kernel, which I/O scheduler should be used for a legacy rotating
> drive, such as a SATA HDD? During FIO testing with libaio I created multiple
> threads, and found that 512k and larger sequential writes had poor
> performance, even after enabling and using the BFQ scheduler.
>
> There are no single-queue (sq) schedulers anymore; only none, mq-deadline,
> kyber and BFQ are available. Mq-deadline and kyber are aimed at fast block
> devices. Only BFQ shows better performance, but it cannot keep that good
> behaviour during 512k or larger 100% sequential writes.
>
> Could you give me some advice on which parameters I should change for
> multi-threaded sequential writes of larger files?

The default block IO scheduler for a single queue device (e.g. HDDs in most
cases, but beware of the HBA being used and how it exposes the disk) is
mq-deadline. For a multiqueue device (e.g. NVMe SSDs), the default elevator is
none.

For your SATA SSD, which is a single queue device, the default elevator will be
mq-deadline. This elevator should give you very good performance. "none" will
probably also give you the same results though.

Performance on SSD highly depends on the SSD condition (the amount and pattern
of writes preceding the test). You may want to trim the entire device before
writing it to check the maximum performance you can get out of it.
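As a side note, the elevator actually in use for a device can be inspected and changed at runtime through sysfs, with the active scheduler shown in brackets. A sketch, with sdX standing in for the real device name:

  cat /sys/block/sdX/queue/scheduler
  # e.g. "[mq-deadline] kyber bfq none"
  echo bfq > /sys/block/sdX/queue/scheduler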
On 2019/09/19 12:59, Liu, Sunny wrote:
> Thank you very much for your quick advice.
>
> The problem drive is a SATA HDD, 7200 rpm, in RAID 5.

Sorry, I read "SSD" where you had written "HDD" :)

Is this a hardware RAID? Or is this using dm/md RAID?

> If using fio libaio with iodepth=128 and numjobs=2, the bad performance will
> be as below in red. But there is no problem with numjobs=1. In our solution,
> *multiple threads* should be used.

Your data does not have the numjobs=1 case for kernel 5.2.9. You should run
that for comparison with the numjobs=2 case on the same kernel.

> From the testing results, BFQ low-latency had good performance, but it still
> has a problem with 1m sequential writes.
>
> The data comes from CentOS 7.6 (kernel 3.10.0-975) and kernel 5.2.9 with BFQ
> and bcache enabled, but no bcache configured.
>
> Is there any parameter that can solve the 1m and larger sequential write
> problem with multiple threads?

Not sure what the problem is here. You could look at a blktrace of each case to
see if there is any major difference in the command patterns sent to the disks
of your array, in particular the command size.
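A blktrace capture of the kind suggested above could be taken roughly as follows; the device name and capture duration are assumptions. Comparing the request sizes in the parsed output between the numjobs=1 and numjobs=2 runs should show whether the commands reaching the array differ:

  blktrace -d /dev/sdX -w 30 -o seqwrite_trace
  blkparse -i seqwrite_trace | less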
Sir,

The HDD is in a hardware RAID 5 on a 530-8i RAID card. I tried 1m sequential
writes with numjobs=1, and the data is similar to kernel 3.10.0, whether with
the mq-deadline or the BFQ elevator. If you need detailed test data with
numjobs=1, I can provide it, or any other information you need, such as two
processes with one thread each.

Thank you.

Best Regards,
Sunny Liu (刘萍)
LenovoNetApp
L3-E1-01, Building No.2, Lenovo HQ West No.10 XiBeiWang East Rd.,
Haidian District, Beijing 100094, PRC
Tel: +86 15910622368

-----Original Message-----
From: linux-block-owner@vger.kernel.org <linux-block-owner@vger.kernel.org> On Behalf Of Damien Le Moal
Sent: 19 September 2019 20:45
To: Liu, Sunny <ping.liu@lenovonetapp.com>; Hannes Reinecke <hare@suse.de>; Jens Axboe <axboe@kernel.dk>
Cc: linux-scsi@vger.kernel.org; Martin K. Petersen <martin.petersen@oracle.com>; James Bottomley <james.bottomley@hansenpartnership.com>; Christoph Hellwig <hch@lst.de>; linux-block@vger.kernel.org; Hans Holmberg <Hans.Holmberg@wdc.com>
Subject: Re: [RFC PATCH 0/2] blk-mq I/O scheduling fixes
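For completeness, the BFQ low-latency behaviour referred to in these results is a sysfs tunable that exists while BFQ is the active elevator; a sketch, with sdX as a placeholder:

  cat /sys/block/sdX/queue/iosched/low_latency
  echo 0 > /sys/block/sdX/queue/iosched/low_latency    # disable low-latency heuristics
  echo 1 > /sys/block/sdX/queue/iosched/low_latency    # re-enable (the default)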
On 2019-09-19 11:45, Hannes Reinecke wrote:
> Hi all,
>
> Damien pointed out that there are some areas in the blk-mq I/O
> scheduling algorithm which have a distinct legacy feel to them,
> and prevent multiqueue I/O schedulers from working properly.
> These two patches should clear up this situation, but as it's
> not quite clear what the original intention of the code was
> I'll be posting them as an RFC.
>
> So as usual, comments and reviews are welcome.
>
> Hannes Reinecke (2):
>   blk-mq: fixup request re-insert in blk_mq_try_issue_list_directly()
>   blk-mq: always call into the scheduler in blk_mq_make_request()
>
>  block/blk-mq.c | 9 ++-------
>  1 file changed, 2 insertions(+), 7 deletions(-)

I tested this patch set in qemu and confirmed that write locking for ZBD now
works again. The bypass of the scheduler (in the case q->nr_hw_queues > 1 &&
is_sync) is the culprit, and with that removed we're good again for zoned
block devices.

Cheers,
Hans
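For context on the q->nr_hw_queues > 1 condition mentioned here: the number of hardware queues a blk-mq device exposes can be inspected from sysfs, where each hardware context shows up as a numbered directory (the device names below are placeholders):

  ls /sys/block/sdX/mq/        # typically just "0" for a single-queue SATA disk
  ls /sys/block/nvme0n1/mq/    # several directories, one per hardware queue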
On 9/19/19 3:45 AM, Hannes Reinecke wrote:
> Hi all,
>
> Damien pointed out that there are some areas in the blk-mq I/O
> scheduling algorithm which have a distinct legacy feel to them,
> and prevent multiqueue I/O schedulers from working properly.
> These two patches should clear up this situation, but as it's
> not quite clear what the original intention of the code was
> I'll be posting them as an RFC.
>
> So as usual, comments and reviews are welcome.
>
> Hannes Reinecke (2):
>   blk-mq: fixup request re-insert in blk_mq_try_issue_list_directly()
>   blk-mq: always call into the scheduler in blk_mq_make_request()
>
>  block/blk-mq.c | 9 ++-------
>  1 file changed, 2 insertions(+), 7 deletions(-)

Not quite sure what to do with this... Did you test them at all? One is
obviously broken and would crash the kernel, the other is/was a performance
optimization done not that long ago.

Just going to ignore this series for now.
On 2019/09/19 19:48, Jens Axboe wrote:
> On 9/19/19 3:45 AM, Hannes Reinecke wrote:
>> Hi all,
>>
>> Damien pointed out that there are some areas in the blk-mq I/O
>> scheduling algorithm which have a distinct legacy feel to them,
>> and prevent multiqueue I/O schedulers from working properly.
>> These two patches should clear up this situation, but as it's
>> not quite clear what the original intention of the code was
>> I'll be posting them as an RFC.
>>
>> So as usual, comments and reviews are welcome.
>>
>> Hannes Reinecke (2):
>>   blk-mq: fixup request re-insert in blk_mq_try_issue_list_directly()
>>   blk-mq: always call into the scheduler in blk_mq_make_request()
>>
>>  block/blk-mq.c | 9 ++-------
>>  1 file changed, 2 insertions(+), 7 deletions(-)
>
> Not quite sure what to do with this... Did you test them at all?

Yes, Hans tested, but on one device type only, and the bug in patch 1 went
undetected with that test case. Patch 2 does solve our specific problem, which
is that sync writes were bypassing the elevator (mq-deadline), causing
unaligned write errors with a multi-queue zoned device.

> One is obviously broken and would crash the kernel, the other
> is/was a performance optimization done not that long ago.
>
> Just going to ignore this series for now.

Yes, please do. This was hacked together quickly with Hannes yesterday and
Hannes sent it as an RFC. We now have plenty of comments (thanks to all who
provided feedback!) and will work on a proper patch series backed by more
testing.

Best regards.