
[v2] block: BFQ default for single queue devices

Message ID 20181015141059.26579-1-linus.walleij@linaro.org (mailing list archive)
State New, archived

Commit Message

Linus Walleij Oct. 15, 2018, 2:10 p.m. UTC
This sets BFQ as the default scheduler for single queue
block devices (nr_hw_queues == 1) if it is available. This
notably affects MMC/SD cards, but also UBI and the loopback
device.

I have been running it for a while without any negative
effects on my pet systems and I want some wider testing
so let's throw it out there and see what people say.
Admittedly my use cases are limited. I need to keep this
patch around for my personal needs anyway.

We take special care to avoid using BFQ on zoned devices
(in particular SMR, shingled magnetic recording devices)
as these currently require mq-deadline to group writes
together.

I have opted against introducing any default scheduler
through Kconfig, as the mq-deadline enforcement for
zoned devices has to be done at runtime anyway and
too many config options will make things confusing.

My argument for setting a default policy in the kernel
as opposed to user space is of the "reasonable defaults"
type, analogous to how we have one default CPU scheduling
policy (CFS) that makes the most sense for most tasks, and
how automatic process group scheduling happens in most
distributions without userspace involvement. The BFQ
scheduling policy makes the most sense for single hardware
queue devices, and many embedded systems will not have
the clever userspace tools (such as udev) to make an
educated choice of scheduling policy. Defaults should be
those that make the most sense for the hardware.

Cc: Pavel Machek <pavel@ucw.cz>
Cc: Paolo Valente <paolo.valente@linaro.org>
Cc: Jens Axboe <axboe@kernel.dk>
Cc: Ulf Hansson <ulf.hansson@linaro.org>
Cc: Richard Weinberger <richard@nod.at>
Cc: Adrian Hunter <adrian.hunter@intel.com>
Cc: Bart Van Assche <bvanassche@acm.org>
Cc: Jan Kara <jack@suse.cz>
Cc: Artem Bityutskiy <dedekind1@gmail.com>
Cc: Christoph Hellwig <hch@infradead.org>
Cc: Alan Cox <gnomes@lxorguk.ukuu.org.uk>
Cc: Mark Brown <broonie@kernel.org>
Cc: Damien Le Moal <Damien.LeMoal@wdc.com>
Cc: Johannes Thumshirn <jthumshirn@suse.de>
Cc: Oleksandr Natalenko <oleksandr@natalenko.name>
Cc: Jonathan Corbet <corbet@lwn.net>
Signed-off-by: Linus Walleij <linus.walleij@linaro.org>
---
ChangeLog v1->v2:
- Add a quirk so that devices with zoned writes are forced
  to use the deadline scheduler; this is necessary since only
  that scheduler supports zoned writes.
- There is a summary article in LWN for subscribers:
  https://lwn.net/Articles/767987/
---
 block/elevator.c | 22 ++++++++++++++++++----
 1 file changed, 18 insertions(+), 4 deletions(-)
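
For readers skimming the thread, the selection order described in the
commit message can be modelled in a few lines of userspace C. This is
only a sketch: struct queue_model, default_policy() and the bfq_built
flag are made-up names standing in for the request queue, the logic
added to elevator_init_mq() and IS_ENABLED(CONFIG_IOSCHED_BFQ). It is
not the patch itself.

```c
#include <assert.h>
#include <stdbool.h>
#include <string.h>

/* Hypothetical stand-in for the request_queue properties the patch inspects. */
struct queue_model {
	unsigned int nr_hw_queues; /* number of hardware dispatch queues */
	bool zoned;                /* models blk_queue_is_zoned(q) */
};

/*
 * Returns the scheduler name the v2 policy would ask for.
 * bfq_built models IS_ENABLED(CONFIG_IOSCHED_BFQ) (assumption).
 */
static const char *default_policy(const struct queue_model *q, bool bfq_built)
{
	if (q->nr_hw_queues != 1)
		return "none";        /* multiqueue devices keep "none" */
	if (q->zoned)
		return "mq-deadline"; /* zoned quirk takes priority over BFQ */
	if (bfq_built)
		return "bfq";
	return "mq-deadline";
}
```

In the patch the chosen name is then handed to elevator_get(), and if
that lookup fails the queue is left without a scheduler.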

Comments

Paolo Valente Oct. 15, 2018, 2:22 p.m. UTC | #1
> On 15 Oct 2018, at 16:10, Linus Walleij <linus.walleij@linaro.org> wrote:
> 
> This sets BFQ as the default scheduler for single queue
> block devices (nr_hw_queues == 1) if it is available. This
> affects notably MMC/SD-cards but also UBI and the loopback
> device.
> 
> I have been running it for a while without any negative
> effects on my pet systems and I want some wider testing
> so let's throw it out there and see what people say.
> Admittedly my use cases are limited. I need to keep this
> patch around for my personal needs anyway.
> 
> We take special care to avoid using BFQ on zoned devices
> (in particular SMR, shingled magnetic recording devices)
> as these currently require mq-deadline to group writes
> together.
> 
> I have opted against introducing any default scheduler
> through Kconfig as the mq-deadline enforcement for
> zoned devices has to be done at runtime anyways and
> too many config options will make things confusing.
> 
> My argument for setting a default policy in the kernel
> as opposed to user space is the "reasonable defaults"
> type, analogous to how we have one default CPU scheduling
> policy (CFS) that make most sense for most tasks, and
> how automatic process group scheduling happens in most
> distributions without userspace involvement. The BFQ
> scheduling policy makes most sense for single hardware
> queue devices and many embedded systems will not have
> the clever userspace tools (such as udev) to make an
> educated choice of scheduling policy. Defaults should be
> those that make most sense for the hardware.
> 
> Cc: Pavel Machek <pavel@ucw.cz>
> Cc: Paolo Valente <paolo.valente@linaro.org>
> Cc: Jens Axboe <axboe@kernel.dk>
> Cc: Ulf Hansson <ulf.hansson@linaro.org>
> Cc: Richard Weinberger <richard@nod.at>
> Cc: Adrian Hunter <adrian.hunter@intel.com>
> Cc: Bart Van Assche <bvanassche@acm.org>
> Cc: Jan Kara <jack@suse.cz>
> Cc: Artem Bityutskiy <dedekind1@gmail.com>
> Cc: Christoph Hellwig <hch@infradead.org>
> Cc: Alan Cox <gnomes@lxorguk.ukuu.org.uk>
> Cc: Mark Brown <broonie@kernel.org>
> Cc: Damien Le Moal <Damien.LeMoal@wdc.com>
> Cc: Johannes Thumshirn <jthumshirn@suse.de>
> Cc: Oleksandr Natalenko <oleksandr@natalenko.name>
> Cc: Jonathan Corbet <corbet@lwn.net>
> Signed-off-by: Linus Walleij <linus.walleij@linaro.org>

Unless someone reports (hopefully reproducible) regressions with
common single-queue hardware, then
Acked-by: Paolo Valente <paolo.valente@linaro.org>

Thanks,
Paolo

> ---
> ChangeLog v1->v2:
> - Add a quirk so that devices with zoned writes are forced
>  to use the deadline scheduler, this is necessary since only
>  that scheduler supports zoned writes.
> - There is a summary article in LWN for subscribers:
>  https://lwn.net/Articles/767987/
> ---
> block/elevator.c | 22 ++++++++++++++++++----
> 1 file changed, 18 insertions(+), 4 deletions(-)
> 
> diff --git a/block/elevator.c b/block/elevator.c
> index 8fdcd64ae12e..6e6048ca3471 100644
> --- a/block/elevator.c
> +++ b/block/elevator.c
> @@ -948,13 +948,16 @@ int elevator_switch_mq(struct request_queue *q,
> }
> 
> /*
> - * For blk-mq devices, we default to using mq-deadline, if available, for single
> - * queue devices.  If deadline isn't available OR we have multiple queues,
> - * default to "none".
> + * For blk-mq devices, we default to using:
> + * - "none" for multiqueue devices (nr_hw_queues != 1)
> + * - "bfq", if available, for single queue devices
> + * - "mq-deadline" if "bfq" is not available for single queue devices
> + * - "none" for single queue devices as well as last resort
>  */
> int elevator_init_mq(struct request_queue *q)
> {
> 	struct elevator_type *e;
> +	const char *policy;
> 	int err = 0;
> 
> 	if (q->nr_hw_queues != 1)
> @@ -968,7 +971,18 @@ int elevator_init_mq(struct request_queue *q)
> 	if (unlikely(q->elevator))
> 		goto out_unlock;
> 
> -	e = elevator_get(q, "mq-deadline", false);
> +	/*
> +	 * Zoned devices must use a deadline scheduler because currently
> +	 * that is the only scheduler respecting zoned writes.
> +	 */
> +	if (blk_queue_is_zoned(q))
> +		policy = "mq-deadline";
> +	else if (IS_ENABLED(CONFIG_IOSCHED_BFQ))
> +		policy = "bfq";
> +	else
> +		policy = "mq-deadline";
> +
> +	e = elevator_get(q, policy, false);
> 	if (!e)
> 		goto out_unlock;
> 
> -- 
> 2.17.2
>
Oleksandr Natalenko Oct. 15, 2018, 2:32 p.m. UTC | #2
Hi.

On 15.10.2018 16:10, Linus Walleij wrote:
> This sets BFQ as the default scheduler for single queue
> block devices (nr_hw_queues == 1) if it is available. This
> affects notably MMC/SD-cards but also UBI and the loopback
> device.
> 
> I have been running it for a while without any negative
> effects on my pet systems and I want some wider testing
> so let's throw it out there and see what people say.
> Admittedly my use cases are limited. I need to keep this
> patch around for my personal needs anyway.
> 
> We take special care to avoid using BFQ on zoned devices
> (in particular SMR, shingled magnetic recording devices)
> as these currently require mq-deadline to group writes
> together.
> 
> I have opted against introducing any default scheduler
> through Kconfig as the mq-deadline enforcement for
> zoned devices has to be done at runtime anyways and
> too many config options will make things confusing.
> 
> My argument for setting a default policy in the kernel
> as opposed to user space is the "reasonable defaults"
> type, analogous to how we have one default CPU scheduling
> policy (CFS) that make most sense for most tasks, and
> how automatic process group scheduling happens in most
> distributions without userspace involvement. The BFQ
> scheduling policy makes most sense for single hardware
> queue devices and many embedded systems will not have
> the clever userspace tools (such as udev) to make an
> educated choice of scheduling policy. Defaults should be
> those that make most sense for the hardware.
> 
> Cc: Pavel Machek <pavel@ucw.cz>
> Cc: Paolo Valente <paolo.valente@linaro.org>
> Cc: Jens Axboe <axboe@kernel.dk>
> Cc: Ulf Hansson <ulf.hansson@linaro.org>
> Cc: Richard Weinberger <richard@nod.at>
> Cc: Adrian Hunter <adrian.hunter@intel.com>
> Cc: Bart Van Assche <bvanassche@acm.org>
> Cc: Jan Kara <jack@suse.cz>
> Cc: Artem Bityutskiy <dedekind1@gmail.com>
> Cc: Christoph Hellwig <hch@infradead.org>
> Cc: Alan Cox <gnomes@lxorguk.ukuu.org.uk>
> Cc: Mark Brown <broonie@kernel.org>
> Cc: Damien Le Moal <Damien.LeMoal@wdc.com>
> Cc: Johannes Thumshirn <jthumshirn@suse.de>
> Cc: Oleksandr Natalenko <oleksandr@natalenko.name>
> Cc: Jonathan Corbet <corbet@lwn.net>
> Signed-off-by: Linus Walleij <linus.walleij@linaro.org>
> ---
> ChangeLog v1->v2:
> - Add a quirk so that devices with zoned writes are forced
>   to use the deadline scheduler, this is necessary since only
>   that scheduler supports zoned writes.
> - There is a summary article in LWN for subscribers:
>   https://lwn.net/Articles/767987/
> ---
>  block/elevator.c | 22 ++++++++++++++++++----
>  1 file changed, 18 insertions(+), 4 deletions(-)
> 
> diff --git a/block/elevator.c b/block/elevator.c
> index 8fdcd64ae12e..6e6048ca3471 100644
> --- a/block/elevator.c
> +++ b/block/elevator.c
> @@ -948,13 +948,16 @@ int elevator_switch_mq(struct request_queue *q,
>  }
> 
>  /*
> - * For blk-mq devices, we default to using mq-deadline, if available,
> for single
> - * queue devices.  If deadline isn't available OR we have multiple 
> queues,
> - * default to "none".
> + * For blk-mq devices, we default to using:
> + * - "none" for multiqueue devices (nr_hw_queues != 1)
> + * - "bfq", if available, for single queue devices
> + * - "mq-deadline" if "bfq" is not available for single queue devices
> + * - "none" for single queue devices as well as last resort
>   */
>  int elevator_init_mq(struct request_queue *q)
>  {
>  	struct elevator_type *e;
> +	const char *policy;
>  	int err = 0;
> 
>  	if (q->nr_hw_queues != 1)
> @@ -968,7 +971,18 @@ int elevator_init_mq(struct request_queue *q)
>  	if (unlikely(q->elevator))
>  		goto out_unlock;
> 
> -	e = elevator_get(q, "mq-deadline", false);
> +	/*
> +	 * Zoned devices must use a deadline scheduler because currently
> +	 * that is the only scheduler respecting zoned writes.
> +	 */
> +	if (blk_queue_is_zoned(q))
> +		policy = "mq-deadline";
> +	else if (IS_ENABLED(CONFIG_IOSCHED_BFQ))
> +		policy = "bfq";
> +	else
> +		policy = "mq-deadline";

If more rules are needed in the future, shall we just add extra ifs,
or would it be better to craft a struct/table plus a policy search
helper now?
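
One way to read the struct/table idea, as a userspace sketch only: an
ordered array of rules, each with a predicate and a scheduler name,
walked by a small search helper. struct queue_props, struct sched_rule
and lookup_policy() are hypothetical names, not existing kernel
symbols.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>
#include <string.h>

/* Hypothetical queue properties; the kernel would inspect struct request_queue. */
struct queue_props {
	bool zoned;     /* device uses zoned writes */
	bool bfq_built; /* BFQ is enabled in the kernel config */
};

/* One rule: if match() accepts the queue, use the named scheduler. */
struct sched_rule {
	bool (*match)(const struct queue_props *p);
	const char *policy;
};

static bool rule_zoned(const struct queue_props *p) { return p->zoned; }
static bool rule_bfq(const struct queue_props *p)   { return p->bfq_built; }
static bool rule_any(const struct queue_props *p)   { (void)p; return true; }

/* Rules are tried in order, so the zoned quirk wins over BFQ. */
static const struct sched_rule sched_rules[] = {
	{ rule_zoned, "mq-deadline" },
	{ rule_bfq,   "bfq" },
	{ rule_any,   "mq-deadline" },
};

/* Policy search helper: first matching rule decides. */
static const char *lookup_policy(const struct queue_props *p)
{
	size_t i;

	for (i = 0; i < sizeof(sched_rules) / sizeof(sched_rules[0]); i++)
		if (sched_rules[i].match(p))
			return sched_rules[i].policy;
	return "none"; /* not reached while the catch-all rule is present */
}
```

Adding a rule then means adding one predicate and one table entry, at
the cost of more code than the three-branch if/else for today's needs.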

> +
> +	e = elevator_get(q, policy, false);
>  	if (!e)
>  		goto out_unlock;
Bart Van Assche Oct. 15, 2018, 3:02 p.m. UTC | #3
On Mon, 2018-10-15 at 16:10 +0200, Linus Walleij wrote:
> + * For blk-mq devices, we default to using:
> + * - "none" for multiqueue devices (nr_hw_queues != 1)
> + * - "bfq", if available, for single queue devices
> + * - "mq-deadline" if "bfq" is not available for single queue devices
> + * - "none" for single queue devices as well as last resort

For SATA SSDs nr_hw_queues == 1 so this patch will also affect these SSDs.
Since this patch is an attempt to improve performance, I'd like to see
measurement data for one or more recent SATA SSDs before a decision is
taken about what to do with this patch. 

Thanks,

Bart.
Jens Axboe Oct. 15, 2018, 3:39 p.m. UTC | #4
On 10/15/18 8:10 AM, Linus Walleij wrote:
> This sets BFQ as the default scheduler for single queue
> block devices (nr_hw_queues == 1) if it is available. This
> affects notably MMC/SD-cards but also UBI and the loopback
> device.
> 
> I have been running it for a while without any negative
> effects on my pet systems and I want some wider testing
> so let's throw it out there and see what people say.
> Admittedly my use cases are limited. I need to keep this
> patch around for my personal needs anyway.
> 
> We take special care to avoid using BFQ on zoned devices
> (in particular SMR, shingled magnetic recording devices)
> as these currently require mq-deadline to group writes
> together.
> 
> I have opted against introducing any default scheduler
> through Kconfig as the mq-deadline enforcement for
> zoned devices has to be done at runtime anyways and
> too many config options will make things confusing.
> 
> My argument for setting a default policy in the kernel
> as opposed to user space is the "reasonable defaults"
> type, analogous to how we have one default CPU scheduling
> policy (CFS) that make most sense for most tasks, and
> how automatic process group scheduling happens in most
> distributions without userspace involvement. The BFQ
> scheduling policy makes most sense for single hardware
> queue devices and many embedded systems will not have
> the clever userspace tools (such as udev) to make an
> educated choice of scheduling policy. Defaults should be
> those that make most sense for the hardware.

I still don't like this. There are going to be tons of
cases where the single queue device is some hw raid setup
or similar, where performance is going to be much worse with
BFQ than it is with mq-deadline, for instance. That's just
one case.

This kind of policy does not belong in the kernel, at least
not in the current form. If we had some sort of "enable best
options for a desktop" then it could fall under that umbrella.
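
For comparison, the userspace route amounts to writing the scheduler
name to the queue's sysfs attribute, which is what a udev rule or an
init script would do. A minimal sketch in C, assuming the usual
/sys/block/<dev>/queue/scheduler layout; scheduler_path() and
set_scheduler() are illustrative helper names, and the write needs
root.

```c
#include <assert.h>
#include <stdio.h>
#include <string.h>

/* Builds "/sys/block/<dev>/queue/scheduler"; returns 0, or -1 if buf is too small. */
static int scheduler_path(const char *dev, char *buf, size_t len)
{
	int n = snprintf(buf, len, "/sys/block/%s/queue/scheduler", dev);

	return (n < 0 || (size_t)n >= len) ? -1 : 0;
}

/* Writes the scheduler name to sysfs. Returns 0 on success, -1 on any error. */
static int set_scheduler(const char *dev, const char *name)
{
	char path[256];
	FILE *f;
	int ok;

	if (scheduler_path(dev, path, sizeof(path)) != 0)
		return -1;
	f = fopen(path, "w");
	if (!f)
		return -1;
	ok = fputs(name, f) >= 0;
	/* A rejected name is reported when the buffered write is flushed. */
	if (fclose(f) != 0)
		ok = 0;
	return ok ? 0 : -1;
}
```

A call such as set_scheduler("mmcblk0", "bfq") would then select BFQ
for that device, provided the scheduler is available in the running
kernel.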
Paolo Valente Oct. 15, 2018, 6:26 p.m. UTC | #5
> On 15 Oct 2018, at 17:39, Jens Axboe <axboe@kernel.dk> wrote:
> 
> On 10/15/18 8:10 AM, Linus Walleij wrote:
>> This sets BFQ as the default scheduler for single queue
>> block devices (nr_hw_queues == 1) if it is available. This
>> affects notably MMC/SD-cards but also UBI and the loopback
>> device.
>> 
>> I have been running it for a while without any negative
>> effects on my pet systems and I want some wider testing
>> so let's throw it out there and see what people say.
>> Admittedly my use cases are limited. I need to keep this
>> patch around for my personal needs anyway.
>> 
>> We take special care to avoid using BFQ on zoned devices
>> (in particular SMR, shingled magnetic recording devices)
>> as these currently require mq-deadline to group writes
>> together.
>> 
>> I have opted against introducing any default scheduler
>> through Kconfig as the mq-deadline enforcement for
>> zoned devices has to be done at runtime anyways and
>> too many config options will make things confusing.
>> 
>> My argument for setting a default policy in the kernel
>> as opposed to user space is the "reasonable defaults"
>> type, analogous to how we have one default CPU scheduling
>> policy (CFS) that make most sense for most tasks, and
>> how automatic process group scheduling happens in most
>> distributions without userspace involvement. The BFQ
>> scheduling policy makes most sense for single hardware
>> queue devices and many embedded systems will not have
>> the clever userspace tools (such as udev) to make an
>> educated choice of scheduling policy. Defaults should be
>> those that make most sense for the hardware.
> 
> I still don't like this. There are going to be tons of
> cases where the single queue device is some hw raid setup
> or similar, where performance is going to be much worse with
> BFQ than it is with mq-deadline, for instance. That's just
> one case.
> 

Hi Jens,
in my RAID tests bfq performed as well as in non-RAID tests.  Probably
you refer to the fact that, in a RAID configuration, IOPS can become
very high.  But, if that is the case, then the response to your
objections already emerged in the previous thread.  Let me sum it up
again.

I tested bfq on virtually every device in the range from a few hundred
IOPS to 50-100 KIOPS.  Then, through the public script I already
mentioned, I found the maximum number of IOPS that bfq can handle:
about 400K with a commodity CPU.

In particular, in all my tests with real hardware, bfq performance
- is far better than that of any of the other schedulers, in
  terms of responsiveness, latency for real-time applications, ability
  to provide strong bandwidth guarantees, ability to boost throughput
  while guaranteeing bandwidths;
- is a little worse than the other schedulers for only one test, on
  only some hardware: total throughput with random reads, where it may
  lose up to 10-15% of throughput.  Of course, the schedulers that reach
  a higher throughput leave the machine unusable during the test.

So I really cannot see a reason why bfq could do worse than any of
these other schedulers for some single-queue device (conservatively)
below 300 KIOPS.

Finally, since, AFAICT, single-queue devices doing 400+ KIOPS are
probably less than 1% of all the single-queue storage around (USB
drives, HDDs, eMMC, standard SSDs, ...), by sticking to mq-deadline we
are sacrificing 99% of the hardware to help 1% of the hardware, for
one kind of test case.

> This kind of policy does not belong in the kernel, at least
> not in the current form. If we had some sort of "enable best
> options for a desktop" then it could fall under that umbrella.
> 

I don't think bfq can be considered a scheduler for only desktops any
longer.

Thanks,
Paolo

> -- 
> Jens Axboe
Paolo Valente Oct. 15, 2018, 6:34 p.m. UTC | #6
> On 15 Oct 2018, at 17:02, Bart Van Assche <bvanassche@acm.org> wrote:
> 
> On Mon, 2018-10-15 at 16:10 +0200, Linus Walleij wrote:
>> + * For blk-mq devices, we default to using:
>> + * - "none" for multiqueue devices (nr_hw_queues != 1)
>> + * - "bfq", if available, for single queue devices
>> + * - "mq-deadline" if "bfq" is not available for single queue devices
>> + * - "none" for single queue devices as well as last resort
> 
> For SATA SSDs nr_hw_queues == 1 so this patch will also affect these SSDs.
> Since this patch is an attempt to improve performance, I'd like to see
> measurement data for one or more recent SATA SSDs before a decision is
> taken about what to do with this patch. 
> 

Hi Bart,
as I just wrote to Jens, I don't think we need this test any longer.
To save you one hop, I'll paste my reply to Jens below.

Anyway, it is very easy to run the tests you ask for:
- take a kernel containing the latest bfq commits, such as for-next
- do, e.g.,
git clone https://github.com/Algodev-github/S.git
cd S/run_multiple_benchmarks
sudo ./run_main_benchmarks.sh "throughput replayed-startup" "bfq none"
- compare results

Of course, do not do it for multi-queue devices, or for single-queue
devices on steroids that do 400-500 KIOPS.

I'll see if I can convince someone to repeat these tests with a recent
SSD.
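
When comparing runs it helps to confirm which scheduler was actually
active. The sysfs scheduler attribute lists the available schedulers
with the current one in square brackets (for example
"mq-deadline kyber [bfq] none"). The helper below, active_scheduler(),
is a hypothetical name and only parses such a line; reading the file
is left out.

```c
#include <assert.h>
#include <stddef.h>
#include <string.h>

/*
 * Copies the bracketed (active) scheduler name from a line such as
 * "mq-deadline kyber [bfq] none" into out. Returns 0 on success,
 * -1 if no bracketed name is found or out is too small.
 */
static int active_scheduler(const char *line, char *out, size_t len)
{
	const char *lb = strchr(line, '[');
	const char *rb = lb ? strchr(lb, ']') : NULL;
	size_t n;

	if (!lb || !rb)
		return -1;
	n = (size_t)(rb - lb - 1); /* length of the name between the brackets */
	if (n + 1 > len)
		return -1;         /* no room for the name plus terminator */
	memcpy(out, lb + 1, n);
	out[n] = '\0';
	return 0;
}
```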

And here is again my reply to Jens, which I think holds for your repeated
objection too.

I tested bfq on virtually every device in the range from a few hundred
IOPS to 50-100 KIOPS.  Then, through the public script I already
mentioned, I found the maximum number of IOPS that bfq can handle:
about 400K with a commodity CPU.

In particular, in all my tests with real hardware, bfq performance
- is far better than that of any of the other schedulers, in
 terms of responsiveness, latency for real-time applications, ability
 to provide strong bandwidth guarantees, ability to boost throughput
 while guaranteeing bandwidths;
- is a little worse than the other schedulers for only one test, on
 only some hardware: total throughput with random reads, where it may
 lose up to 10-15% of throughput.  Of course, the schedulers that reach
 a higher throughput leave the machine unusable during the test.

So I really cannot see a reason why bfq could do worse than any of
these other schedulers for some single-queue device (conservatively)
below 300 KIOPS.

Finally, since, AFAICT, single-queue devices doing 400+ KIOPS are
probably less than 1% of all the single-queue storage around (USB
drives, HDDs, eMMC, standard SSDs, ...), by sticking to mq-deadline we
are sacrificing 99% of the hardware to help 1% of the hardware, for
one kind of test case.

Thanks,
Paolo

> Thanks,
> 
> Bart.
>
Jens Axboe Oct. 15, 2018, 7:26 p.m. UTC | #7
On 10/15/18 12:26 PM, Paolo Valente wrote:
> 
> 
>> On 15 Oct 2018, at 17:39, Jens Axboe <axboe@kernel.dk> wrote:
>>
>> On 10/15/18 8:10 AM, Linus Walleij wrote:
>>> This sets BFQ as the default scheduler for single queue
>>> block devices (nr_hw_queues == 1) if it is available. This
>>> affects notably MMC/SD-cards but also UBI and the loopback
>>> device.
>>>
>>> I have been running it for a while without any negative
>>> effects on my pet systems and I want some wider testing
>>> so let's throw it out there and see what people say.
>>> Admittedly my use cases are limited. I need to keep this
>>> patch around for my personal needs anyway.
>>>
>>> We take special care to avoid using BFQ on zoned devices
>>> (in particular SMR, shingled magnetic recording devices)
>>> as these currently require mq-deadline to group writes
>>> together.
>>>
>>> I have opted against introducing any default scheduler
>>> through Kconfig as the mq-deadline enforcement for
>>> zoned devices has to be done at runtime anyways and
>>> too many config options will make things confusing.
>>>
>>> My argument for setting a default policy in the kernel
>>> as opposed to user space is the "reasonable defaults"
>>> type, analogous to how we have one default CPU scheduling
>>> policy (CFS) that make most sense for most tasks, and
>>> how automatic process group scheduling happens in most
>>> distributions without userspace involvement. The BFQ
>>> scheduling policy makes most sense for single hardware
>>> queue devices and many embedded systems will not have
>>> the clever userspace tools (such as udev) to make an
>>> educated choice of scheduling policy. Defaults should be
>>> those that make most sense for the hardware.
>>
>> I still don't like this. There are going to be tons of
>> cases where the single queue device is some hw raid setup
>> or similar, where performance is going to be much worse with
>> BFQ than it is with mq-deadline, for instance. That's just
>> one case.
>>
> 
> Hi Jens,
> in my RAID tests bfq performed as well as in non-RAID tests.  Probably
> you refer to the fact that, in a RAID configuration, IOPS can become
> very high.  But, if that is the case, then the response to your
> objections already emerged in the previous thread.  Let me sum it up
> again.
> 
> I tested bfq on virtually every device in the range from few hundred
> of IOPS to 50-100KIOPS.  Then, through the public script I already
> mentioned, I found the maximum number of IOPS that bfq can handle:
> about 400K with a commodity CPU.
> 
> In particular, in all my tests with real hardware, bfq
> - is not even comparable to that of any of the other scheduler, in
>   terms of responsiveness, latency for real-time applications, ability
>   to provide strong bandwidth guarantees, ability to boost throughput
>   while guaranteeing bandwidths;
> - is a little worse than the other scheduler for only one test, on
>   only some hardware: total throughput with random reads, were it may
>   lose up to 10-15% of throughput.  Of course, the scheduler that reach
>   a higher throughput leave the machine unusable during the test.
> 
> So I really cannot see a reason why bfq could do worse than any of
> these other schedulers for some single-queue device (conservatively)
> below 300KIOPS.
> 
> Finally, since, AFAICT, single-queue devices doing 400+ KIOPS are
> probably less than 1% of all the single-queue storage around (USB
> drives, HDDs, eMMC, standard SSDs, ...), by sticking to mq-deadline we
> are sacrificing 99% of the hardware, to help 1% of the hardware, for
> one kind of test cases.

I should have been more clear - I'm not worried about IOPS overhead,
I'm worried about scheduling decisions that lower performance on
(for instance) raid composed of many drives (rotational or otherwise).

If you have actual data (on what hardware, and what kind of tests)
to disprove that worry, then that's great, and I'd love to see that.
Paolo Valente Oct. 15, 2018, 7:44 p.m. UTC | #8
> On 15 Oct 2018, at 21:26, Jens Axboe <axboe@kernel.dk> wrote:
> 
> On 10/15/18 12:26 PM, Paolo Valente wrote:
>> 
>> 
>>> On 15 Oct 2018, at 17:39, Jens Axboe <axboe@kernel.dk> wrote:
>>> 
>>> On 10/15/18 8:10 AM, Linus Walleij wrote:
>>>> This sets BFQ as the default scheduler for single queue
>>>> block devices (nr_hw_queues == 1) if it is available. This
>>>> affects notably MMC/SD-cards but also UBI and the loopback
>>>> device.
>>>> 
>>>> I have been running it for a while without any negative
>>>> effects on my pet systems and I want some wider testing
>>>> so let's throw it out there and see what people say.
>>>> Admittedly my use cases are limited. I need to keep this
>>>> patch around for my personal needs anyway.
>>>> 
>>>> We take special care to avoid using BFQ on zoned devices
>>>> (in particular SMR, shingled magnetic recording devices)
>>>> as these currently require mq-deadline to group writes
>>>> together.
>>>> 
>>>> I have opted against introducing any default scheduler
>>>> through Kconfig as the mq-deadline enforcement for
>>>> zoned devices has to be done at runtime anyways and
>>>> too many config options will make things confusing.
>>>> 
>>>> My argument for setting a default policy in the kernel
>>>> as opposed to user space is the "reasonable defaults"
>>>> type, analogous to how we have one default CPU scheduling
>>>> policy (CFS) that make most sense for most tasks, and
>>>> how automatic process group scheduling happens in most
>>>> distributions without userspace involvement. The BFQ
>>>> scheduling policy makes most sense for single hardware
>>>> queue devices and many embedded systems will not have
>>>> the clever userspace tools (such as udev) to make an
>>>> educated choice of scheduling policy. Defaults should be
>>>> those that make most sense for the hardware.
>>> 
>>> I still don't like this. There are going to be tons of
>>> cases where the single queue device is some hw raid setup
>>> or similar, where performance is going to be much worse with
>>> BFQ than it is with mq-deadline, for instance. That's just
>>> one case.
>>> 
>> 
>> Hi Jens,
>> in my RAID tests bfq performed as well as in non-RAID tests.  Probably
>> you refer to the fact that, in a RAID configuration, IOPS can become
>> very high.  But, if that is the case, then the response to your
>> objections already emerged in the previous thread.  Let me sum it up
>> again.
>> 
>> I tested bfq on virtually every device in the range from few hundred
>> of IOPS to 50-100KIOPS.  Then, through the public script I already
>> mentioned, I found the maximum number of IOPS that bfq can handle:
>> about 400K with a commodity CPU.
>> 
>> In particular, in all my tests with real hardware, bfq
>> - is not even comparable to that of any of the other scheduler, in
>>  terms of responsiveness, latency for real-time applications, ability
>>  to provide strong bandwidth guarantees, ability to boost throughput
>>  while guaranteeing bandwidths;
>> - is a little worse than the other scheduler for only one test, on
>>  only some hardware: total throughput with random reads, were it may
>>  lose up to 10-15% of throughput.  Of course, the scheduler that reach
>>  a higher throughput leave the machine unusable during the test.
>> 
>> So I really cannot see a reason why bfq could do worse than any of
>> these other schedulers for some single-queue device (conservatively)
>> below 300KIOPS.
>> 
>> Finally, since, AFAICT, single-queue devices doing 400+ KIOPS are
>> probably less than 1% of all the single-queue storage around (USB
>> drives, HDDs, eMMC, standard SSDs, ...), by sticking to mq-deadline we
>> are sacrificing 99% of the hardware, to help 1% of the hardware, for
>> one kind of test cases.
> 
> I should have been more clear - I'm not worried about IOPS overhead,
> I'm worried about scheduling decisions that lower performance on
> (for instance) raid composed of many drives (rotational or otherwise).
> 
> If you have actual data (on what hardware, and what kind of tests)
> to disprove that worry, then that's great, and I'd love to see that.
> 

Here are some old results with a very simple configuration:
http://algo.ing.unimo.it/people/paolo/disk_sched/old-results/4.4.0-v7r11/
http://algo.ing.unimo.it/people/paolo/disk_sched/old-results/3.14.0-v7r3/
http://algo.ing.unimo.it/people/paolo/disk_sched/old-results/3.13.0-v7r2/

Then I stopped repeating tests that always yielded the same good results.

As for more professional systems, a well-known company doing
real-time packet-traffic dumping asked me to modify bfq so as to
guarantee lossless data writing also during queries.  The involved box
had a RAID reaching a few Gbps, and everything worked well.

Anyway, if you have specific issues in mind, I can check more deeply.

Thanks,
Paolo

> 
> -- 
> Jens Axboe
Ulf Hansson Oct. 16, 2018, 1:42 p.m. UTC | #9
On 15 October 2018 at 16:10, Linus Walleij <linus.walleij@linaro.org> wrote:
> This sets BFQ as the default scheduler for single queue
> block devices (nr_hw_queues == 1) if it is available. This
> affects notably MMC/SD-cards but also UBI and the loopback
> device.
>
> I have been running it for a while without any negative
> effects on my pet systems and I want some wider testing
> so let's throw it out there and see what people say.
> Admittedly my use cases are limited. I need to keep this
> patch around for my personal needs anyway.
>
> We take special care to avoid using BFQ on zoned devices
> (in particular SMR, shingled magnetic recording devices)
> as these currently require mq-deadline to group writes
> together.
>
> I have opted against introducing any default scheduler
> through Kconfig as the mq-deadline enforcement for
> zoned devices has to be done at runtime anyways and
> too many config options will make things confusing.
>
> My argument for setting a default policy in the kernel
> as opposed to user space is the "reasonable defaults"
> type, analogous to how we have one default CPU scheduling
> policy (CFS) that make most sense for most tasks, and
> how automatic process group scheduling happens in most
> distributions without userspace involvement. The BFQ
> scheduling policy makes most sense for single hardware
> queue devices and many embedded systems will not have
> the clever userspace tools (such as udev) to make an
> educated choice of scheduling policy. Defaults should be
> those that make most sense for the hardware.

As already stated for v1, this makes perfect sense to me, thanks for posting it!

I do understand there is some pushback from Bart and Jens about how
to move this forward. However, let's hope they can be convinced to try
this out.

When it comes to potential "performance" regressions, I am sure Paolo
is standing by to help out with BFQ changes, if needed. Moreover, we
can always do a simple revert in the worst-case scenario, especially since
the change is really limited.

>
> Cc: Pavel Machek <pavel@ucw.cz>
> Cc: Paolo Valente <paolo.valente@linaro.org>
> Cc: Jens Axboe <axboe@kernel.dk>
> Cc: Ulf Hansson <ulf.hansson@linaro.org>
> Cc: Richard Weinberger <richard@nod.at>
> Cc: Adrian Hunter <adrian.hunter@intel.com>
> Cc: Bart Van Assche <bvanassche@acm.org>
> Cc: Jan Kara <jack@suse.cz>
> Cc: Artem Bityutskiy <dedekind1@gmail.com>
> Cc: Christoph Hellwig <hch@infradead.org>
> Cc: Alan Cox <gnomes@lxorguk.ukuu.org.uk>
> Cc: Mark Brown <broonie@kernel.org>
> Cc: Damien Le Moal <Damien.LeMoal@wdc.com>
> Cc: Johannes Thumshirn <jthumshirn@suse.de>
> Cc: Oleksandr Natalenko <oleksandr@natalenko.name>
> Cc: Jonathan Corbet <corbet@lwn.net>
> Signed-off-by: Linus Walleij <linus.walleij@linaro.org>

So FWIW:

Reviewed-by: Ulf Hansson <ulf.hansson@linaro.org>

Kind regards
Uffe

> ---
> ChangeLog v1->v2:
> - Add a quirk so that devices with zoned writes are forced
>   to use the deadline scheduler, this is necessary since only
>   that scheduler supports zoned writes.
> - There is a summary article in LWN for subscribers:
>   https://lwn.net/Articles/767987/
> ---
>  block/elevator.c | 22 ++++++++++++++++++----
>  1 file changed, 18 insertions(+), 4 deletions(-)
>
> diff --git a/block/elevator.c b/block/elevator.c
> index 8fdcd64ae12e..6e6048ca3471 100644
> --- a/block/elevator.c
> +++ b/block/elevator.c
> @@ -948,13 +948,16 @@ int elevator_switch_mq(struct request_queue *q,
>  }
>
>  /*
> - * For blk-mq devices, we default to using mq-deadline, if available, for single
> - * queue devices.  If deadline isn't available OR we have multiple queues,
> - * default to "none".
> + * For blk-mq devices, we default to using:
> + * - "none" for multiqueue devices (nr_hw_queues != 1)
> + * - "bfq", if available, for single queue devices
> + * - "mq-deadline" if "bfq" is not available for single queue devices
> + * - "none" for single queue devices as well as last resort
>   */
>  int elevator_init_mq(struct request_queue *q)
>  {
>         struct elevator_type *e;
> +       const char *policy;
>         int err = 0;
>
>         if (q->nr_hw_queues != 1)
> @@ -968,7 +971,18 @@ int elevator_init_mq(struct request_queue *q)
>         if (unlikely(q->elevator))
>                 goto out_unlock;
>
> -       e = elevator_get(q, "mq-deadline", false);
> +       /*
> +        * Zoned devices must use a deadline scheduler because currently
> +        * that is the only scheduler respecting zoned writes.
> +        */
> +       if (blk_queue_is_zoned(q))
> +               policy = "mq-deadline";
> +       else if (IS_ENABLED(CONFIG_IOSCHED_BFQ))
> +               policy = "bfq";
> +       else
> +               policy = "mq-deadline";
> +
> +       e = elevator_get(q, policy, false);
>         if (!e)
>                 goto out_unlock;
>
> --
> 2.17.2
>
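
The selection order in the hunk above can be modelled in user space. This is only a sketch: default_policy() and its three parameters are hypothetical names standing in for q->nr_hw_queues, blk_queue_is_zoned(q) and IS_ENABLED(CONFIG_IOSCHED_BFQ), and it does not model the final fallback to "none" when elevator_get() fails.

```c
#include <assert.h>
#include <stdbool.h>
#include <string.h>

/*
 * Hypothetical user-space model of the default chosen by the patched
 * elevator_init_mq(). Not kernel code.
 */
static const char *default_policy(unsigned int nr_hw_queues, bool zoned,
                                  bool bfq_built)
{
        assert(nr_hw_queues > 0);

        /* Multiqueue devices keep "none". */
        if (nr_hw_queues != 1)
                return "none";

        /* Zoned devices are pinned to mq-deadline regardless of BFQ. */
        if (zoned)
                return "mq-deadline";

        /* Otherwise prefer BFQ when it is built, else mq-deadline. */
        return bfq_built ? "bfq" : "mq-deadline";
}
```

Putting the zoned check first is what keeps SMR drives on mq-deadline even on kernels where BFQ is enabled.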
Federico Motta Oct. 16, 2018, 4:14 p.m. UTC | #10
On 10/15/18 5:02 PM, Bart Van Assche wrote:
> On Mon, 2018-10-15 at 16:10 +0200, Linus Walleij wrote:
>> + * For blk-mq devices, we default to using:
>> + * - "none" for multiqueue devices (nr_hw_queues != 1)
>> + * - "bfq", if available, for single queue devices
>> + * - "mq-deadline" if "bfq" is not available for single queue devices
>> + * - "none" for single queue devices as well as last resort
> 
> For SATA SSDs nr_hw_queues == 1 so this patch will also affect these SSDs.
> Since this patch is an attempt to improve performance, I'd like to see
> measurement data for one or more recent SATA SSDs before a decision is
> taken about what to do with this patch. 
> 
> Thanks,
> 
> Bart.
> 

Hi,
although these tests should be run on single-queue devices, I tried to
run them on a high-performance NVMe device. IMHO, if results are good
on such a "difficult to deal with" multi-queue device, they should be
good enough on a "simpler" single-queue storage device as well.

Testbed specs:
kernel = 4.18.0 (from bfq dev branch [1], where bfq already contains
                 also the commits that will be available from 4.20)
fs     = ext4
drive  = Samsung 960 PRO NVMe M.2 SSD, 512 GB

Device data sheet specs state that under random IO:
* QD  1 thread 1
  * read  = 14 kIOPS
  * write = 50 kIOPS
* QD 32 thread 4
  * read = write = 330 kIOPS

What follows is a summary of the results; on request I can provide the
full results. The workload notation (e.g. 5r5w-seq) means:
- num_readers                  (5r)
- num_writers                  (5w)
- sequential_io or random_io   (-seq)


# replayed gnome-terminal startup time (lower is better)
workload  bfq-mq [s]  none [s]  % gain
--------  ----------  --------  ------
 10r-seq    0.3725      2.79     86.65
5r5w-seq    0.9725      5.53     82.41

# throughput (higher is better)
workload   bfq-mq [MB/s]  none [MB/s]   % gain
---------  -------------  -----------  -------
 10r-rand       394.806      429.735    -8.128
 10r-seq       1387.63      1431.81     -3.086
  1r-seq        838.13       798.872     4.914
5r5w-rand      1118.12      1297.46    -13.822
5r5w-seq       1187         1313.8      -9.651

Thanks,
Federico

[1] https://github.com/Algodev-github/bfq-mq/commits/bfq-mq
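
The "% gain" columns above are not defined in the message. The numbers are consistent with a gain taken relative to the "none" baseline, with the sign flipped for the "lower is better" start-up metric. A minimal sketch under that assumption (the function names are hypothetical):

```c
#include <assert.h>

/* Percent gain for a "lower is better" metric such as start-up time. */
static double gain_lower_is_better(double baseline, double candidate)
{
        assert(baseline > 0.0);
        return (baseline - candidate) / baseline * 100.0;
}

/* Percent gain for a "higher is better" metric such as throughput. */
static double gain_higher_is_better(double baseline, double candidate)
{
        assert(baseline > 0.0);
        return (candidate - baseline) / baseline * 100.0;
}
```

For instance, the 10r-seq start-up row corresponds to gain_lower_is_better(2.79, 0.3725), and the 10r-rand throughput row to gain_higher_is_better(429.735, 394.806).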
Paolo Valente Oct. 16, 2018, 4:26 p.m. UTC | #11
> On 16 Oct 2018, at 18:14, Federico Motta <federico@willer.it> wrote:
> 
> On 10/15/18 5:02 PM, Bart Van Assche wrote:
>> On Mon, 2018-10-15 at 16:10 +0200, Linus Walleij wrote:
>>> + * For blk-mq devices, we default to using:
>>> + * - "none" for multiqueue devices (nr_hw_queues != 1)
>>> + * - "bfq", if available, for single queue devices
>>> + * - "mq-deadline" if "bfq" is not available for single queue devices
>>> + * - "none" for single queue devices as well as last resort
>> 
>> For SATA SSDs nr_hw_queues == 1 so this patch will also affect these SSDs.
>> Since this patch is an attempt to improve performance, I'd like to see
>> measurement data for one or more recent SATA SSDs before a decision is
>> taken about what to do with this patch. 
>> 
>> Thanks,
>> 
>> Bart.
>> 
> 
> Hi,
> although these tests should be run for single-queue devices, I tried to
> run them on an NVMe high-performance device. Imho if results are good
> in such a "difficult to deal with" multi-queue device, they should be
> good enough also in a "simpler" single-queue storage device..
> 
> Testbed specs:
> kernel = 4.18.0 (from bfq dev branch [1], where bfq already contains
>                 also the commits that will be available from 4.20)
> fs     = ext4
> drive  = ssd samsung 960 pro NVMe m.2 512gb
> 
> Device data sheet specs state that under random IO:
> * QD  1 thread 1
>  * read  = 14 kIOPS
>  * write = 50 kIOPS
> * QD 32 thread 4
>  * read = write = 330 kIOPS
> 
> What follows is a results summary; under requests I can give all
> results. The workload notation (e.g. 5r5w-seq) means:
> - num_readers                  (5r)
> - num_writers                  (5w)
> - sequential_io or random_io   (-seq)
> 
> 
> # replayed gnome-terminal startup time (lower is better)
> workload  bfq-mq [s]  none [s]  % gain
> --------  ----------  --------  ------
> 10r-seq    0.3725      2.79     86.65
> 5r5w-seq    0.9725      5.53     82.41
> 
> # throughput (higher is better)
> workload   bfq-mq [mb/s]  none [mb/s]   % gain
> ---------  -------------  -----------  -------
> 10r-rand       394.806      429.735    -8.128
> 10r-seq       1387.63      1431.81     -3.086
>  1r-seq        838.13       798.872     4.914
> 5r5w-rand      1118.12      1297.46    -13.822
> 5r5w-seq       1187         1313.8      -9.651
> 

A little unexpectedly for me, throughput loss for random I/O is even
lower than what I have obtained with my nasty SATA SSD (and reported
in my public results).

I didn't expect that little loss with sequential parallel reads.
Probably, when going multiqueue, there are changes I haven't even
thought about (I have never even tested bfq on a multi-queue device).

Thanks,
Paolo

> Thanks,
> Federico
> 
> [1] https://github.com/Algodev-github/bfq-mq/commits/bfq-mq
Jens Axboe Oct. 16, 2018, 5:35 p.m. UTC | #12
On 10/15/18 1:44 PM, Paolo Valente wrote:
> 
> 
>> On 15 Oct 2018, at 21:26, Jens Axboe <axboe@kernel.dk> wrote:
>>
>> On 10/15/18 12:26 PM, Paolo Valente wrote:
>>>
>>>
>>>> On 15 Oct 2018, at 17:39, Jens Axboe <axboe@kernel.dk> wrote:
>>>>
>>>> On 10/15/18 8:10 AM, Linus Walleij wrote:
>>>>> This sets BFQ as the default scheduler for single queue
>>>>> block devices (nr_hw_queues == 1) if it is available. This
>>>>> affects notably MMC/SD-cards but also UBI and the loopback
>>>>> device.
>>>>>
>>>>> I have been running it for a while without any negative
>>>>> effects on my pet systems and I want some wider testing
>>>>> so let's throw it out there and see what people say.
>>>>> Admittedly my use cases are limited. I need to keep this
>>>>> patch around for my personal needs anyway.
>>>>>
>>>>> We take special care to avoid using BFQ on zoned devices
>>>>> (in particular SMR, shingled magnetic recording devices)
>>>>> as these currently require mq-deadline to group writes
>>>>> together.
>>>>>
>>>>> I have opted against introducing any default scheduler
>>>>> through Kconfig as the mq-deadline enforcement for
>>>>> zoned devices has to be done at runtime anyways and
>>>>> too many config options will make things confusing.
>>>>>
>>>>> My argument for setting a default policy in the kernel
>>>>> as opposed to user space is the "reasonable defaults"
>>>>> type, analogous to how we have one default CPU scheduling
>>>>> policy (CFS) that make most sense for most tasks, and
>>>>> how automatic process group scheduling happens in most
>>>>> distributions without userspace involvement. The BFQ
>>>>> scheduling policy makes most sense for single hardware
>>>>> queue devices and many embedded systems will not have
>>>>> the clever userspace tools (such as udev) to make an
>>>>> educated choice of scheduling policy. Defaults should be
>>>>> those that make most sense for the hardware.
>>>>
>>>> I still don't like this. There are going to be tons of
>>>> cases where the single queue device is some hw raid setup
>>>> or similar, where performance is going to be much worse with
>>>> BFQ than it is with mq-deadline, for instance. That's just
>>>> one case.
>>>>
>>>
>>> Hi Jens,
>>> in my RAID tests bfq performed as well as in non-RAID tests.  Probably
>>> you refer to the fact that, in a RAID configuration, IOPS can become
>>> very high.  But, if that is the case, then the response to your
>>> objections already emerged in the previous thread.  Let me sum it up
>>> again.
>>>
>>> I tested bfq on virtually every device in the range from a few hundred
>>> IOPS to 50-100KIOPS.  Then, through the public script I already
>>> mentioned, I found the maximum number of IOPS that bfq can handle:
>>> about 400K with a commodity CPU.
>>>
>>> In particular, in all my tests with real hardware, bfq performance
>>> - is not even comparable to that of any of the other schedulers, in
>>>  terms of responsiveness, latency for real-time applications, ability
>>>  to provide strong bandwidth guarantees, ability to boost throughput
>>>  while guaranteeing bandwidths;
>>> - is a little worse than the other schedulers for only one test, on
>>>  only some hardware: total throughput with random reads, where it may
>>>  lose up to 10-15% of throughput.  Of course, the schedulers that reach
>>>  a higher throughput leave the machine unusable during the test.
>>>
>>> So I really cannot see a reason why bfq could do worse than any of
>>> these other schedulers for some single-queue device (conservatively)
>>> below 300KIOPS.
>>>
>>> Finally, since, AFAICT, single-queue devices doing 400+ KIOPS are
>>> probably less than 1% of all the single-queue storage around (USB
>>> drives, HDDs, eMMC, standard SSDs, ...), by sticking to mq-deadline we
>>> are sacrificing 99% of the hardware, to help 1% of the hardware, for
>>> one kind of test cases.
>>
>> I should have been more clear - I'm not worried about IOPS overhead,
>> I'm worried about scheduling decisions that lower performance on
>> (for instance) raid composed of many drives (rotational or otherwise).
>>
>> If you have actual data (on what hardware, and what kind of tests)
>> to disprove that worry, then that's great, and I'd love to see that.
>>
> 
> Here are some old results with a very simple configuration:
> http://algo.ing.unimo.it/people/paolo/disk_sched/old-results/4.4.0-v7r11/
> http://algo.ing.unimo.it/people/paolo/disk_sched/old-results/3.14.0-v7r3/
> http://algo.ing.unimo.it/people/paolo/disk_sched/old-results/3.13.0-v7r2/
> 
> Then I stopped repeating tests that always yielded the same good results.
> 
> As for more professional systems, a well-known company doing
> real-time packet-traffic dumping asked me to modify bfq so as to
> guarantee lossless data writing also during queries.  The involved box
> had a RAID reaching a few Gbps, and everything worked well.
> 
> Anyway, if you have specific issues in mind, I can check more deeply.

Do you have anything more recent? All of these predate the current
code (by a lot), and aren't even mq. I'm mostly just interested in
a plain fast NVMe device, and a big box hardware raid setup with
a ton of drives.

I do still think that this should be going through the distros, they
need to be the ones driving this, as they will ultimately be the
ones getting customer reports on regressions. The qual/test cycle
they do is useful for this. In mainline, if we make a change like
this, we'll figure out if it worked many releases down the line.
Paolo Valente Oct. 17, 2018, 5:18 a.m. UTC | #13
> On 15 Oct 2018, at 20:34, Paolo Valente <paolo.valente@linaro.org> wrote:
> 
> 
> 
>> On 15 Oct 2018, at 17:02, Bart Van Assche <bvanassche@acm.org> wrote:
>> 
>> On Mon, 2018-10-15 at 16:10 +0200, Linus Walleij wrote:
>>> + * For blk-mq devices, we default to using:
>>> + * - "none" for multiqueue devices (nr_hw_queues != 1)
>>> + * - "bfq", if available, for single queue devices
>>> + * - "mq-deadline" if "bfq" is not available for single queue devices
>>> + * - "none" for single queue devices as well as last resort
>> 
>> For SATA SSDs nr_hw_queues == 1 so this patch will also affect these SSDs.
>> Since this patch is an attempt to improve performance, I'd like to see
>> measurement data for one or more recent SATA SSDs before a decision is
>> taken about what to do with this patch. 
>> 
> 
> Hi Bart,
> as I just wrote to Jens I don't think we need this test any longer.
> To save you one hop, I'll paste my reply to Jens below.
> 
> Anyway, it is very easy to do the tests you ask:
> - take a kernel containing the last bfq commits, such as for-next
> - do, e.g.,
> git clone https://github.com/Algodev-github/S.git
> cd S/run_multiple_benchmarks
> sudo ./run_main_benchmarks.sh "throughput replayed-startup" "bfq none"
> - compare results
> 

Two things:

1) By mistake, I put 'none' in the last command line above, but it should be mq-deadline:

sudo ./run_main_benchmarks.sh "throughput replayed-startup" "bfq mq-deadline"

2) If you are worried about wearing your device with writes, then just append 'raw' to the last command line. So:

sudo ./run_main_benchmarks.sh "throughput replayed-startup" "bfq mq-deadline" raw

'raw' means: "don't even create files for the background traffic, but just read raw sectors".

Thanks,
Paolo

> Of course, do not do it for multi-queue devices, or for single-queue
> devices on steroids that do 400-500 KIOPS.
> 
> I'll see if I can convince someone to repeat these tests with a recent
> SSD.
> 
> And here is again my reply to Jens, which I think holds for your repeated
> objection too.
> 
> I tested bfq on virtually every device in the range from a few hundred
> IOPS to 50-100KIOPS.  Then, through the public script I already
> mentioned, I found the maximum number of IOPS that bfq can handle:
> about 400K with a commodity CPU.
> 
> In particular, in all my tests with real hardware, bfq performance
> - is not even comparable to that of any of the other schedulers, in
> terms of responsiveness, latency for real-time applications, ability
> to provide strong bandwidth guarantees, ability to boost throughput
> while guaranteeing bandwidths;
> - is a little worse than the other schedulers for only one test, on
> only some hardware: total throughput with random reads, where it may
> lose up to 10-15% of throughput.  Of course, the schedulers that reach
> a higher throughput leave the machine unusable during the test.
> 
> So I really cannot see a reason why bfq could do worse than any of
> these other schedulers for some single-queue device (conservatively)
> below 300KIOPS.
> 
> Finally, since, AFAICT, single-queue devices doing 400+ KIOPS are
> probably less than 1% of all the single-queue storage around (USB
> drives, HDDs, eMMC, standard SSDs, ...), by sticking to mq-deadline we
> are sacrificing 99% of the hardware, to help 1% of the hardware for
> one kind of test cases.
> 
> Thanks,
> Paolo
> 
>> Thanks,
>> 
>> Bart.
>> 
> 
Jan Kara Oct. 17, 2018, 10:05 a.m. UTC | #14
On Tue 16-10-18 11:35:59, Jens Axboe wrote:
> On 10/15/18 1:44 PM, Paolo Valente wrote:
> > Here are some old results with a very simple configuration:
> > http://algo.ing.unimo.it/people/paolo/disk_sched/old-results/4.4.0-v7r11/
> > http://algo.ing.unimo.it/people/paolo/disk_sched/old-results/3.14.0-v7r3/
> > http://algo.ing.unimo.it/people/paolo/disk_sched/old-results/3.13.0-v7r2/
> > 
> > Then I stopped repeating tests that always yielded the same good results.
> > 
> > As for more professional systems, a well-known company doing
> > real-time packet-traffic dumping asked me to modify bfq so as to
> > guarantee lossless data writing also during queries.  The involved box
> > had a RAID reaching a few Gbps, and everything worked well.
> > 
> > Anyway, if you have specific issues in mind, I can check more deeply.
> 
> Do you have anything more recent? All of these predate the current
> code (by a lot), and isn't even mq. I'm mostly just interested in
> plain fast NVMe device, and a big box hardware raid setup with
> a ton of drives.
> 
> I do still think that this should be going through the distros, they
> need to be the ones driving this, as they will ultimately be the
> ones getting customer reports on regressions. The qual/test cycle
> they do is useful for this. In mainline, if we make a change like
> this, we'll figure out if it worked many releases down the line.

Well, the problem with this is that big distro people really don't care
much because they already use udev for tuning the IO scheduler. So whatever
defaults the kernel is going to pick likely won't be seen by distro
customers. Embedded people seem to be driving this effort because they
either don't run udev or they feel not all their teams building new
products have enough expertise to come up with a proper set of rules...

								Honza
Bart Van Assche Oct. 17, 2018, 2:48 p.m. UTC | #15
On 10/17/18 3:05 AM, Jan Kara wrote:
> Well, the problem with this is that big distro people really don't care
> much because they already use udev for tuning the IO scheduler. So whatever
> defaults the kernel is going to pick likely won't be seen by distro
> customers. Embedded people seem to be driving this effort because they
> either don't run udev or they feel not all their teams building new
> products have enough expertise to come up with a proper set of rules...

What's missing in this discussion is a definition of "embedded system". 
Is that a system like a streaming player for TV channels that neither 
has a keyboard nor a display or a system that can run multiple apps 
simultaneously like a smartphone? I think the difference matters because 
some embedded devices hardly do any background I/O nor load any 
executable code from storage after boot. So at least for some embedded 
devices the problem discussed in this e-mail thread does not exist.

Bart.
Bryan Gurney Oct. 17, 2018, 2:59 p.m. UTC | #16
On Wed, Oct 17, 2018 at 10:48 AM, Bart Van Assche <bvanassche@acm.org> wrote:
> On 10/17/18 3:05 AM, Jan Kara wrote:
>>
>> Well, the problem with this is that big distro people really don't care
>> much because they already use udev for tuning the IO scheduler. So
>> whatever
>> defaults the kernel is going to pick likely won't be seen by distro
>> customers. Embedded people seem to be driving this effort because they
>> either don't run udev or they feel not all their teams building new
>> products have enough expertise to come up with a proper set of rules...
>
>
> What's missing in this discussion is a definition of "embedded system". Is
> that a system like a streaming player for TV channels that neither has a
> keyboard nor a display or a system that can run multiple apps simultaneously
> like a smartphone? I think the difference matters because some embedded
> devices hardly do any background I/O nor load any executable code from
> storage after boot. So at least for some embedded devices the problem
> discussed in this e-mail thread does not exist.
>
> Bart.

There are high-performance embedded systems on the market (NAS, etc.).

I feel strongly about the prevention of users running into errors
because of an incorrect scheduler default, because I encountered that
situation three times in my testing with zoned block devices.  The
switch to SCSI_MQ would resolve that, since mq-deadline is the
default, but in my case, I was using Fedora 28, which disables
CONFIG_SCSI_MQ_DEFAULT (which is enabled in the 4.18 kernel), so my
default scheduler was cfq.

Hopefully there aren't any other cases where choosing the "wrong
default scheduler" leads to errors.  Ideally the default scheduler
choice should prevent any errors, leaving it up to the distros to
configure a default via other methods, to optimize for performance.


Thanks,

Bryan
Mark Brown Oct. 17, 2018, 4:01 p.m. UTC | #17
On Wed, Oct 17, 2018 at 07:48:33AM -0700, Bart Van Assche wrote:
> On 10/17/18 3:05 AM, Jan Kara wrote:

> > Well, the problem with this is that big distro people really don't care
> > much because they already use udev for tuning the IO scheduler. So whatever
> > defaults the kernel is going to pick likely won't be seen by distro
> > customers. Embedded people seem to be driving this effort because they
> > either don't run udev or they feel not all their teams building new
> > products have enough expertise to come up with a proper set of rules...

> What's missing in this discussion is a definition of "embedded system". Is
> that a system like a streaming player for TV channels that neither has a
> keyboard nor a display or a system that can run multiple apps simultaneously
> like a smartphone? I think the difference matters because some embedded
> devices hardly do any background I/O nor load any executable code from
> storage after boot. So at least for some embedded devices the problem
> discussed in this e-mail thread does not exist.

It's a combination of things. Smartphones are definitely part of the
target audience, but other things can be affected: I'd guess your
streaming TV player example can have issues if it's got local storage
and downloads things in the background, for example.  There are definitely
systems that never really use storage once they're booted, but there are
also things that move data around and/or have interactive apps.  Even
with some of the things that don't really use storage at runtime it can
be important to help cut down boot times.
Jens Axboe Oct. 17, 2018, 4:29 p.m. UTC | #18
On 10/17/18 4:05 AM, Jan Kara wrote:
> On Tue 16-10-18 11:35:59, Jens Axboe wrote:
>> On 10/15/18 1:44 PM, Paolo Valente wrote:
>>> Here are some old results with a very simple configuration:
>>> http://algo.ing.unimo.it/people/paolo/disk_sched/old-results/4.4.0-v7r11/
>>> http://algo.ing.unimo.it/people/paolo/disk_sched/old-results/3.14.0-v7r3/
>>> http://algo.ing.unimo.it/people/paolo/disk_sched/old-results/3.13.0-v7r2/
>>>
>>> Then I stopped repeating tests that always yielded the same good results.
>>>
>>> As for more professional systems, a well-known company doing
>>> real-time packet-traffic dumping asked me to modify bfq so as to
>>> guarantee lossless data writing also during queries.  The involved box
>>> had a RAID reaching a few Gbps, and everything worked well.
>>>
>>> Anyway, if you have specific issues in mind, I can check more deeply.
>>
>> Do you have anything more recent? All of these predate the current
>> code (by a lot), and isn't even mq. I'm mostly just interested in
>> plain fast NVMe device, and a big box hardware raid setup with
>> a ton of drives.
>>
>> I do still think that this should be going through the distros, they
>> need to be the ones driving this, as they will ultimately be the
>> ones getting customer reports on regressions. The qual/test cycle
>> they do is useful for this. In mainline, if we make a change like
>> this, we'll figure out if it worked many releases down the line.
> 
> Well, the problem with this is that big distro people really don't care
> much because they already use udev for tuning the IO scheduler. So whatever
> defaults the kernel is going to pick likely won't be seen by distro
> customers. Embedded people seem to be driving this effort because they
> either don't run udev or they feel not all their teams building new
> products have enough expertise to come up with a proper set of rules...

Which is also the approach that I've been advocating for here, instead
of a kernel patch...
Jan Kara Oct. 18, 2018, 7:21 a.m. UTC | #19
On Wed 17-10-18 10:29:22, Jens Axboe wrote:
> On 10/17/18 4:05 AM, Jan Kara wrote:
> > On Tue 16-10-18 11:35:59, Jens Axboe wrote:
> >> On 10/15/18 1:44 PM, Paolo Valente wrote:
> >>> Here are some old results with a very simple configuration:
> >>> http://algo.ing.unimo.it/people/paolo/disk_sched/old-results/4.4.0-v7r11/
> >>> http://algo.ing.unimo.it/people/paolo/disk_sched/old-results/3.14.0-v7r3/
> >>> http://algo.ing.unimo.it/people/paolo/disk_sched/old-results/3.13.0-v7r2/
> >>>
> >>> Then I stopped repeating tests that always yielded the same good results.
> >>>
> >>> As for more professional systems, a well-known company doing
> >>> real-time packet-traffic dumping asked me to modify bfq so as to
> >>> guarantee lossless data writing also during queries.  The involved box
> >>> had a RAID reaching a few Gbps, and everything worked well.
> >>>
> >>> Anyway, if you have specific issues in mind, I can check more deeply.
> >>
> >> Do you have anything more recent? All of these predate the current
> >> code (by a lot), and isn't even mq. I'm mostly just interested in
> >> plain fast NVMe device, and a big box hardware raid setup with
> >> a ton of drives.
> >>
> >> I do still think that this should be going through the distros, they
> >> need to be the ones driving this, as they will ultimately be the
> >> ones getting customer reports on regressions. The qual/test cycle
> >> they do is useful for this. In mainline, if we make a change like
> >> this, we'll figure out if it worked many releases down the line.
> > 
> > Well, the problem with this is that big distro people really don't care
> > much because they already use udev for tuning the IO scheduler. So whatever
> > defaults the kernel is going to pick likely won't be seen by distro
> > customers. Embedded people seem to be driving this effort because they
> > either don't run udev or they feel not all their teams building new
> > products have enough expertise to come up with a proper set of rules...
> 
> Which is also the approach that I've been advocating for here, instead
> of a kernel patch...

I know you've been advocating the use of udev for IO scheduler selection.
But do you want to force everybody to use udev? And for people who build
their own (usually small) systems, do you want to force them to think about
IO scheduler selection and writing appropriate rules? These are the
problems people were mentioning and I'm not sure what your opinion on
this is.

								Honza
Jens Axboe Oct. 18, 2018, 2:35 p.m. UTC | #20
On 10/18/18 1:21 AM, Jan Kara wrote:
> On Wed 17-10-18 10:29:22, Jens Axboe wrote:
>> On 10/17/18 4:05 AM, Jan Kara wrote:
>>> On Tue 16-10-18 11:35:59, Jens Axboe wrote:
>>>> On 10/15/18 1:44 PM, Paolo Valente wrote:
>>>>> Here are some old results with a very simple configuration:
>>>>> http://algo.ing.unimo.it/people/paolo/disk_sched/old-results/4.4.0-v7r11/
>>>>> http://algo.ing.unimo.it/people/paolo/disk_sched/old-results/3.14.0-v7r3/
>>>>> http://algo.ing.unimo.it/people/paolo/disk_sched/old-results/3.13.0-v7r2/
>>>>>
>>>>> Then I stopped repeating tests that always yielded the same good results.
>>>>>
>>>>> As for more professional systems, a well-known company doing
>>>>> real-time packet-traffic dumping asked me to modify bfq so as to
>>>>> guarantee lossless data writing also during queries.  The involved box
>>>>> had a RAID reaching a few Gbps, and everything worked well.
>>>>>
>>>>> Anyway, if you have specific issues in mind, I can check more deeply.
>>>>
>>>> Do you have anything more recent? All of these predate the current
>>>> code (by a lot), and isn't even mq. I'm mostly just interested in
>>>> plain fast NVMe device, and a big box hardware raid setup with
>>>> a ton of drives.
>>>>
>>>> I do still think that this should be going through the distros, they
>>>> need to be the ones driving this, as they will ultimately be the
>>>> ones getting customer reports on regressions. The qual/test cycle
>>>> they do is useful for this. In mainline, if we make a change like
>>>> this, we'll figure out if it worked many releases down the line.
>>>
>>> Well, the problem with this is that big distro people really don't care
>>> much because they already use udev for tuning the IO scheduler. So whatever
>>> defaults the kernel is going to pick likely won't be seen by distro
>>> customers. Embedded people seem to be driving this effort because they
>>> either don't run udev or they feel not all their teams building new
>>> products have enough expertise to come up with a proper set of rules...
>>
>> Which is also the approach that I've been advocating for here, instead
>> of a kernel patch...
> 
> I know you've been advocating the use of udev for IO scheduler selection.
> But do you want to force everybody to use udev? And for people who build
> their own (usually small) systems, do you want to force them to think about
> IO scheduler selection and writing appropriate rules? These are the
> problems people were mentioning and I'm not sure what is your opinion on
> this.

I don't want to force everybody to use udev, use whatever you like on
your platform. For most people that is udev, for embedded it's something
else. As you said, distros already do this via udev. When I've had to
do it on my systems, I've added a udev rule to do it.

My opinion is that the kernel makes various schedulers available.
Deciding which one to use is policy that should go into user space.
The default should be something that's solid and works, fancier
setups and tuning should be left to user space.
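
For systems without udev, the user-space side of this amounts to reading and writing the queue/scheduler attribute in sysfs, where the active scheduler is shown in square brackets. A minimal sketch; the helper names are hypothetical and the device name in the comment is only an example:

```c
#include <assert.h>
#include <stdio.h>
#include <string.h>

/*
 * Copy the bracketed (active) entry from text such as
 * "mq-deadline kyber [bfq] none" into out. Returns 0 on success, or -1
 * if there is no bracketed entry or it does not fit in out.
 */
static int active_scheduler(const char *text, char *out, size_t outlen)
{
        const char *lb, *rb;
        size_t len;

        assert(text && out);

        lb = strchr(text, '[');
        rb = lb ? strchr(lb, ']') : NULL;
        if (!lb || !rb)
                return -1;

        len = (size_t)(rb - lb - 1);
        if (len + 1 > outlen)
                return -1;

        memcpy(out, lb + 1, len);
        out[len] = '\0';
        return 0;
}

/*
 * Write a scheduler name to a scheduler attribute, for example
 * "/sys/block/mmcblk0/queue/scheduler". Returns 0 on success, -1 on error.
 */
static int set_scheduler(const char *attr_path, const char *name)
{
        FILE *f = fopen(attr_path, "w");
        int ok;

        if (!f)
                return -1;

        ok = fputs(name, f) >= 0;
        if (fclose(f) != 0)
                ok = 0;
        return ok ? 0 : -1;
}
```

Writing the attribute normally requires root, and the write fails if the named scheduler is not built in or loaded.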
Pavel Machek Oct. 19, 2018, 8:22 a.m. UTC | #21
Hi!

> >> Which is also the approach that I've been advocating for here, instead
> >> of a kernel patch...
> > 
> > I know you've been advocating the use of udev for IO scheduler selection.
> > But do you want to force everybody to use udev? And for people who build
> > their own (usually small) systems, do you want to force them to think about
> > IO scheduler selection and writing appropriate rules? These are the
> > problems people were mentioning and I'm not sure what is your opinion on
> > this.
> 
> I don't want to force everybody to use udev, use whatever you like on
> your platform. For most people that is udev, for embedded it's something
> else. As you said, distros already do this via udev. When I've had to
> do it on my systems, I've added a udev rule to do it.

This is not really helpful.

So you want me and everyone else and everyone on embedded to mess with
udev? No, thanks.

There are people booting with init=/bin/bash, too, running fsck. Wouldn't
it be nice to use reasonable schedulers there?

> My opinion is that the kernel makes various schedulers available.
> Deciding which one to use is policy that should go into user space.
> The default should be something that's solid and works, fancier
> setups and tuning should be left to user space.

The kernel should do the reasonable thing by default, and it seems to be
easy in this case.

You keep repeating "but someone's super fast raid might get slowed
down". Those 5 people in the world probably already have their udev
rules.

Now, let's do the right thing by default for the rest of the world,
including you.

									Pavel
Linus Walleij Oct. 19, 2018, 8:33 a.m. UTC | #22
On Mon, Oct 15, 2018 at 4:32 PM Oleksandr Natalenko
<oleksandr@natalenko.name> wrote:
> On 15.10.2018 16:10, Linus Walleij wrote:

> > +     /*
> > +      * Zoned devices must use a deadline scheduler because currently
> > +      * that is the only scheduler respecting zoned writes.
> > +      */
> > +     if (blk_queue_is_zoned(q))
> > +             policy = "mq-deadline";
> > +     else if (IS_ENABLED(CONFIG_IOSCHED_BFQ))
> > +             policy = "bfq";
> > +     else
> > +             policy = "mq-deadline";
>
> If more rules are needed in the future, shall we just add extra ifs,
> or would it be better to craft some struct/table now + policy search
> helper?

Let's do it when it happens. Premature optimization is the root
of all evil ;)

Yours,
Linus Walleij
Linus Walleij Oct. 19, 2018, 8:42 a.m. UTC | #23
On Wed, Oct 17, 2018 at 4:59 PM Bryan Gurney <bgurney@redhat.com> wrote:

> I feel strongly about the prevention of users running into errors
> because of an incorrect scheduler default, because I encountered that
> situation three times in my testing with zoned block devices. The
> switch to SCSI_MQ would resolve that, since mq-deadline is the
> default, but in my case, I was using Fedora 28, which disables
> CONFIG_SCSI_MQ_DEFAULT (which is enabled in the 4.18 kernel), so my
> default scheduler was cfq.

I think we should make a patch to the kernel that makes it
impossible (even from sysfs) to choose a non-zone aware
scheduler for these devices.

It's another topic than $SUBJECT patch though. I take this
into account in this version.

Yours,
Linus Walleij
Oleksandr Natalenko Oct. 19, 2018, 9:26 a.m. UTC | #24
Hi.

On 19.10.2018 10:33, Linus Walleij wrote:
>> > +     /*
>> > +      * Zoned devices must use a deadline scheduler because currently
>> > +      * that is the only scheduler respecting zoned writes.
>> > +      */
>> > +     if (blk_queue_is_zoned(q))
>> > +             policy = "mq-deadline";
>> > +     else if (IS_ENABLED(CONFIG_IOSCHED_BFQ))
>> > +             policy = "bfq";
>> > +     else
>> > +             policy = "mq-deadline";
>> 
>> If more rules are needed in the future, shall we just add extra ifs,
>> or would it be better to craft some struct/table now + policy search
>> helper?
> 
> Let's do it when it happens. Premature optimization is the root
> of all evil ;)

I'd say this is a matter of code readability, not optimisation. I do
not strongly object to the current approach, though.
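
To make the struct/table alternative concrete, here is a user-space sketch of what a rule table plus search helper could look like. It is only a model under stated assumptions: `struct policy_ctx`, `struct policy_rule`, the `match_*` functions and `default_policy()` are hypothetical names, and the two booleans stand in for `blk_queue_is_zoned(q)` and `IS_ENABLED(CONFIG_IOSCHED_BFQ)` from the patch.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>

/* Hypothetical stand-in for the inputs the patch consults. */
struct policy_ctx {
	bool zoned;		/* models blk_queue_is_zoned(q) */
	bool bfq_enabled;	/* models IS_ENABLED(CONFIG_IOSCHED_BFQ) */
};

/* One rule: a predicate and the scheduler name chosen when it matches. */
struct policy_rule {
	bool (*match)(const struct policy_ctx *ctx);
	const char *policy;
};

static bool match_zoned(const struct policy_ctx *ctx) { return ctx->zoned; }
static bool match_bfq(const struct policy_ctx *ctx) { return ctx->bfq_enabled; }
static bool match_any(const struct policy_ctx *ctx) { (void)ctx; return true; }

/* Rules are evaluated in order; the first match wins. */
static const struct policy_rule default_rules[] = {
	{ match_zoned, "mq-deadline" },
	{ match_bfq,   "bfq" },
	{ match_any,   "mq-deadline" },
};

/* Search helper: returns the scheduler name for the first matching rule. */
static const char *default_policy(const struct policy_ctx *ctx)
{
	size_t i;

	assert(ctx);
	for (i = 0; i < sizeof(default_rules) / sizeof(default_rules[0]); i++)
		if (default_rules[i].match(ctx))
			return default_rules[i].policy;

	return NULL; /* not reached while match_any is the last rule */
}
```

The table keeps the same ordering as the if/else chain in the patch, so adding a rule means adding one row rather than another branch.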
Paolo Valente Oct. 19, 2018, 10:59 a.m. UTC | #25
> Il giorno 15 ott 2018, alle ore 20:26, Paolo Valente <paolo.valente@linaro.org> ha scritto:
> 
> ...
>> This kind of policy does not belong in the kernel, at least
>> not in the current form. If we had some sort of "enable best
>> options for a desktop" then it could fall under that umbrella.
>> 
> 
> I don't think bfq can be considered a scheduler for only desktops any
> longer.
> 

Hi Jens,
this reply of mine went on bugging me, until I understood my mistake.

The fact that I consider bfq good also for servers *does not* imply
that having bfq in desktops is to be refused.

As for the option that you are hinting at, I also acknowledge that it
would be trivial for an admin/developer to know whether a given kernel
is meant for a desktop/personal system, while it is more difficult to
choose explicitly among the various I/O schedulers available.

So, I apologize for my shortsighted initial reply, and ask if you can
elaborate a little more on this.  I'm willing to help, if I can.

Thanks,
Paolo

> Thanks,
> Paolo
> 
>> -- 
>> Jens Axboe
Bryan Gurney Oct. 19, 2018, 1:36 p.m. UTC | #26
On Fri, Oct 19, 2018 at 4:42 AM, Linus Walleij <linus.walleij@linaro.org> wrote:
>
> On Wed, Oct 17, 2018 at 4:59 PM Bryan Gurney <bgurney@redhat.com> wrote:
>
> > I feel strongly about the prevention of users running into errors
> > because of an incorrect scheduler default, because I encountered that
> > situation three times in my testing with zoned block devices. The
> > switch to SCSI_MQ would resolve that, since mq-deadline is the
> > default, but in my case, I was using Fedora 28, which disables
> > CONFIG_SCSI_MQ_DEFAULT (which is enabled in the 4.18 kernel), so my
> > default scheduler was cfq.
>
> I think we should make a patch to the kernel that makes it
> impossible (even from sysfs) to choose a non-zone aware
> scheduler for these devices.
>
> It's another topic than $SUBJECT patch though. I take this
> into account in this version.
>

I like this idea.  I don't have enough experience to write this patch
myself, but I imagine something like adding "bool is_zoned_aware" to
"struct elevator_type", and setting that true only for the schedulers
that are currently zoned-device aware (which is currently deadline on
single queue, mq-deadline on blk-mq).


Thanks,

Bryan
Johannes Thumshirn Oct. 19, 2018, 1:44 p.m. UTC | #27
On 19/10/18 15:36, Bryan Gurney wrote:
> I like this idea.  I don't have enough experience to write this patch
> myself, but I imagine something like adding "bool is_zoned_aware" to
> "struct elevator_type", and setting that true only for the schedulers
> that are currently zoned-device aware (which is currently deadline on
> single queue, mq-deadline on blk-mq).

I don't think this is needed currently, as a) Jens is working on getting
rid of the legacy path, which leaves us with mq-deadline only, and b) Linus'
patch has:

+	if (blk_queue_is_zoned(q))
+		policy = "mq-deadline";

Which chooses mq-deadline on a zoned device.

So nothing to worry about here now.

All this only given Linus' patch actually gets merged.

Byte,
	Johannes
Bryan Gurney Oct. 19, 2018, 2:16 p.m. UTC | #28
On Fri, Oct 19, 2018 at 9:44 AM, Johannes Thumshirn <jthumshirn@suse.de> wrote:
> On 19/10/18 15:36, Bryan Gurney wrote:
>> I like this idea.  I don't have enough experience to write this patch
>> myself, but I imagine something like adding "bool is_zoned_aware" to
>> "struct elevator_type", and setting that true only for the schedulers
>> that are currently zoned-device aware (which is currently deadline on
>> single queue, mq-deadline on blk-mq).
>
> I don't think this is needed currently as a) Jens is working on getting
> rid of the legacy path,

Once the legacy schedulers are gone, the default (prior to Linus'
proposed patch) will be mq-deadline, which is zoned-device-aware.  So
the default scheduler will be "safer" for zoned devices.

However, it will still be possible for users (or distro defaults) to
select a non-zoned-aware scheduler, such as "none", "kyber", or "bfq"
(prior to this patch).  So there would still be a window for users to
encounter the same problems I found when aborted commands start
occurring during otherwise normal filesystem or storage activity, by
drivers that are otherwise compliant with the handling characteristics
of zoned block devices.

> which leaves us with mq-deadline only and Linus'
> patch has:
>
> +       if (blk_queue_is_zoned(q))
> +               policy = "mq-deadline";
>
> Which chooses mq-deadline on a zoned device.
>
> So nothing to worry about here now.
>
> All this only given Linus' patch actually gets merged.

I hope it does get merged.  I keep forgetting to save my "zoned
devices use deadline" udev rule on my SMR drive test machine in
between reinstalls.


Thanks,

Bryan
Jens Axboe Oct. 22, 2018, 8:08 a.m. UTC | #29
On 10/19/18 2:22 AM, Pavel Machek wrote:
> Hi!
> 
>>>> Which is also the approach that I've been advocating for here, instead
>>>> of a kernel patch...
>>>
>>> I know you've been advocating the use of udev for IO scheduler selection.
>>> But do you want to force everybody to use udev? And for people who build
>>> their own (usually small) systems, do you want to force them to think about
>>> IO scheduler selection and writing appropriate rules? These are the
>>> problems people were mentioning and I'm not sure what is your opinion on
>>> this.
>>
>> I don't want to force everybody to use udev, use whatever you like on
>> your platform. For most people that is udev, for embedded it's something
>> else. As you said, distros already do this via udev. When I've had to
>> do it on my systems, I've added a udev rule to do it.
> 
> This is not really helpful.
> 
> So you want me and everyone else and everyone on embedded to mess with
> udev? No, thanks.

Did you read what I wrote?

> There are people booting with init=/bin/bash, too, running fsck. Wouldn't
> it be nice to use reasonable schedulers there?

I can pretty much guarantee that fsck will run the same speed,
regardless of scheduler. And users generally don't care about
ultimate fairness on the device while running fsck...

If you (or someone else) don't want to use udev, use whatever
you want. You're doing something heavily customized at that
point anyway, surely this isn't a show stopper.

>> My opinion is that the kernel makes various schedulers available.
>> Deciding which one to use is policy that should go into user space.
>> The default should be something that's solid and works, fancier
>> setups and tuning should be left to user space.
> 
> The kernel should do the reasonable thing by default, and it seems to be
> easy in this case.

I agree, we just differ on what we consider the reasonable choice to
be.
Jens Axboe Oct. 22, 2018, 8:12 a.m. UTC | #30
On 10/19/18 2:42 AM, Linus Walleij wrote:
> On Wed, Oct 17, 2018 at 4:59 PM Bryan Gurney <bgurney@redhat.com> wrote:
> 
>> I feel strongly about the prevention of users running into errors
>> because of an incorrect scheduler default, because I encountered that
>> situation three times in my testing with zoned block devices. The
>> switch to SCSI_MQ would resolve that, since mq-deadline is the
>> default, but in my case, I was using Fedora 28, which disables
>> CONFIG_SCSI_MQ_DEFAULT (which is enabled in the 4.18 kernel), so my
>> default scheduler was cfq.
> 
> I think we should make a patch to the kernel that makes it
> impossible (even from sysfs) to choose a non-zone aware
> scheduler for these devices.
> 
> It's another topic than $SUBJECT patch though. I take this
> into account in this version.

Yes I agree, and I'd be happy to take such a patch. The only matching we
do now is mq-sched for mq-device, and vice versa.  And that will be
going away in 4.21, when there are no more !mq devices that use
scheduling.

If your device is zoned, then you should not be able to switch to a
scheduler that doesn't have support for that. The right approach here
would be to add a capability flag to the IO schedulers.
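
As a rough illustration of the capability-flag idea, the sketch below models it in plain user-space C. Everything in it is an assumption made for the example: the flag name `ELV_FEAT_ZONED_SEQ_WRITE`, the simplified `struct sched_type` and `struct queue_model`, and the helper names are hypothetical and are not taken from the kernel tree.

```c
#include <assert.h>
#include <errno.h>
#include <stdbool.h>

/* Hypothetical feature bit: scheduler preserves write order on zoned devices. */
#define ELV_FEAT_ZONED_SEQ_WRITE (1U << 0)

/* Simplified model of a scheduler descriptor carrying a feature mask. */
struct sched_type {
	const char *name;
	unsigned int features;
};

/* Simplified model of the request queue properties that matter here. */
struct queue_model {
	bool zoned;
};

/* Features a queue requires from any scheduler attached to it. */
static unsigned int queue_required_features(const struct queue_model *q)
{
	return q->zoned ? ELV_FEAT_ZONED_SEQ_WRITE : 0;
}

/* Returns 0 if switching q to scheduler e is allowed, -EINVAL otherwise. */
static int check_sched_switch(const struct queue_model *q,
			      const struct sched_type *e)
{
	unsigned int required;

	assert(q && e);
	required = queue_required_features(q);
	if ((e->features & required) != required)
		return -EINVAL;

	return 0;
}
```

With a check like this on the switch path, a write to the sysfs `scheduler` attribute naming a scheduler without the flag would fail for a zoned queue, while non-zoned queues would be unaffected.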
Jens Axboe Oct. 22, 2018, 8:21 a.m. UTC | #31
On 10/19/18 4:59 AM, Paolo Valente wrote:
> 
> 
>> Il giorno 15 ott 2018, alle ore 20:26, Paolo Valente <paolo.valente@linaro.org> ha scritto:
>>
>> ...
>>> This kind of policy does not belong in the kernel, at least
>>> not in the current form. If we had some sort of "enable best
>>> options for a desktop" then it could fall under that umbrella.
>>>
>>
>> I don't think bfq can be considered a scheduler for only desktops any
>> longer.
>>
> 
> Hi Jens,
> this reply of mine went on bugging me, until I understood my mistake.
> 
> The fact that I consider bfq good also for servers *does not* imply
> that having bfq in desktops is to be refused.
> 
> As for the option that you are hinting at, I also acknowledge that it
> would be trivial for an admin/developer to know whether a given kernel
> is meant for a desktop/personal system, while it is more difficult to
> choose explicitly among the various I/O schedulers available.
> 
> So, I apologize for my shortsighted initial reply, and ask if you can
> elaborate a little more on this.  I'm willing to help, if I can.

I think I've written about this multiple times now, but for me it
really just boils down to sane default, and policy in the kernel.
BFQ is very complicated, about 10K lines of code. I'm not comfortable
making that the default right now - as I've mentioned in other
replies, I think something like that should be driven by the distros
as they will ultimately be the ones that usually get complaints
about behavioral changes that impact performance adversely. This isn't
just about running some benchmarks and calling it a day.

Maybe some day we can make it the default on mq for single queue
devices, but I just don't think we are there yet in terms of
coverage. 

While I don't work for a distro anymore, I do have my hands dirty
with a fairly substantial deployment at work. There we run mq-deadline
on single queue devices, and kyber on multiqueue capable devices.
Oleksandr Natalenko Nov. 2, 2018, 10:40 a.m. UTC | #32
Hi.

On 16.10.2018 19:35, Jens Axboe wrote:
> Do you have anything more recent? All of these predate the current
> code (by a lot), and isn't even mq. I'm mostly just interested in
> plain fast NVMe device, and a big box hardware raid setup with
> a ton of drives.
> 
> I do still think that this should be going through the distros, they
> need to be the ones driving this, as they will ultimately be the
> ones getting customer reports on regressions. The qual/test cycle
> they do is useful for this. In mainline, if we make a change like
> this, we'll figure out if it worked many releases down the line.

Some benchmarks here for a non-RAID setup, obtained with the S suite. This
is from a Lenovo T460s with a SAMSUNG MZNTY256HDHP-000L7 SSD, running a
v4.19 kernel with all recent BFQ patches applied.

# replayed gnome terminal startup throughput
# Workload                   bfq         mq-deadline
   0r-raw_seq             13.2617             13.4867
   10r-raw_seq            512.507              539.95

# replayed gnome terminal startup time
# Workload                   bfq         mq-deadline
   0r-raw_seq                0.43                 0.4
   10r-raw_seq               0.685             4.1625

# replayed lowriter startup throughput
# Workload                   bfq         mq-deadline
   0r-raw_seq               9.985              10.375
   10r-raw_seq             516.62              539.61

# replayed lowriter startup time
# Workload                   bfq         mq-deadline
   0r-raw_seq                 0.4              0.3875
   10r-raw_seq              0.535              2.3875

# replayed xterm startup throughput
# Workload                   bfq         mq-deadline
   0r-raw_seq             5.93833             6.10834
   10r-raw_seq            524.447             539.991

# replayed xterm startup time
# Workload                   bfq         mq-deadline
   0r-raw_seq                0.23                0.23
   10r-raw_seq               0.38                1.56

# throughput
# Workload                   bfq         mq-deadline
   10r-raw_rand           362.446             363.817
   10r-raw_seq            537.646             540.609
   1r-raw_seq             500.733             502.526

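Throughput-wise, BFQ is on par with mq-deadline. Latency-wise, BFQ is
much better: in the 10r-raw_seq workload, startup times are 0.685 vs
4.1625 (gnome terminal), 0.535 vs 2.3875 (lowriter) and 0.38 vs 1.56
(xterm).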
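
For anyone who wants the latency difference as a ratio, the small sketch below divides the mq-deadline startup time by the BFQ one for the 10r-raw_seq rows. The numbers are copied from the tables above (their unit is not stated there); the struct and function names are made up for the example. The ratios come out at roughly 6.1, 4.5 and 4.1 for gnome terminal, lowriter and xterm respectively.

```c
#include <assert.h>

/* One row of the startup-time tables above (values copied verbatim). */
struct startup_row {
	const char *app;
	double bfq;
	double mq_deadline;
};

/* The 10r-raw_seq rows of the three "startup time" tables. */
static const struct startup_row rows_10r_raw_seq[] = {
	{ "gnome terminal", 0.685, 4.1625 },
	{ "lowriter",       0.535, 2.3875 },
	{ "xterm",          0.38,  1.56   },
};

/* How many times longer startup takes under mq-deadline than under bfq. */
static double slowdown(const struct startup_row *r)
{
	assert(r && r->bfq > 0.0);
	return r->mq_deadline / r->bfq;
}
```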

Patch

diff --git a/block/elevator.c b/block/elevator.c
index 8fdcd64ae12e..6e6048ca3471 100644
--- a/block/elevator.c
+++ b/block/elevator.c
@@ -948,13 +948,16 @@  int elevator_switch_mq(struct request_queue *q,
 }
 
 /*
- * For blk-mq devices, we default to using mq-deadline, if available, for single
- * queue devices.  If deadline isn't available OR we have multiple queues,
- * default to "none".
+ * For blk-mq devices, we default to using:
+ * - "none" for multiqueue devices (nr_hw_queues != 1)
+ * - "bfq", if available, for single queue devices
+ * - "mq-deadline" if "bfq" is not available for single queue devices
+ * - "none" for single queue devices as well as last resort
  */
 int elevator_init_mq(struct request_queue *q)
 {
 	struct elevator_type *e;
+	const char *policy;
 	int err = 0;
 
 	if (q->nr_hw_queues != 1)
@@ -968,7 +971,18 @@  int elevator_init_mq(struct request_queue *q)
 	if (unlikely(q->elevator))
 		goto out_unlock;
 
-	e = elevator_get(q, "mq-deadline", false);
+	/*
+	 * Zoned devices must use a deadline scheduler because currently
+	 * that is the only scheduler respecting zoned writes.
+	 */
+	if (blk_queue_is_zoned(q))
+		policy = "mq-deadline";
+	else if (IS_ENABLED(CONFIG_IOSCHED_BFQ))
+		policy = "bfq";
+	else
+		policy = "mq-deadline";
+
+	e = elevator_get(q, policy, false);
 	if (!e)
 		goto out_unlock;