block: BFQ default for single queue devices

Message ID 20181002124329.21248-1-linus.walleij@linaro.org (mailing list archive)
State New, archived

Commit Message

Linus Walleij Oct. 2, 2018, 12:43 p.m. UTC
This sets BFQ as the default scheduler for single queue
block devices (nr_hw_queues == 1) if it is available. This
notably affects MMC/SD cards, but also UBI and
the loopback device.

I have been running it for a while without any negative
effects on my pet systems and I want some wider testing
so let's throw it out there and see what people say.
Admittedly my use cases are limited.

I talked to Pavel a bit back and it turns out he has a
usecase for BFQ as well and I bet he also would like it
as default scheduler for that system (Pavel tell us more,
I don't remember what it was!)

Intuitively I could understand that maybe we want to
leave the loop device (possibly others? nbd? rbd?) as
"none", as it is probably relying on a scheduler on the
device below it, so I'm open to passing in a scheduler hint
from the respective subsystem in say struct blk_mq_tag_set.
However that makes for a bit of syntactic dissonance
with the struct member ".nr_hw_queues" (I wonder how
the loop device can have 1 "hardware queue"?) so
maybe we should in that case also rename that struct
member to ".nr_queues" fair and square before we start
making adjustments for treating queues differently whether
they are in hardware or actually not.

Cc: Pavel Machek <pavel@ucw.cz>
Cc: Paolo Valente <paolo.valente@linaro.org>
Cc: Jens Axboe <axboe@kernel.dk>
Cc: Ulf Hansson <ulf.hansson@linaro.org>
Cc: Richard Weinberger <richard@nod.at>
Cc: Artem Bityutskiy <dedekind1@gmail.com>
Cc: Adrian Hunter <adrian.hunter@intel.com>
Signed-off-by: Linus Walleij <linus.walleij@linaro.org>
---
 block/elevator.c | 21 ++++++++++++++-------
 1 file changed, 14 insertions(+), 7 deletions(-)

Comments

Jens Axboe Oct. 2, 2018, 2:31 p.m. UTC | #1
On 10/2/18 6:43 AM, Linus Walleij wrote:
> This sets BFQ as the default scheduler for single queue
> block devices (nr_hw_queues == 1) if it is available. This
> affects notably MMC/SD-cards but notably also UBI and
> the loopback device.
> 
> I have been running it for a while without any negative
> effects on my pet systems and I want some wider testing
> so let's throw it out there and see what people say.
> Admittedly my use cases are limited.
> 
> I talked to Pavel a bit back and it turns out he has a
> usecase for BFQ as well and I bet he also would like it
> as default scheduler for that system (Pavel tell us more,
> I don't remember what it was!)
> 
> Intuitively I could understand that maybe we want to
> leave the loop device (possibly others? nbd? rbd?) as
> "none", as it is probably relying on a scheduler on the
> device below it, so I'm open to passing in a scheduler hint
> from the respective subsystem in say struct blk_mq_tag_set.
> However that makes for a bit of syntactic dissonance
> with the struct member ".nr_hw_queues" (I wonder how
> the loop device can have 1 "hardware queue"?) so
> maybe we should in that case also rename that struct
> member to ".nr_queues" fair and square before we start
> making adjustments for treating queues differently whether
> they are in hardware or actually not.

I think this should just be done with udev rules, and I'd
prefer if the distros would lead the way on this, as they
are the ones that will most likely see the most bug reports
on a change like this.
Linus Walleij Oct. 2, 2018, 2:45 p.m. UTC | #2
On Tue, Oct 2, 2018 at 4:31 PM Jens Axboe <axboe@kernel.dk> wrote:
> On 10/2/18 6:43 AM, Linus Walleij wrote:

> > This sets BFQ as the default scheduler for single queue
> > block devices (nr_hw_queues == 1) if it is available. This
> > affects notably MMC/SD-cards but notably also UBI and
> > the loopback device.
>
> I think this should just be done with udev rules, and I'd
> prefer if the distros would lead the way on this, as they
> are the ones that will most likely see the most bug reports
> on a change like this.

AFAICT there is no sysfs property that
states how many hw queues the device has. And what
we want to do is activate BFQ when there is one HW
queue.

Should I make a patch to add a nr_hw_queues sysfs
file for this purpose in that case?

That will be a slightly misleading file for loop or networked
devices.

udev is a way to do this with desktop/server distros that have
"standard" (as they think about it) userspace. They can even
do it from their initrd/initramfs to mount root using BFQ
I guess (quick handover from e.g. UEFI).

However this is not a very good fit with embedded systems:
they tend to be minimal and not use udev (e.g. Android,
OpenWRT, busybox derivatives...), so they don't do udev
rules, but I guess they can in theory do other scripts.
But they will mount root before anything like that can
happen. They don't use initrd/initramfs.
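To illustrate the "other scripts" route: a minimal sketch, assuming the
usual sysfs layout where /sys/block/<dev>/queue/scheduler lists the
available schedulers with the active one in square brackets. The helper
names active_scheduler() and set_scheduler() are hypothetical and not part
of any existing tool. A helper like this can only run once userspace is
up, i.e. after root has already been mounted.

```c
#include <stdio.h>
#include <string.h>

/*
 * Copy the active scheduler name (the one in square brackets) from a
 * line such as "mq-deadline kyber [bfq] none" into out.
 * Returns 0 on success, -1 if there is no bracketed name or it does
 * not fit into out.
 */
int active_scheduler(const char *line, char *out, size_t outlen)
{
	const char *start = strchr(line, '[');
	const char *end = start ? strchr(start, ']') : NULL;
	size_t len;

	if (!start || !end)
		return -1;
	len = (size_t)(end - start - 1);
	if (len == 0 || len >= outlen)
		return -1;
	memcpy(out, start + 1, len);
	out[len] = '\0';
	return 0;
}

/*
 * Ask the kernel to switch scheduler by writing its name to
 * /sys/block/<dev>/queue/scheduler. Returns 0 on success, -1 on error.
 */
int set_scheduler(const char *dev, const char *name)
{
	char path[256];
	FILE *f;
	int ok;

	snprintf(path, sizeof(path), "/sys/block/%s/queue/scheduler", dev);
	f = fopen(path, "w");
	if (!f)
		return -1;
	ok = fputs(name, f) >= 0;
	if (fclose(f) != 0)	/* the write may only fail at flush time */
		ok = 0;
	return ok ? 0 : -1;
}
```

An init script equivalent would call something like
set_scheduler("mmcblk0", "bfq") (device name chosen for illustration only)
and then read the file back to confirm the switch.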

What I want to achieve
is to mount my rootfs with BFQ but that is not possible
on embedded systems that do not use initramfs, e.g.
a rootfs on MMC/SD or UBI.

Yours,
Linus Walleij
Richard Weinberger Oct. 2, 2018, 9:28 p.m. UTC | #3
Linus,

Am Dienstag, 2. Oktober 2018, 14:43:29 CEST schrieb Linus Walleij:
> This sets BFQ as the default scheduler for single queue
> block devices (nr_hw_queues == 1) if it is available. This
> affects notably MMC/SD-cards but notably also UBI and
> the loopback device.

did you notice a difference for UBI?
Strictly speaking it affects only ubiblock, the read-only
block device on top of a UBI volume.

Thanks,
//richard
Paolo Valente Oct. 3, 2018, 6:29 a.m. UTC | #4
> Il giorno 02 ott 2018, alle ore 16:31, Jens Axboe <axboe@kernel.dk> ha scritto:
> 
> On 10/2/18 6:43 AM, Linus Walleij wrote:
>> This sets BFQ as the default scheduler for single queue
>> block devices (nr_hw_queues == 1) if it is available. This
>> affects notably MMC/SD-cards but notably also UBI and
>> the loopback device.
>> 
>> I have been running it for a while without any negative
>> effects on my pet systems and I want some wider testing
>> so let's throw it out there and see what people say.
>> Admittedly my use cases are limited.
>> 
>> I talked to Pavel a bit back and it turns out he has a
>> usecase for BFQ as well and I bet he also would like it
>> as default scheduler for that system (Pavel tell us more,
>> I don't remember what it was!)
>> 
>> Intuitively I could understand that maybe we want to
>> leave the loop device (possibly others? nbd? rbd?) as
>> "none", as it is probably relying on a scheduler on the
>> device below it, so I'm open to passing in a scheduler hint
>> from the respective subsystem in say struct blk_mq_tag_set.
>> However that makes for a bit of syntactic dissonance
>> with the struct member ".nr_hw_queues" (I wonder how
>> the loop device can have 1 "hardware queue"?) so
>> maybe we should in that case also rename that struct
>> member to ".nr_queues" fair and square before we start
>> making adjustments for treating queues differently whether
>> they are in hardware or actually not.
> 
> I think this should just be done with udev rules, and I'd
> prefer if the distros would lead the way on this, as they
> are the ones that will most likely see the most bug reports
> on a change like this.
> 

Hi Jens,
I see your point, but I doubt this is the way to go, because of the
following flaws.

As Linus Torvalds also complained [1], people feel lost among
I/O-scheduler options.  Actual differences across I/O schedulers are
basically obscure to non-experts.  In this respect, Linux-kernel
'users' are way more than a few top-level distros that can afford a
strong performance team, and that, based on the input of such a team,
might venture light-heartedly to change a critical component like an
I/O scheduler.  Plus, as Linus Walleij pointed out, some users simply
are not distros that use udev.

So, probably 99% of Linux-kernel users will just stick to the default
I/O scheduler, mq-deadline, assuming that the algorithm by which that
scheduler was chosen was not "pick the scheduler with the longest
name", but "pick the best scheduler for most cases".  The problem is
that, for single-queue devices with a speed below 400/500 KIOPS, the
default scheduler is apparently incomparably worse than bfq in terms
of responsiveness and latency for time-sensitive applications [2], and
in terms of throughput reached while controlling I/O [3].  And, in all
other tests run so far, by any entity or group I'm aware of, bfq
performs basically on par with or better than mq-deadline.

So, I do understand your need for conservativeness, but, after so much
evidence on single-queue devices, and so many years! :), what's the
point in keeping Linux worse for virtually everybody, by default?

Thanks,
Paolo

[1] https://lkml.org/lkml/2017/2/21/791
[2] http://algo.ing.unimo.it/people/paolo/disk_sched/results.php
[3] https://lwn.net/Articles/763603/



> -- 
> Jens Axboe
Linus Walleij Oct. 3, 2018, 6:53 a.m. UTC | #5
On Wed, Oct 3, 2018 at 8:29 AM Paolo Valente <paolo.valente@linaro.org> wrote:

> So, I do understand your need for conservativeness, but, after so much
> evidence on single-queue devices, and so many years! :), what's the
> point in keeping Linux worse for virtually everybody, by default?

I understand if we need to ease things in as well, I don't intend this
change for the current merge window or anything, since v4.19
will notably have this patch:

commit d5038a13eca72fb216c07eb717169092e92284f1
Author: Johannes Thumshirn <jthumshirn@suse.de>
Date:   Wed Jul 4 10:53:56 2018 +0200

    scsi: core: switch to scsi-mq by default

    It has been more than one year since we tried to change the default from
    legacy to multi queue in SCSI with commit c279bd9e406 ("scsi: default to
    scsi-mq"). But due to issues with suspend/resume and performance problems
    it had been reverted again with commit cbe7dfa26eee ("Revert "scsi: default
    to scsi-mq"").

    In the meantime there have been a substantial amount of performance
    improvements and suspend/resume got fixed as well, thus we can re-enable
    scsi-mq without a significant performance penalty.

    Signed-off-by: Johannes Thumshirn <jthumshirn@suse.de>
    Reviewed-by: Hannes Reinecke <hare@suse.com>
    Reviewed-by: Ming Lei <ming.lei@redhat.com>
    Acked-by: John Garry <john.garry@huawei.com>
    Signed-off-by: Martin K. Petersen <martin.petersen@oracle.com>

I guess that patch can be a bit scary by itself. But IIUC it all went
fine this time!

But hey, if that works, that means $SUBJECT patch will enable BFQ on all
libata devices and any SCSI that is single queue as well, not just
"obscure" stuff like MMC/SD and UBI, and that is
indeed a massive crowd of legacy devices. But we're talking
v4.21 here.

Johannes, you might be interested in $SUBJECT patch.
It'd be nice to hear what SUSE people have to add, since they
are pretty proactive in this area.

Yours,
Linus Walleij
Artem Bityutskiy Oct. 3, 2018, 7:05 a.m. UTC | #6
On Wed, 2018-10-03 at 08:29 +0200, Paolo Valente wrote:
> So, I do understand your need for conservativeness, but, after so much
> evidence on single-queue devices, and so many years! :), what's the
> point in keeping Linux worse for virtually everybody, by default?

Sounds like what we need is just a mechanism for the device (ubi block in
this case) to select the I/O scheduler. I doubt enhancing the default
scheduler selection logic in 'elevator.c' is the right answer. Just
give the driver authority to override the defaults.
Linus Walleij Oct. 3, 2018, 7:18 a.m. UTC | #7
On Wed, Oct 3, 2018 at 9:05 AM Artem Bityutskiy <dedekind1@gmail.com> wrote:
> On Wed, 2018-10-03 at 08:29 +0200, Paolo Valente wrote:
> > So, I do understand your need for conservativeness, but, after so much
> > evidence on single-queue devices, and so many years! :), what's the
> > point in keeping Linux worse for virtually everybody, by default?
>
> Sounds like what we just need a mechanism for the device (ubi block in
> this case) to select the I/O scheduler. I doubt enhancing the default
> scheduler selection logic in 'elevator.c' is the right answer. Just
> give the driver authority to override the defaults.

This might be true in the wider sense (like for what scheduler to
select for an NVME device with N channels) but $SUBJECT is just
trying to select BFQ (if available) for devices with one and only one
hardware queue.

That is AFAICT the only reasonable choice for anything with just
one hardware queue as things stand right now.

I have a slight reservation for the weird outliers like loopdev, which
has "one hardware queue" (.nr_hw_queues == 1) though this
makes no sense at all. So I would like to know what people think
about that. Maybe we should have .nr_queues and .nr_hw_queues
where the former is the number of logical queues and the latter
the actual number of hardware queues.

Yours,
Linus Walleij
Damien Le Moal Oct. 3, 2018, 7:42 a.m. UTC | #8
On 2018/10/03 16:18, Linus Walleij wrote:
> On Wed, Oct 3, 2018 at 9:05 AM Artem Bityutskiy <dedekind1@gmail.com> wrote:
>> On Wed, 2018-10-03 at 08:29 +0200, Paolo Valente wrote:
>>> So, I do understand your need for conservativeness, but, after so much
>>> evidence on single-queue devices, and so many years! :), what's the
>>> point in keeping Linux worse for virtually everybody, by default?
>>
>> Sounds like what we just need a mechanism for the device (ubi block in
>> this case) to select the I/O scheduler. I doubt enhancing the default
>> scheduler selection logic in 'elevator.c' is the right answer. Just
>> give the driver authority to override the defaults.
> 
> This might be true in the wider sense (like for what scheduler to
> select for an NVME device with N channels) but $SUBJECT is just
> trying to select BFQ (if available) for devices with one and only one
> hardware queue.
> 
> That is AFAICT the only reasonable choice for anything with just
> one hardware queue as things stand right now.
> 
> I have a slight reservation for the weird outliers like loopdev, which
> has "one hardware queue" (.nr_hw_queues == 1) though this
> makes no sense at all. So I would like to know what people think
> about that. Maybe we should have .nr_queues and .nr_hw_queues
> where the former is the number of logical queues and the latter
> the actual number of hardware queues.

There is another class of outliers: host-managed SMR disks (SATA and SCSI,
definitely single hw queue). For these, using mq-deadline is mandatory in many
cases in order to guarantee sequential write command delivery to the device
driver. Having the default changed to bfq, which as far as I know is not SMR
friendly (can sequential writes within a single zone be reordered?), is asking
for trouble (unaligned write errors showing up).

A while back, we already had this discussion with Jens and Christoph on the list
to allow device drivers to set a sensible default I/O scheduler for devices with
"special needs" (e.g. host-managed SMR). At the time, the conclusion was that
udev (or something similar in userland) is better suited to set a correct scheduler.

Also of note: host-managed-like sequential zone devices are also likely
to show up soon with the work being done in the NVMe standard on the new "Zoned
namespace" feature proposal. These devices will also require a scheduler like
mq-deadline guaranteeing per-zone in-order delivery of sequential write
requests. Looking only at the number of queues of the device is not enough to
choose the best (most reasonable/appropriate) scheduler.
Linus Walleij Oct. 3, 2018, 8:28 a.m. UTC | #9
On Wed, Oct 3, 2018 at 9:42 AM Damien Le Moal <Damien.LeMoal@wdc.com> wrote:

> There is another class of outliers: host-managed SMR disks (SATA and SCSI,
> definitely single hw queue). For these, using mq-deadline is mandatory in many
> cases in order to guarantee sequential write command delivery to the device
> driver. Having the default changed to bfq, which as far as I know is not SMR
> friendly (can sequential writes within a single zone be reordered ?) is asking
> for troubles (unaligned write errors showing up).

Ah, that is interesting.

Which device driver files are we talking about here, specifically?
I'd like to take a look.

I guess what you say is not that you are looking for the deadline
scheduling per se (as in deadline scheduling is nice), what you want is
the zone locking semantics in that scheduler, is that right?

I.e. this business:
blk_queue_is_zoned(q)
blk_req_zone_write_lock(rq);
blk_req_zone_write_unlock(rq);
and mq-deadline solves this with a spinlock.

I will augment the patch to enforce mq-deadline
if blk_queue_is_zoned(q) is true, as it is clear that
any device with that characteristic must use mq-deadline.
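As a sketch of the selection order being proposed here (not the actual
patch): a small userspace model, assuming the zoned check sits inside the
single-queue branch, since the existing code returns early for multiqueue
devices. struct queue_model and default_elevator() are hypothetical names
for illustration; the zoned flag stands in for blk_queue_is_zoned(q).

```c
#include <stdbool.h>

/* Hypothetical userspace model of the few queue properties that matter. */
struct queue_model {
	unsigned int nr_hw_queues;
	bool zoned;		/* stands in for blk_queue_is_zoned(q) */
};

/*
 * Return the name of the default scheduler for this queue:
 * - "none" for multiqueue devices
 * - "mq-deadline" for zoned single-queue devices (write ordering)
 * - "bfq", then "mq-deadline", then "none" for other single-queue devices
 */
const char *default_elevator(const struct queue_model *q,
			     bool have_bfq, bool have_mq_deadline)
{
	if (q->nr_hw_queues != 1)
		return "none";
	if (q->zoned)
		return have_mq_deadline ? "mq-deadline" : "none";
	if (have_bfq)
		return "bfq";
	if (have_mq_deadline)
		return "mq-deadline";
	return "none";
}
```

Whether a zoned device without mq-deadline should fall back to "none" is
an open choice in this model, not something the thread settles.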

Paolo might be interested in looking into whether BFQ could
also handle zoned devices in the future, I have no idea of how
hard that would be.

The zoned business seems a bit fragile. Should it even be
allowed to select any other scheduler than deadline on these
devices? Presenting all compiled-in schedulers in
/sys/block/<device>/queue/scheduler sounds like just giving
sysadmins too much rope.

Yours,
Linus Walleij
Damien Le Moal Oct. 3, 2018, 8:53 a.m. UTC | #10
Linus,

On 2018/10/03 17:28, Linus Walleij wrote:
> On Wed, Oct 3, 2018 at 9:42 AM Damien Le Moal <Damien.LeMoal@wdc.com> wrote:
> 
>> There is another class of outliers: host-managed SMR disks (SATA and SCSI,
>> definitely single hw queue). For these, using mq-deadline is mandatory in many
>> cases in order to guarantee sequential write command delivery to the device
>> driver. Having the default changed to bfq, which as far as I know is not SMR
>> friendly (can sequential writes within a single zone be reordered ?) is asking
>> for troubles (unaligned write errors showing up).
> 
> Ah, that is interesting.
> 
> Which device driver files are we talking about here, specifically?
> I'd like to take a look.

Currently, sd.c (SCSI disk) as well as null_blk can expose host-managed zoned
block devices.

> I guess what you say is not that you are looking for the deadline
> scheduling per se (as in deadline scheduling is nice), what you want is
> the zone locking semantics in that scheduler, is that right?

Yes, correct. The scheduling policy in itself does not really matter, but should
not deviate from the mandatory HM write policy: "within a sequential write
required zone, writes must be issued sequentially". That could somewhat impact
the scheduler code itself if said scheduler thinks that not dispatching
sequential writes in sequence is a good idea :) No sane scheduler would do that
though (at least on HDDs) so the impact on the scheduler code itself is reduced.

> I.e. this business:
> blk_queue_is_zoned(q)
> blk_req_zone_write_lock(rq);
> blk_req_zone_write_unlock(rq);
> and mq-deadline solves this with a spinlock.

Yes. These are the helper functions handling the zone write locking to simplify
the task of the scheduler to limit the number of in-flight write requests to one
per zone at most at any time. This is the trick to avoid write reordering
stack-wide, since most of the time, reordering happens not because of the
scheduler itself, but the blk-mq (or legacy path) around it (e.g. requeue due to
resource shortage, multiple contexts running the queues, etc).

> I will augment the patch to enforce mq-deadline
> if blk_queue_is_zoned(q) is true, as it is clear that
> any device with that characteristic must use mq-deadline.
> 
> Paolo might be interested in looking into whether BFQ could
> also handle zoned devices in the future, I have no idea of how
> hard that would be.

It was rather easy with deadline, but the scheduler code was simple to start
with. Basically, the only thing needed on dispatch is to skip any write request
to a zone that is already locked (i.e. a write is already ongoing). For reads,
there are no constraints so nothing needs to be changed. Zone unlocking must be
done on completion of the write request so the scheduler completion method needs
to change a little too.
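The dispatch/completion rule described above can be modelled in a few
lines. This is a toy userspace sketch, not mq-deadline code: struct
req_model, struct sched_model, dispatch_next() and complete_request() are
hypothetical names, the pending requests are a plain FIFO array, and zone
numbers are assumed to be below NR_ZONES.

```c
#include <stdbool.h>
#include <stddef.h>

#define NR_ZONES 8	/* arbitrary size for the model */

struct req_model {
	bool is_write;
	unsigned int zone;	/* assumed to be < NR_ZONES */
	bool dispatched;
};

struct sched_model {
	bool zone_locked[NR_ZONES];	/* true while a write is in flight */
};

/*
 * Pick the first pending request in FIFO order that may be dispatched.
 * Reads are never held back; a write is skipped while its zone is
 * locked, otherwise it takes the zone lock. Returns the request index,
 * or -1 if nothing can be dispatched right now.
 */
int dispatch_next(struct sched_model *s, struct req_model *reqs, size_t n)
{
	for (size_t i = 0; i < n; i++) {
		struct req_model *rq = &reqs[i];

		if (rq->dispatched)
			continue;
		if (rq->is_write) {
			if (s->zone_locked[rq->zone])
				continue;	/* write to this zone in flight */
			s->zone_locked[rq->zone] = true;
		}
		rq->dispatched = true;
		return (int)i;
	}
	return -1;
}

/* On completion of a write, release its zone so the next write can go. */
void complete_request(struct sched_model *s, const struct req_model *rq)
{
	if (rq->is_write)
		s->zone_locked[rq->zone] = false;
}
```

Because pending writes are scanned in submission order and at most one
write per zone is in flight, writes within a zone reach the device in the
order they were queued, even if something outside the scheduler would
otherwise reorder in-flight requests.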

> The zoned business seems a bit fragile. Should it even be
> allowed to select any other scheduler than deadline on these
> devices? Presenting all compiled-in schedulers in
> /sys/block/<device>/queue/scheduler sounds like just giving
> sysadmins too much rope.

Yes, that is debatable. But the above "one write per zone" trick can be also
handled by an application, which makes using any other scheduler OK. Look at the
recent SMR support code in fio from Bart. Some of the tests (in t/zbd) for some
I/O patterns are just fine with any scheduler. In fact any pattern is OK because
fio is SMR aware and never issues more than one write per zone. That is however
true only for I/O sizes that are small enough not to cause the
kernel to generate multiple BIOs for each I/O call. Otherwise, deadline &
mq-deadline become necessary.

But I agree, it is a little fragile. Application developers and sysadmins really
need to know what will be running on the disk to make the right choice. And
knowing that is not necessarily straightforward.

Best regards.
Oleksandr Natalenko Oct. 3, 2018, 11:49 a.m. UTC | #11
Hi.

On 03.10.2018 08:29, Paolo Valente wrote:
> As also Linus Torvalds complained [1], people feel lost among
> I/O-scheduler options.  Actual differences across I/O schedulers are
> basically obscure to non experts.  In this respect, Linux-kernel
> 'users' are way more than a few top-level distros that can afford a
> strong performance team, and that, basing on the input of such a team,
> might venture light-heartedly to change a critical component like an
> I/O scheduler.  Plus, as Linus Walleij pointed out, some users simply
> are not distros that use udev.

I feel a contradiction in this counter-argument. On one hand, there are 
lots of, let's call them, home users, that use major distributions with 
udev, so the distribution maintainers can reasonably decide which 
scheduler to use for which type of device based on the udev rule and 
common sense provided via Documentation/ by linux-block devs. Moreover, 
most likely, those rules should be similar or the same across all the 
major distros and available via some (systemd?) upstream.

On the other hand, the users of embedded devices, mentioned by Linus,
should already know what scheduler to choose, because dealing with the
embedded world assumes the person can decide this on their own, or with
the help of the abovementioned udev scripts and/or Documentation/ as a
reference point.

So I see no obstacles here, and the choice to rely on udev by default 
sounds reasonable.

The question that remains is whether it is really important to mount a
root partition while already using some specific scheduler. Why can it
not be done with "none", for instance?

> So, probably 99% of Linux-kernel users will just stick to the default
> I/O scheduler, mq-deadline, assuming that the algorithm by which that
> scheduler was chosen was not "pick the scheduler with the longest
> name", but "pick the best scheduler for most cases".  The problem is
> that, for single-queue devices with a speed below 400/500 KIOPS, the
> default scheduler is apparently incomparably worse than bfq in terms
> of responsiveness and latency for time-sensitive applications [2], and
> in terms of throughput reached while controlling I/O [3].  And, in all
> other tests ran so far, by any entity or group I'm aware of, bfq
> results basically on par with or better than mq-deadline.

And that's why major distributions are likely to default to BFQ via 
udev. No one argues with BFQ superiority here ☺.

> So, I do understand your need for conservativeness, but, after so much
> evidence on single-queue devices, and so many years! :), what's the
> point in keeping Linux worse for virtually everybody, by default?

From my point of view this is not a conservative approach at all. On the
contrary, offloading decisions to userspace aligns pretty well with
recent trends like pressure metrics/userspace OOM killer, eBPF etc. The
less unnecessary logic the kernel handles, the more flexibility it
affords.
Christoph Hellwig Oct. 3, 2018, 12:51 p.m. UTC | #12
On Wed, Oct 03, 2018 at 07:42:15AM +0000, Damien Le Moal wrote:
> Of note also is that host-managed like sequential zone devices are also likely
> to show up soon with the work being done in the NVMe standard on the new "Zoned
> namespace" feature proposal. These devices will also require a scheduler like
> mq-deadline guaranteeing per-zone in-order delivery of sequential write
> requests. Looking only at the number of queues of the device is not enough to
> choose the best (most reasonable/appropriate) scheduler.

We actually have a plan to avoid the need for a non-reordering scheduler
there (including a Linux prototype for it).  Let's see if it survives the
committee.
Jan Kara Oct. 3, 2018, 1:25 p.m. UTC | #13
On Wed 03-10-18 08:53:37, Linus Walleij wrote:
> On Wed, Oct 3, 2018 at 8:29 AM Paolo Valente <paolo.valente@linaro.org> wrote:
> 
> > So, I do understand your need for conservativeness, but, after so much
> > evidence on single-queue devices, and so many years! :), what's the
> > point in keeping Linux worse for virtually everybody, by default?
> 
> I understand if we need to ease things in as well, I don't intend this
> change for the current merge window or anything, since v4.19
> will notably have this patch:
> 
> commit d5038a13eca72fb216c07eb717169092e92284f1
> Author: Johannes Thumshirn <jthumshirn@suse.de>
> Date:   Wed Jul 4 10:53:56 2018 +0200
> 
>     scsi: core: switch to scsi-mq by default
> 
>     It has been more than one year since we tried to change the default from
>     legacy to multi queue in SCSI with commit c279bd9e406 ("scsi: default to
>     scsi-mq"). But due to issues with suspend/resume and performance problems
>     it had been reverted again with commit cbe7dfa26eee ("Revert "scsi: default
>     to scsi-mq"").
> 
>     In the meantime there have been a substantial amount of performance
>     improvements and suspend/resume got fixed as well, thus we can re-enable
>     scsi-mq without a significant performance penalty.
> 
>     Signed-off-by: Johannes Thumshirn <jthumshirn@suse.de>
>     Reviewed-by: Hannes Reinecke <hare@suse.com>
>     Reviewed-by: Ming Lei <ming.lei@redhat.com>
>     Acked-by: John Garry <john.garry@huawei.com>
>     Signed-off-by: Martin K. Petersen <martin.petersen@oracle.com>
> 
> I guess that patch can be a bit scary by itself. But IIUC it all went
> fine this time!
> 
> But hey, if that works, that means $SUBJECT patch will enable BFQ on all
> libata devices and any SCSI that is single queue as well, not just
> "obscure" stuff like MMC/SD and UBI, and that is
> indeed a massive crowd of legacy devices. But we're talking
> v4.21 here.
> 
> Johannes, you might be interested in $SUBJECT patch.
> It'd be nice to hear what SUSE people have to add, since they
> are pretty proactive in this area.

So we do have udev rules in our distro which set the IO scheduler based
on device parameters (rotational at least, with blk-mq we might start
considering number of queues as well, plus we have some exceptions like
virtio, loop, etc.). So the kernel default doesn't concern us too much as a
distro.

I personally would consider bfq a safer default for single-queue devices
(loop probably needs exception) but I don't feel too strongly about it.

								Honza
Mark Brown Oct. 3, 2018, 2:51 p.m. UTC | #14
On Wed, Oct 03, 2018 at 01:49:25PM +0200, Oleksandr Natalenko wrote:

> On another hand, the users of embedded devices, mentioned by Linus, should
> already know what scheduler to choose because dealing with embedded world
> assumes the person can decide this on their own, or with the help of
> abovementioned udev scripts and/or Documentation/ as a reference point.

That's not an entirely realistic assessment of a lot of practical
embedded development - while people *can* go in and tweak things to
their heart's content and some will have the time to do that there's a
lot of small teams pulling together entire systems who rely fairly
heavily on defaults, focusing most of their effort on the bits of code
they directly wrote.  You get things like people taking a copy of an
embedded distro at some point and then only updating components that
they specifically want to update like the new kernel with the drivers
for the SoC in the new product.

> So I see no obstacles here, and the choice to rely on udev by default sounds
> reasonable.

There's still a good number of users where there's a big discoverability
problem here I fear.

We have this regularly with the arm64 fixups for emulating old locking
constructs that were removed from the architecture (useful for running
old arm binaries on arm64 systems), that's got a Kconfig option but also
requires enabling at runtime.  I've had to help several users who were
completely frustrated trying to get their old binaries working having
upgraded to a kernel with the option, turned it on in Kconfig and then
being unaware that there was also this hoop userspace had to jump
through.  This is less severe as it's only a performance thing but still
potentially annoying.
Bart Van Assche Oct. 3, 2018, 2:58 p.m. UTC | #15
On Wed, 2018-10-03 at 05:51 -0700, Christoph Hellwig wrote:
> On Wed, Oct 03, 2018 at 07:42:15AM +0000, Damien Le Moal wrote:
> > Of note also is that host-managed like sequential zone devices are also likely
> > to show up soon with the work being done in the NVMe standard on the new "Zoned
> > namespace" feature proposal. These devices will also require a scheduler like
> > mq-deadline guaranteeing per-zone in-order delivery of sequential write
> > requests. Looking only at the number of queues of the device is not enough to
> > choose the best (most reasonable/appropriate) scheduler.
> 
> We actually have a plan to avoid the need for a non-reordering scheduler
> there (including a Linux prototype for it).  Lets see if it survives the
> committee.

Has the work with the T10 committee to standardize the SCSI equivalent of anonymous
writes already started?

Thanks,

Bart.
Christoph Hellwig Oct. 3, 2018, 3:01 p.m. UTC | #16
On Wed, Oct 03, 2018 at 07:58:52AM -0700, Bart Van Assche wrote:
> Has the work with the T10 committee to standardize the SCSI equivalent of anonymous
> writes already started?

No, and I don't know of anyone who wants to do that in the short term.
Bart Van Assche Oct. 3, 2018, 3:15 p.m. UTC | #17
On Wed, 2018-10-03 at 08:01 -0700, Christoph Hellwig wrote:
> On Wed, Oct 03, 2018 at 07:58:52AM -0700, Bart Van Assche wrote:
> > Has the work with the T10 committee to standardize the SCSI equivalent of anonymous
> > writes already started?
> 
> No, and I don't know of anyone who wants to do that in the short term.

That's unfortunate. I think having such a command available in the SCSI
command set would be a step forward.

Bart.
Paolo Valente Oct. 3, 2018, 3:51 p.m. UTC | #18
> Il giorno 02 ott 2018, alle ore 14:43, Linus Walleij <linus.walleij@linaro.org> ha scritto:
> 
> This sets BFQ as the default scheduler for single queue
> block devices (nr_hw_queues == 1) if it is available. This
> affects notably MMC/SD-cards but notably also UBI and
> the loopback device.
> 
> I have been running it for a while without any negative
> effects on my pet systems and I want some wider testing
> so let's throw it out there and see what people say.
> Admittedly my use cases are limited.
> 
> I talked to Pavel a bit back and it turns out he has a
> usecase for BFQ as well and I bet he also would like it
> as default scheduler for that system (Pavel tell us more,
> I don't remember what it was!)
> 
> Intuitively I could understand that maybe we want to
> leave the loop device

Actually, I've tested loop devices too.  And, also with these virtual
devices, switching to bfq radically improves figures of merit such as
responsiveness and latency for soft real-time applications.

Thanks,
Paolo


> (possibly others? nbd? rbd?) as
> "none", as it is probably relying on a scheduler on the
> device below it, so I'm open to passing in a scheduler hint
> from the respective subsystem in say struct blk_mq_tag_set.
> However that makes for a bit of syntactic dissonance
> with the struct member ".nr_hw_queues" (I wonder how
> the loop device can have 1 "hardware queue"?) so
> maybe we should in that case also rename that struct
> member to ".nr_queues" fair and square before we start
> making adjustments for treating queues differently whether
> they are in hardware or actually not.
> 
> Cc: Pavel Machek <pavel@ucw.cz>
> Cc: Paolo Valente <paolo.valente@linaro.org>
> Cc: Jens Axboe <axboe@kernel.dk>
> Cc: Ulf Hansson <ulf.hansson@linaro.org>
> Cc: Richard Weinberger <richard@nod.at>
> Cc: Artem Bityutskiy <dedekind1@gmail.com>
> Cc: Adrian Hunter <adrian.hunter@intel.com>
> Signed-off-by: Linus Walleij <linus.walleij@linaro.org>
> ---
> block/elevator.c | 21 ++++++++++++++-------
> 1 file changed, 14 insertions(+), 7 deletions(-)
> 
> diff --git a/block/elevator.c b/block/elevator.c
> index e18ac68626e3..e5a2c39eee7b 100644
> --- a/block/elevator.c
> +++ b/block/elevator.c
> @@ -948,13 +948,15 @@ int elevator_switch_mq(struct request_queue *q,
> }
> 
> /*
> - * For blk-mq devices, we default to using mq-deadline, if available, for single
> - * queue devices.  If deadline isn't available OR we have multiple queues,
> - * default to "none".
> + * For blk-mq devices, we default to using:
> + * - "none" for multiqueue devices (nr_hw_queues != 1)
> + * - "bfq", if available, for single queue devices
> + * - "mq-deadline" if "bfq" is not available for single queue devices
> + * - "none" for single queue devices as well as last resort
>  */
> int elevator_init_mq(struct request_queue *q)
> {
> -	struct elevator_type *e;
> +	struct elevator_type *e = NULL;
> 	int err = 0;
> 
> 	if (q->nr_hw_queues != 1)
> @@ -968,9 +970,14 @@ int elevator_init_mq(struct request_queue *q)
> 	if (unlikely(q->elevator))
> 		goto out_unlock;
> 
> -	e = elevator_get(q, "mq-deadline", false);
> -	if (!e)
> -		goto out_unlock;
> +	if (IS_ENABLED(CONFIG_IOSCHED_BFQ))
> +		e = elevator_get(q, "bfq", false);
> +
> +	if (!e) {
> +		e = elevator_get(q, "mq-deadline", false);
> +		if (!e)
> +			goto out_unlock;
> +	}
> 
> 	err = blk_mq_init_sched(q, e);
> 	if (err)
> -- 
> 2.17.1
>
Paolo Valente Oct. 3, 2018, 3:52 p.m. UTC | #19
> Il giorno 03 ott 2018, alle ore 09:42, Damien Le Moal <damien.lemoal@wdc.com> ha scritto:
> 
> On 2018/10/03 16:18, Linus Walleij wrote:
>> On Wed, Oct 3, 2018 at 9:05 AM Artem Bityutskiy <dedekind1@gmail.com> wrote:
>>> On Wed, 2018-10-03 at 08:29 +0200, Paolo Valente wrote:
>>>> So, I do understand your need for conservativeness, but, after so much
>>>> evidence on single-queue devices, and so many years! :), what's the
>>>> point in keeping Linux worse for virtually everybody, by default?
>>> 
>>> Sounds like what we just need a mechanism for the device (ubi block in
>>> this case) to select the I/O scheduler. I doubt enhancing the default
>>> scheduler selection logic in 'elevator.c' is the right answer. Just
>>> give the driver authority to override the defaults.
>> 
>> This might be true in the wider sense (like for what scheduler to
>> select for an NVME device with N channels) but $SUBJECT is just
>> trying to select BFQ (if available) for devices with one and only one
>> hardware queue.
>> 
>> That is AFAICT the only reasonable choice for anything with just
>> one hardware queue as things stand right now.
>> 
>> I have a slight reservation for the weird outliers like loopdev, which
>> has "one hardware queue" (.nr_hw_queues == 1) though this
>> makes no sense at all. So I would like to know what people think
>> about that. Maybe we should have .nr_queues and .nr_hw_queues
>> where the former is the number of logical queues and the latter
>> the actual number of hardware queues.
> 
> There is another class of outliers: host-managed SMR disks (SATA and SCSI,
> definitely single hw queue). For these, using mq-deadline is mandatory in many
> cases in order to guarantee sequential write command delivery to the device
> driver. Having the default changed to bfq, which as far as I know is not SMR
> friendly (can sequential writes within a single zone be reordered ?) is asking
> for troubles (unaligned write errors showing up).
> 

Hi Damien,
actually I have followed the threads on SMR devices, and have already looked
into this.  I'm sorry for not having mentioned it in my first reply.

My plan is to simply port this feature from mq-deadline to bfq.  It
should be really straightforward, especially after the testing you did
through mq-deadline.  Even if I'm missing some less trivial hidden
issue, I guess it won't be impossible to address.

If it may be useful for the outcome of this thread, I'm willing to
raise the priority of this change to bfq.

> A while back, we already had this discussion with Jens and Christoph on the list
> to allow device drivers to set a sensible default I/O scheduler for devices with
> "special needs" (e.g. host-managed SMR). At the time, the conclusion was that
> udev (or something alike in userland) is better suited to set a correct scheduler.
> 
> Of note also is that host-managed like sequential zone devices are also likely
> to show up soon with the work being done in the NVMe standard on the new "Zoned
> namespace" feature proposal. These devices will also require a scheduler like
> mq-deadline guaranteeing per-zone in-order delivery of sequential write
> requests. Looking only at the number of queues of the device is not enough to
> choose the best (most reasonable/appropriate) scheduler.
> 

Until bfq simply handles SMR devices too.

Thanks,
Paolo

> -- 
> Damien Le Moal
> Western Digital Research
Paolo Valente Oct. 3, 2018, 3:53 p.m. UTC | #20
> Il giorno 03 ott 2018, alle ore 10:28, Linus Walleij <linus.walleij@linaro.org> ha scritto:
> 
> On Wed, Oct 3, 2018 at 9:42 AM Damien Le Moal <Damien.LeMoal@wdc.com> wrote:
> 
>> There is another class of outliers: host-managed SMR disks (SATA and SCSI,
>> definitely single hw queue). For these, using mq-deadline is mandatory in many
>> cases in order to guarantee sequential write command delivery to the device
>> driver. Having the default changed to bfq, which as far as I know is not SMR
>> friendly (can sequential writes within a single zone be reordered ?) is asking
>> for troubles (unaligned write errors showing up).
> 
> Ah, that is interesting.
> 
> Which device driver files are we talking about here, specifically?
> I'd like to take a look.
> 
> I guess what you say is not that you are looking for the deadline
> scheduling per se (as in deadline scheduling is nice), what you want is
> the zone locking semantics in that scheduler, is that right?
> 
> I.e. this business:
> blk_queue_is_zoned(q)
> blk_req_zone_write_lock(rq);
> blk_req_zone_write_unlock(rq);
> and mq-deadline solves this with a spinlock.
> 
> I will augment the patch to enforce mq-deadline
> if blk_queue_is_zoned(q) is true, as it is clear that
> any device with that characteristic must use mq-deadline.
> 
> Paolo might be interested in looking into whether BFQ could
> also handle zoned devices in the future, I have no idea of how
> hard that would be.
> 

Absolutely, as I already wrote in my reply to Damien.

In the meantime, Linus, augmenting your patch as you propose seems
a clean and effective solution to me.

Thanks,
Paolo

> The zoned business seems a bit fragile. Should it even be
> allowed to select any other scheduler than deadline on these
> devices? Presenting all compiled in schedulers in
> /sys/block/<device>/queue/scheduler sounds like just giving
> sysadmins too much rope.
> 
> Yours,
> Linus Walleij
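
The default-selection policy agreed on above (keep "none" for multiqueue,
force mq-deadline on zoned devices, otherwise prefer bfq) can be sketched
outside the kernel as follows. This is only an illustration under stated
assumptions, not the v2 patch: struct queue_info, struct sched_avail, the
ELV_* enum and pick_default_sched() are hypothetical names, and falling back
to "none" for a zoned device when mq-deadline is missing is an assumption of
this sketch.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>

/* Hypothetical result type; the kernel identifies schedulers by name. */
enum elv_choice { ELV_NONE, ELV_MQ_DEADLINE, ELV_BFQ };

/* Hypothetical stand-in for the request_queue properties that matter here. */
struct queue_info {
	unsigned int nr_hw_queues; /* number of hardware queues */
	bool zoned;                /* what blk_queue_is_zoned(q) would report */
};

/* Hypothetical description of which schedulers can be obtained. */
struct sched_avail {
	bool bfq;
	bool mq_deadline;
};

/*
 * Default scheduler under the policy discussed in this thread:
 *  - "none" for multiqueue devices (nr_hw_queues != 1)
 *  - "mq-deadline" for zoned single-queue devices
 *  - "bfq", if available, for other single-queue devices
 *  - "mq-deadline" if "bfq" is not available
 *  - "none" as last resort
 */
enum elv_choice pick_default_sched(const struct queue_info *q,
				   const struct sched_avail *a)
{
	assert(q != NULL && a != NULL);

	if (q->nr_hw_queues != 1)
		return ELV_NONE;
	if (q->zoned)
		return a->mq_deadline ? ELV_MQ_DEADLINE : ELV_NONE;
	if (a->bfq)
		return ELV_BFQ;
	if (a->mq_deadline)
		return ELV_MQ_DEADLINE;
	return ELV_NONE;
}
```

The zoned check is placed before the bfq check on purpose, so that a
host-managed SMR disk never ends up on bfq merely because bfq is built in.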
Bart Van Assche Oct. 3, 2018, 3:54 p.m. UTC | #21
On Wed, 2018-10-03 at 08:29 +0200, Paolo Valente wrote:
> [1] https://lkml.org/lkml/2017/2/21/791
> [2] http://algo.ing.unimo.it/people/paolo/disk_sched/results.php
> [3] https://lwn.net/Articles/763603/

From [2]: "BFQ loses about 18% with only random readers, because the number
of IOPS becomes so high that the execution time and parallel efficiency of
the schedulers becomes relevant." Since the number of I/O patterns for which
results are available on [2] is limited and since the number of devices for
which test results are available on [2] is limited (e.g. RAID is missing),
there might be other cases in which configuring BFQ as the default would
introduce a regression.

I agree with Jens that it's best to leave it to the Linux distributors to
select a default I/O scheduler.

Bart.
Paolo Valente Oct. 3, 2018, 3:55 p.m. UTC | #22
> Il giorno 03 ott 2018, alle ore 13:49, Oleksandr Natalenko <oleksandr@natalenko.name> ha scritto:
> 
> Hi.
> 
> On 03.10.2018 08:29, Paolo Valente wrote:
>> As also Linus Torvalds complained [1], people feel lost among
>> I/O-scheduler options.  Actual differences across I/O schedulers are
>> basically obscure to non experts.  In this respect, Linux-kernel
>> 'users' are way more than a few top-level distros that can afford a
>> strong performance team, and that, basing on the input of such a team,
>> might venture light-heartedly to change a critical component like an
>> I/O scheduler.  Plus, as Linus Walleij pointed out, some users simply
>> are not distros that use udev.
> 
> I feel a contradiction in this counter-argument. On one hand, there are lots of, let's call them, home users, that use major distributions with udev, so the distribution maintainers can reasonably decide which scheduler to use for which type of device based on the udev rule and common sense provided via Documentation/ by linux-block devs. Moreover, most likely, those rules should be similar or the same across all the major distros and available via some (systemd?) upstream.
> 

Let me basically repeat Mark's answer here, in my own words.

Unfortunately, the facts do not match your optimistic view: after so many
years and concordant test results, only very few distributions
switched to bfq, no major distribution did (AFAIK).  As I already
wrote, the reason is the one pointed out by Torvalds [1].  Do you want
a simple example?  Take the last sentence in Jan's email in this
thread: "I *personally would* consider bfq a safer default ...  but *I
don't feel too strongly* about it." And he is definitely a storage
expert.

The problem, in particular, is that bfq is a complex beast, fighting
against a jungle of I/O issues.  You have to be really into bfq, even
to just know all of its features!

> On another hand, the users of embedded devices, mentioned by Linus, should already know what scheduler to choose because dealing with embedded world assumes the person can decide this on their own, or with the help of abovementioned udev scripts and/or Documentation/ as a reference point.
> 

Same situation for embedded devices, if not even worse.  Again for the
same reasons above.  In the end, it is hard even for a kernel expert
to be an in-depth expert of every possible complex component.

> So I see no obstacles here, and the choice to rely on udev by default sounds reasonable.
> 
> The question that remain is whether it is really important to mount a root partition while already using some specific scheduler? Why it cannot be done with "none", for instance?
> 
>> So, probably 99% of Linux-kernel users will just stick to the default
>> I/O scheduler, mq-deadline, assuming that the algorithm by which that
>> scheduler was chosen was not "pick the scheduler with the longest
>> name", but "pick the best scheduler for most cases".  The problem is
>> that, for single-queue devices with a speed below 400/500 KIOPS, the
>> default scheduler is apparently incomparably worse than bfq in terms
>> of responsiveness and latency for time-sensitive applications [2], and
>> in terms of throughput reached while controlling I/O [3].  And, in all
>> other tests ran so far, by any entity or group I'm aware of, bfq
>> results basically on par with or better than mq-deadline.
> 
> And that's why major distributions are likely to default to BFQ via udev. No one argues with BFQ superiority here ☺.
> 
>> So, I do understand your need for conservativeness, but, after so much
>> evidence on single-queue devices, and so many years! :), what's the
>> point in keeping Linux worse for virtually everybody, by default?
> 
> From my point of view this is not a conservative approach at all. On contrary, offloading decisions to userspace aligns pretty well with recent trends like pressure metrics/userspace OOM killer, eBPF etc. The less unnecessary logic the kernel handles, the more flexibility it affords.
> 

So as not to answer too seriously here, let me reply with a quote whose
attribution is still unclear: "Everything should be made as simple
as possible, but not simpler." :)

Thanks,
Paolo

> -- 
>  Oleksandr Natalenko (post-factum)
Bart Van Assche Oct. 3, 2018, 4 p.m. UTC | #23
On Wed, 2018-10-03 at 17:55 +0200, Paolo Valente wrote:
> The problem, in particular, is that bfq is a complex beast, fighting
> against a jungle of I/O issues.  You have to be really into bfq, even
> to just know all of its features!

This is a problem by itself. I don't know anyone who wants to have to deal
with I/O scheduler tunables.

Bart.
Paolo Valente Oct. 3, 2018, 4:02 p.m. UTC | #24
> Il giorno 03 ott 2018, alle ore 17:54, Bart Van Assche <bvanassche@acm.org> ha scritto:
> 
> On Wed, 2018-10-03 at 08:29 +0200, Paolo Valente wrote:
>> [1] https://lkml.org/lkml/2017/2/21/791
>> [2] http://algo.ing.unimo.it/people/paolo/disk_sched/results.php
>> [3] https://lwn.net/Articles/763603/
> 
> From [2]: "BFQ loses about 18% with only random readers, because the number
> of IOPS becomes so high that the execution time and parallel efficiency of
> the schedulers becomes relevant." Since the number of I/O patterns for which
> results are available on [2] is limited and since the number of devices for
> which test results are available on [2] is limited (e.g. RAID is missing),
> there might be other cases in which configuring BFQ as the default would
> introduce a regression.
> 

From [3]: none with throttling loses 80% of the throughput when used
to control I/O. On any drive. And this is really only one example among a ton.

In addition, the test you mention, designed by me, was meant exactly
to find and show the worst breaking point of BFQ.  If your main
workload of interest is really made only of tens of parallel threads
doing only sync random I/O, and you care only about throughput,
without any concern for your system becoming so unresponsive as to be
unusable during the test, then, yes, mq-deadline is a better option
for you.

So, are you really sure the balance is in favor of mq-deadline?

Thanks,
Paolo

> I agree with Jens that it's best to leave it to the Linux distributors to
> select a default I/O scheduler.
> 
> Bart.
Paolo Valente Oct. 3, 2018, 4:04 p.m. UTC | #25
> Il giorno 03 ott 2018, alle ore 18:00, Bart Van Assche <bvanassche@acm.org> ha scritto:
> 
> On Wed, 2018-10-03 at 17:55 +0200, Paolo Valente wrote:
>> The problem, in particular, is that bfq is a complex beast, fighting
>> against a jungle of I/O issues.  You have to be really into bfq, even
>> to just know all of its features!
> 
> This is a problem by itself. I don't know anyone who wants to have to deal
> with I/O scheduler tunables.
> 

In fact, I designed and am constantly improving bfq, exactly so that
you don't have to touch any tunable.

Thanks,
Paolo

> Bart.
>
Paolo Valente Oct. 3, 2018, 5:22 p.m. UTC | #26
> Il giorno 03 ott 2018, alle ore 18:02, Paolo Valente <paolo.valente@linaro.org> ha scritto:
> 
> 
> 
>> Il giorno 03 ott 2018, alle ore 17:54, Bart Van Assche <bvanassche@acm.org> ha scritto:
>> 
>> On Wed, 2018-10-03 at 08:29 +0200, Paolo Valente wrote:
>>> [1] https://lkml.org/lkml/2017/2/21/791
>>> [2] http://algo.ing.unimo.it/people/paolo/disk_sched/results.php
>>> [3] https://lwn.net/Articles/763603/
>> 
>> From [2]: "BFQ loses about 18% with only random readers, because the number
>> of IOPS becomes so high that the execution time and parallel efficiency of
>> the schedulers becomes relevant." Since the number of I/O patterns for which
>> results are available on [2] is limited and since the number of devices for
>> which test results are available on [2] is limited (e.g. RAID is missing),
>> there might be other cases in which configuring BFQ as the default would
>> introduce a regression.
>> 
> 
> From [3]: none with throttling loses 80% of the throughput when used
> to control I/O. On any drive. And this is really only one example among a ton.
> 

I forgot to add that the same 80% loss happens with mq-deadline plus
throttling, sorry.  In addition, mq-deadline suffers from much more
than an 18% loss of throughput, w.r.t. bfq, exactly in the same figure
you cited, if there are random writes too.

> In addition, the test you mention, designed by me, was meant exactly
> to find and show the worst breaking point of BFQ.  If your main
> workload of interest is really made only of tens of parallel threads
> doing only sync random I/O, and you care only about throughput,
> without any concern for your system becoming so unresponsive as to be
> unusable during the test, then, yes, mq-deadline is a better option
> for you.
> 

Some more detail on this.  The fact that bfq reaches a lower
throughput than none in this test is actually still puzzling me,
because the process rate of I/O with bfq is one order of magnitude
higher than the IOPS of this device.  So, I still don't understand
why, with bfq, the queue of the device does not get as full as with
none, and thus why the throughput with bfq is not the same as with
none.

To further test this issue, I replaced sync I/O with async I/O (with a
very high depth).  And, nonsensically (for me), throughput dropped
with both bfq and none!  I had already meant to report this issue,
after investigating it more.  Anyway, this is a different story w.r.t.
this thread.

Thanks,
Paolo


> So, are you really sure the balance is in favor of mq-deadline?
> 
> Thanks,
> Paolo
> 
>> I agree with Jens that it's best to leave it to the Linux distributors to
>> select a default I/O scheduler.
>> 
>> Bart.
> 
Bryan Gurney Oct. 3, 2018, 5:34 p.m. UTC | #27
On Wed, Oct 3, 2018 at 11:53 AM, Paolo Valente <paolo.valente@linaro.org> wrote:
>
>
>> Il giorno 03 ott 2018, alle ore 10:28, Linus Walleij <linus.walleij@linaro.org> ha scritto:
>>
>> On Wed, Oct 3, 2018 at 9:42 AM Damien Le Moal <Damien.LeMoal@wdc.com> wrote:
>>
>>> There is another class of outliers: host-managed SMR disks (SATA and SCSI,
>>> definitely single hw queue). For these, using mq-deadline is mandatory in many
>>> cases in order to guarantee sequential write command delivery to the device
>>> driver. Having the default changed to bfq, which as far as I know is not SMR
>>> friendly (can sequential writes within a single zone be reordered ?) is asking
>>> for troubles (unaligned write errors showing up).
>>
>> Ah, that is interesting.
>>
>> Which device driver files are we talking about here, specifically?
>> I'd like to take a look.
>>
>> I guess what you say is not that you are looking for the deadline
>> scheduling per se (as in deadline scheduling is nice), what you want is
>> the zone locking semantics in that scheduler, is that right?
>>
>> I.e. this business:
>> blk_queue_is_zoned(q)
>> blk_req_zone_write_lock(rq);
>> blk_req_zone_write_unlock(rq);
>> and mq-deadline solves this with a spinlock.
>>
>> I will augment the patch to enforce mq-deadline
>> if blk_queue_is_zoned(q) is true, as it is clear that
>> any device with that characteristic must use mq-deadline.
>>
>> Paolo might be interested in looking into whether BFQ could
>> also handle zoned devices in the future, I have no idea of how
>> hard that would be.
>>
>
> Absolutely, as I already wrote in my reply to Damien.
>
> In the meantime, Linus, augmenting your patch as you propose seems
> a clean and effective solution to me.
>
> Thanks,
> Paolo
>
>> The zoned business seems a bit fragile. Should it even be
>> allowed to select any other scheduler than deadline on these
>> devices? Presenting all compiled in schedulers in
>> /sys/block/<device>/queue/scheduler sounds like just giving
>> sysadmins too much rope.
>>
>> Yours,
>> Linus Walleij
>

Right now, users of host-managed SMR drives should be using "deadline"
or "mq-deadline", to avoid out-of-order writes in sequential-only
zones.

I'm running into a situation right now on a test system (Fedora 28,
4.18.7 kernel) where I copied test data onto an F2FS filesystem, but I
accidentally forgot to add my "udev rule" file:

# cat /etc/udev/rules.d/99-zoned-block-devices.rules
ACTION=="add|change", KERNEL=="sd[a-z]", ATTRS{queue/zoned}=="host-managed", ATTR{queue/scheduler}="deadline"

...and now, I see these messages when that specific SMR drive is mounted:

kernel: F2FS-fs (sdc): IO Block Size:        4 KB
kernel: F2FS-fs (sdc): Found nat_bits in checkpoint
kernel: F2FS-fs (sdc): Mounted with checkpoint version = 212216ab
kernel: mpt3sas_cm0: log_info(0x31080000): originator(PL), code(0x08), sub_code(0x0000)
kernel: mpt3sas_cm0: log_info(0x31080000): originator(PL), code(0x08), sub_code(0x0000)
kernel: scsi_io_completion: 20 callbacks suppressed
kernel: sd 7:0:0:0: [sdb] tag#0 FAILED Result: hostbyte=DID_OK driverbyte=DRIVER_SENSE
kernel: sd 7:0:0:0: [sdb] tag#0 Sense Key : Aborted Command [current]
kernel: sd 7:0:0:0: [sdb] tag#0 Add. Sense: No additional sense information
kernel: sd 7:0:0:0: [sdb] tag#0 CDB: Write(16) 8a 00 00 00 00 00 3d d4 ec 99 00 00 00 80 00 00

I was also running into problems with creating new directories on this
F2FS filesystem.  However, "fsck.f2fs" reports no problems.  So at
this point, I created a new F2FS filesystem on a second SMR drive, and
am currently copying the data from the "bad" F2FS filesystem to the
"good" one.

I wouldn't call zoned block devices "fragile"; they simply have I/O
rules that didn't previously exist: all writes to sequential-only
zones must be sequential.  And one of the things that schedulers do is
reorder writes.  After 4.16, sd stopped being the "gatekeeper" of
ensuring sequential writes, but the only "zoned-aware" schedulers were
deadline and mq-deadline.  Since my test system defaulted to "cfq", I
ran into problems.
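
To make the "all writes must be sequential" rule concrete, here is a small
user-space model of a sequential-only zone. It is a sketch under stated
assumptions, not driver code: struct seq_zone and zone_write() are
hypothetical names, and a real device reports the failure through sense
data rather than a boolean.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>
#include <stdint.h>

/* Hypothetical model of one sequential-write-required zone. */
struct seq_zone {
	uint64_t start; /* first LBA of the zone */
	uint64_t len;   /* zone length in sectors */
	uint64_t wp;    /* write pointer: the only LBA a write may start at */
};

/*
 * Accept a write only if it starts exactly at the write pointer and fits
 * inside the zone. Anything else models the unaligned write error that a
 * host-managed drive returns when writes arrive out of order.
 */
bool zone_write(struct seq_zone *z, uint64_t lba, uint64_t nr_sectors)
{
	assert(z != NULL);

	if (lba != z->wp)
		return false;
	if (lba + nr_sectors > z->start + z->len)
		return false;

	z->wp += nr_sectors; /* the write pointer only ever moves forward */
	return true;
}
```

Submitting two 128-sector writes at LBA 0 and then LBA 128 succeeds; if a
scheduler swaps them, the write at LBA 128 is rejected because the write
pointer is still at 0.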

So I welcome any changes that make it impossible for the user to
"accidentally use the wrong scheduler".

At least this time, I didn't "brick" my test system's BIOS, like I did
back in May of this year [1].


Thanks,

Bryan


[1] https://www.spinics.net/lists/linux-block/msg26798.html
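
As a guard against the "wrong scheduler" accident described above, a test
script or tool could check the active scheduler before writing to a
host-managed drive. The sketch below only parses a line in the format of
the queue/scheduler sysfs attribute, where the active scheduler is shown in
square brackets; reading the file is left out, and active_sched() and
sched_is_zone_safe() are hypothetical helper names. The list of safe
schedulers follows the statement above that only deadline and mq-deadline
are zone-aware.

```c
#include <assert.h>
#include <stddef.h>
#include <string.h>

/*
 * Copy the active scheduler name (the one in square brackets) out of a line
 * such as "noop deadline [cfq]". Returns 0 on success, -1 if there is no
 * bracketed name or it does not fit into out.
 */
int active_sched(const char *line, char *out, size_t outlen)
{
	const char *lb, *rb;
	size_t n;

	assert(line != NULL && out != NULL);

	lb = strchr(line, '[');
	if (!lb)
		return -1;
	rb = strchr(lb, ']');
	if (!rb)
		return -1;

	n = (size_t)(rb - lb - 1);
	if (n + 1 > outlen)
		return -1;

	memcpy(out, lb + 1, n);
	out[n] = '\0';
	return 0;
}

/* Return 1 if the active scheduler is one of the zone-aware ones, else 0. */
int sched_is_zone_safe(const char *line)
{
	char name[32];

	if (active_sched(line, name, sizeof(name)) != 0)
		return 0;
	return strcmp(name, "deadline") == 0 ||
	       strcmp(name, "mq-deadline") == 0;
}
```

With the cfq default from the test system above, a line like
"noop deadline [cfq]" is reported as unsafe, which is the case the udev
rule was meant to prevent.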
Jan Kara Oct. 4, 2018, 7:38 a.m. UTC | #28
On Wed 03-10-18 17:55:41, Paolo Valente wrote:
> > On 03.10.2018 08:29, Paolo Valente wrote:
> >> As also Linus Torvalds complained [1], people feel lost among
> >> I/O-scheduler options.  Actual differences across I/O schedulers are
> >> basically obscure to non experts.  In this respect, Linux-kernel
> >> 'users' are way more than a few top-level distros that can afford a
> >> strong performance team, and that, basing on the input of such a team,
> >> might venture light-heartedly to change a critical component like an
> >> I/O scheduler.  Plus, as Linus Walleij pointed out, some users simply
> >> are not distros that use udev.
> > 
> > I feel a contradiction in this counter-argument. On one hand, there are lots of, let's call them, home users, that use major distributions with udev, so the distribution maintainers can reasonably decide which scheduler to use for which type of device based on the udev rule and common sense provided via Documentation/ by linux-block devs. Moreover, most likely, those rules should be similar or the same across all the major distros and available via some (systemd?) upstream.
> > 
> 
> Let me basically repeat Mark's answer here, in my own words.
> 
> Unfortunately, the facts do not match your optimistic view: after so many
> years and concordant test results, only very few distributions
> switched to bfq, no major distribution did (AFAIK).  As I already
> wrote, the reason is the one pointed out by Torvalds [1].  Do you want
> a simple example?  Take the last sentence in Jan's email in this
> thread: "I *personally would* consider bfq a safer default ...  but *I
> don't feel too strongly* about it." And he is definitely a storage
> expert.

Yeah, but let me add that currently all our released kernels still use the legacy
block stack for SCSI by default and thus CFQ/deadline. And once we feel
scsi-mq + BFQ is comparable enough for rotating disks (which may be after
your latest changes, Andreas will be running some larger evaluation), we
are going to switch to that instead of scsi + CFQ. So it's not like for us
it is a question between mq-deadline and BFQ, it is rather between scsi +
CFQ vs scsi-mq + BFQ.

								Honza
Johannes Thumshirn Oct. 4, 2018, 7:45 a.m. UTC | #29
On Wed, Oct 03, 2018 at 03:25:54PM +0200, Jan Kara wrote:
> On Wed 03-10-18 08:53:37, Linus Walleij wrote:
> > On Wed, Oct 3, 2018 at 8:29 AM Paolo Valente <paolo.valente@linaro.org> wrote:
> > 
> > > So, I do understand your need for conservativeness, but, after so much
> > > evidence on single-queue devices, and so many years! :), what's the
> > > point in keeping Linux worse for virtually everybody, by default?
> > 
> > I understand if we need to ease things in as well, I don't intend this
> > change for the current merge window or anything, since v4.19
> > will notably have this patch:
> > 
> > commit d5038a13eca72fb216c07eb717169092e92284f1
> > Author: Johannes Thumshirn <jthumshirn@suse.de>
> > Date:   Wed Jul 4 10:53:56 2018 +0200
> > 
> >     scsi: core: switch to scsi-mq by default
> > 
> >     It has been more than one year since we tried to change the default from
> >     legacy to multi queue in SCSI with commit c279bd9e406 ("scsi: default to
> >     scsi-mq"). But due to issues with suspend/resume and performance problems
> >     it had been reverted again with commit cbe7dfa26eee ("Revert "scsi: default
> >     to scsi-mq"").
> > 
> >     In the meantime there have been a substantial amount of performance
> >     improvements and suspend/resume got fixed as well, thus we can re-enable
> >     scsi-mq without a significant performance penalty.
> > 
> >     Signed-off-by: Johannes Thumshirn <jthumshirn@suse.de>
> >     Reviewed-by: Hannes Reinecke <hare@suse.com>
> >     Reviewed-by: Ming Lei <ming.lei@redhat.com>
> >     Acked-by: John Garry <john.garry@huawei.com>
> >     Signed-off-by: Martin K. Petersen <martin.petersen@oracle.com>
> > 
> > I guess that patch can be a bit scary by itself. But IIUC it all went
> > fine this time!
> > 
> > But hey, if that works, that means $SUBJECT patch will enable BFQ on all
> > libata devices and any SCSI that is single queue as well, not just
> > "obscure" stuff like MMC/SD and UBI, and that is
> > indeed a massive crowd of legacy devices. But we're talking
> > v4.21 here.
> > 
> > Johannes, you might be interested in $SUBJECT patch.
> > It'd be nice to hear what SUSE people have to add, since they
> > are pretty proactive in this area.
> 
> So we do have a udev rules in our distro which sets the IO scheduler based
> on device parameters (rotational at least, with blk-mq we might start
> considering number of queues as well, plus we have some exceptions like
> virtio, loop, etc.). So the kernel default doesn't concern us too much as a
> distro.
> 
> I personally would consider bfq a safer default for single-queue devices
> (loop probably needs exception) but I don't feel too strongly about it.

[Full quote for context]

What about resurrecting CONFIG_DEFAULT_IOSCHED for MQ as well and
leave it default to mq-deadline but give bfq, kyber and none as a
choice as well?

The question is shall we only do it for single queue devices or for
native MQ devices as well if we go down that road?

I understand the embedded folks will want a different interface than
udev, but from the non-embedded point of view I'm with Jens and Jan
here, let udev do the job.

      Johannes
Linus Walleij Oct. 4, 2018, 8:21 a.m. UTC | #30
On Wed, Oct 3, 2018 at 7:34 PM Bryan Gurney <bgurney@redhat.com> wrote:

> Right now, users of host-managed SMR drives should be using "deadline"
> or "mq-deadline", to avoid out-of-order writes in sequential-only
> zones.
>
> I'm running into a situation right now on a test system (Fedora 28,
> 4.18.7 kernel) where I copied test data onto an F2FS filesystem, but I
> accidentally forgot to add my "udev rule" file:

This should be fixed after
d5038a13eca7 scsi: core: switch to scsi-mq by default
right?

Since mq uses mq-deadline by default.

I'm making sure to preserve mq-deadline on zoned devices
in my v2 of this patch.

Yours,
Linus Walleij
Andreas Herrmann Oct. 4, 2018, 8:24 a.m. UTC | #31
On Thu, Oct 04, 2018 at 09:45:35AM +0200, Johannes Thumshirn wrote:
> On Wed, Oct 03, 2018 at 03:25:54PM +0200, Jan Kara wrote:
> > On Wed 03-10-18 08:53:37, Linus Walleij wrote:
> > > On Wed, Oct 3, 2018 at 8:29 AM Paolo Valente <paolo.valente@linaro.org> wrote:
> > > 
> > > > So, I do understand your need for conservativeness, but, after so much
> > > > evidence on single-queue devices, and so many years! :), what's the
> > > > point in keeping Linux worse for virtually everybody, by default?
> > > 
> > > I understand if we need to ease things in as well, I don't intend this
> > > change for the current merge window or anything, since v4.19
> > > will notably have this patch:
> > > 
> > > commit d5038a13eca72fb216c07eb717169092e92284f1
> > > Author: Johannes Thumshirn <jthumshirn@suse.de>
> > > Date:   Wed Jul 4 10:53:56 2018 +0200
> > > 
> > >     scsi: core: switch to scsi-mq by default
> > > 
> > >     It has been more than one year since we tried to change the default from
> > >     legacy to multi queue in SCSI with commit c279bd9e406 ("scsi: default to
> > >     scsi-mq"). But due to issues with suspend/resume and performance problems
> > >     it had been reverted again with commit cbe7dfa26eee ("Revert "scsi: default
> > >     to scsi-mq"").
> > > 
> > >     In the meantime there have been a substantial amount of performance
> > >     improvements and suspend/resume got fixed as well, thus we can re-enable
> > >     scsi-mq without a significant performance penalty.
> > > 
> > >     Signed-off-by: Johannes Thumshirn <jthumshirn@suse.de>
> > >     Reviewed-by: Hannes Reinecke <hare@suse.com>
> > >     Reviewed-by: Ming Lei <ming.lei@redhat.com>
> > >     Acked-by: John Garry <john.garry@huawei.com>
> > >     Signed-off-by: Martin K. Petersen <martin.petersen@oracle.com>
> > > 
> > > I guess that patch can be a bit scary by itself. But IIUC it all went
> > > fine this time!
> > > 
> > > But hey, if that works, that means $SUBJECT patch will enable BFQ on all
> > > libata devices and any SCSI that is single queue as well, not just
> > > "obscure" stuff like MMC/SD and UBI, and that is
> > > indeed a massive crowd of legacy devices. But we're talking
> > > v4.21 here.
> > > 
> > > Johannes, you might be interested in $SUBJECT patch.
> > > It'd be nice to hear what SUSE people have to add, since they
> > > are pretty proactive in this area.
> > 
> > So we do have a udev rules in our distro which sets the IO scheduler based
> > on device parameters (rotational at least, with blk-mq we might start
> > considering number of queues as well, plus we have some exceptions like
> > virtio, loop, etc.). So the kernel default doesn't concern us too much as a
> > distro.
> > 
> > I personally would consider bfq a safer default for single-queue devices
> > (loop probably needs exception) but I don't feel too strongly about it.
> 
> [Full quote for context]
> 
> What about resurrecting CONFIG_DEFAULT_IOSCHED for MQ as well and
> leave it default to mq-deadline but give bfq, kyber and none as a
> choice as well?

I second this -- introduction of a CONFIG_DEFAULT_MQ_IOSCHED.
Having a default I/O scheduler kernel config option for MQ allows
building a kernel suited to a specific use w/o userspace
dependencies.
(But it still allows to reconfigure things via userspace.)

> The question is shall we only do it for single queue devices or for
> native MQ devices as well if we go down that road?

Good question. I am not yet sure about this.
I'd start with using the default for single queue devices.
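
As a rough illustration of that starting point, here is a minimal
userspace sketch in C, not kernel code. DEFAULT_MQ_IOSCHED is a
hypothetical stand-in for the proposed config option, and
initial_mq_sched() and sched_is_known() are made-up helper names. The
sketch assumes that multi-queue devices stay on "none" and that an
unknown configured name falls back to mq-deadline.

```c
#include <assert.h>
#include <stddef.h>
#include <string.h>

/* Hypothetical build-time knob standing in for the proposed
 * CONFIG_DEFAULT_MQ_IOSCHED; this is not an existing kernel option. */
#ifndef DEFAULT_MQ_IOSCHED
#define DEFAULT_MQ_IOSCHED "mq-deadline"
#endif

/* The choices named in this thread. */
static const char *const known_scheds[] = {
	"mq-deadline", "bfq", "kyber", "none",
};

/* Return 1 if name is one of the schedulers listed above, else 0. */
static int sched_is_known(const char *name)
{
	size_t n = sizeof(known_scheds) / sizeof(known_scheds[0]);

	for (size_t i = 0; i < n; i++)
		if (strcmp(known_scheds[i], name) == 0)
			return 1;
	return 0;
}

/* Pick the initial scheduler name for a queue.
 * Assumption: only single-queue devices honour the configured default,
 * everything else keeps "none". */
const char *initial_mq_sched(const char *configured, unsigned int nr_hw_queues)
{
	assert(configured != NULL);
	if (nr_hw_queues != 1)
		return "none";
	return sched_is_known(configured) ? configured : "mq-deadline";
}
```

Calling initial_mq_sched(DEFAULT_MQ_IOSCHED, 1) gives the build-time
choice; userspace could still switch the scheduler later.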

Andreas

> I understand the embedded folks will want a different interface than
> udev, but from the non-embedded point of view I'm with Jens and Jan
> here, let udev do the job.
> 
>       Johannes
> -- 
> Johannes Thumshirn                                          Storage
> jthumshirn@suse.de                                +49 911 74053 689
> SUSE LINUX GmbH, Maxfeldstr. 5, 90409 Nürnberg
> GF: Felix Imendörffer, Jane Smithard, Graham Norton
> HRB 21284 (AG Nürnberg)
> Key fingerprint = EC38 9CAB C2C4 F25D 8600 D0D0 0393 969D 2D76 0850
Linus Walleij Oct. 4, 2018, 8:25 a.m. UTC | #32
On Wed, Oct 3, 2018 at 1:49 PM Oleksandr Natalenko
<oleksandr@natalenko.name> wrote:

> On another hand, the users of embedded devices, mentioned by Linus,
> should already know what scheduler to choose because dealing with
> embedded world assumes the person can decide this on their own, or with
> the help of abovementioned udev scripts and/or Documentation/ as a
> reference point.
>
> So I see no obstacles here, and the choice to rely on udev by default
> sounds reasonable.

I am sorry but I do not agree with this.

There are several historical precedents where we have
concluded that just "have the kernel do the right thing
by default" is the way to go.

Example 1: pluggable CPU schedulers.
The reasoning was that users or distros have no clue what
scheduler they want, only scheduler developers do. We
drove it to the point where we have one and one
scheduler only, not different flavors. (Special
usecases have special scheduling classes inside the
one scheduler instead.)

Example 2: Automatic process group scheduling
The reasoning was that daemons such as systemd would
be better at placing processes/tasks into the right
control groups to manage their resources, so this would
be a userspace policy handled by the udev/systemd
complex. We did not do that. Instead the kernel does
autogrouping per-session, indeed it is a Kconfig option
but even e.g. Fedora has this enabled by default.
(commit 5091faa449ee)

As pointed out elsewhere: these defaults make it
easy for custom builds not using udev+systemd to
get a system up and running with sensible defaults.

Simple embedded systems use Busybox' mdev (I wouldn't
trust it to make any complex decisions). OpenWRT
has ubox+ubus+uci, also extremely lightweight,
Android has its own init system that I don't
manage to keep track of anymore. Instead of running
all over the map and fixing these userspaces to
do the right thing, it makes sense to make the
right thing the default.

And these are millions and millions of deployed
systems not using udev+systemd we are talking about,
they are not fringe hobby projects. It's not that I
personally dislike udev or anything, I kind of like
it, but these tailored distros simply don't use it
and they are huge in numbers. They need help to do
the right thing. Fixing a udev rule doesn't solve
even half the world's problems I'm afraid.

Yours,
Linus Walleij
Ulf Hansson Oct. 4, 2018, 9:56 a.m. UTC | #33
On 3 October 2018 at 19:34, Bryan Gurney <bgurney@redhat.com> wrote:
> On Wed, Oct 3, 2018 at 11:53 AM, Paolo Valente <paolo.valente@linaro.org> wrote:
>>
>>
>>> Il giorno 03 ott 2018, alle ore 10:28, Linus Walleij <linus.walleij@linaro.org> ha scritto:
>>>
>>> On Wed, Oct 3, 2018 at 9:42 AM Damien Le Moal <Damien.LeMoal@wdc.com> wrote:
>>>
>>>> There is another class of outliers: host-managed SMR disks (SATA and SCSI,
>>>> definitely single hw queue). For these, using mq-deadline is mandatory in many
>>>> cases in order to guarantee sequential write command delivery to the device
>>>> driver. Having the default changed to bfq, which as far as I know is not SMR
>>>> friendly (can sequential writes within a single zone be reordered ?) is asking
>>>> for troubles (unaligned write errors showing up).
>>>
>>> Ah, that is interesting.
>>>
>>> Which device driver files are we talking about here, specifically?
>>> I'd like to take a look.
>>>
>>> I guess what you say is not that you are looking for the deadline
>>> scheduling per se (as in deadline scheduling is nice), what you want is
>>> the zone locking semantics in that scheduler, is that right?
>>>
>>> I.e. this business:
>>> blk_queue_is_zoned(q)
>>> blk_req_zone_write_lock(rq);
>>> blk_req_zone_write_unlock(rq);
>>> and mq-deadline solves this with a spinlock.
>>>
>>> I will augment the patch to enforce mq-deadline
>>> if blk_queue_is_zoned(q) is true, as it is clear that
>>> any device with that characteristic must use mq-deadline.
>>>
>>> Paolo might be interested in looking into whether BFQ could
>>> also handle zoned devices in the future, I have no idea of how
>>> hard that would be.
>>>
>>
>> Absolutely, as I already wrote in my reply to Damien.
>>
>> In the meantime, Linus, augmenting your patch as you propose seems
>> a clean and effective solution to me.
>>
>> Thanks,
>> Paolo
>>
>>> The zoned business seems a bit fragile. Should it even be
>>> allowed to select any other scheduler than deadline on these
>>> devices? Presenting all compiled in schedulers in
>>> /sys/block/<device>/queue/scheduler sounds like just giving
>>> sysadmins too much rope.
>>>
>>> Yours,
>>> Linus Walleij
>>
>
> Right now, users of host-managed SMR drives should be using "deadline"
> or "mq-deadline", to avoid out-of-order writes in sequential-only
> zones.
>
> I'm running into a situation right now on a test system (Fedora 28,
> 4.18.7 kernel) where I copied test data onto an F2FS filesystem, but I
> accidentally forgot to add my "udev rule" file:
>
> # cat /etc/udev/rules.d/99-zoned-block-devices.rules
> ACTION=="add|change", KERNEL=="sd[a-z]",
> ATTRS{queue/zoned}=="host-managed", ATTR{queue/scheduler}="deadline"
>
> ...and now, I see these messages when that specific SMR drive is mounted:
>
> kernel: F2FS-fs (sdc): IO Block Size:        4 KB
> kernel: F2FS-fs (sdc): Found nat_bits in checkpoint
> kernel: F2FS-fs (sdc): Mounted with checkpoint version = 212216ab
> kernel: mpt3sas_cm0: log_info(0x31080000): originator(PL), code(0x08),
> sub_code(0x0000)
> kernel: mpt3sas_cm0: log_info(0x31080000): originator(PL), code(0x08),
> sub_code(0x0000)
> kernel: scsi_io_completion: 20 callbacks suppressed
> kernel: sd 7:0:0:0: [sdb] tag#0 FAILED Result: hostbyte=DID_OK
> driverbyte=DRIVER_SENSE
> kernel: sd 7:0:0:0: [sdb] tag#0 Sense Key : Aborted Command [current]
> kernel: sd 7:0:0:0: [sdb] tag#0 Add. Sense: No additional sense information
> kernel: sd 7:0:0:0: [sdb] tag#0 CDB: Write(16) 8a 00 00 00 00 00 3d d4
> ec 99 00 00 00 80 00 00
>
> I was also running into problems with creating new directories on this
> F2FS filesystem.  However, "fsck.f2fs" reports no problems.  So at
> this point, I created a new F2FS filesystem on a second SMR drive, and
> am currently copying the data from the "bad" F2FS filesystem to the
> "good" one.
>
> I wouldn't call zoned block devices "fragile"; they simply have I/O
> rules that didn't previously exist: all writes to sequential-only
> zones must be sequential.  And one of the things that schedulers do is
> reorder writes.  After 4.16, sd stopped being the "gatekeeper" of
> ensuring sequential writes, but the only "zoned-aware" schedulers were
> deadline and mq-deadline.  Since my test system defaulted to "cfq", I
> ran into problems.
>
> So I welcome any changes that make it impossible for the user to
> "accidentally use the wrong scheduler".

I fully agree.

>
> At least this time, I didn't "brick" my test system's BIOS, like I did
> back in May of this year [1].

It sounds to me like the kernel isn't doing its job. In particular,
the kernel has the information it needs to select the proper
I/O scheduler (the block layer could just check
BLK_ZONE_TYPE_SEQWRITE_REQ/ZBC_ZONE_TYPE_SEQWRITE_REQ). Instead it
relies on userspace to do the right thing, which can't be right.
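
A minimal sketch of such an in-kernel decision, written as a userspace
C model under stated assumptions: struct queue_info and
pick_default_sched() are hypothetical names, the zoned flag stands in
for what blk_queue_is_zoned(q) would report, and the BFQ branch mirrors
the proposal in this thread rather than any merged code.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>

enum sched { SCHED_NONE, SCHED_MQ_DEADLINE, SCHED_BFQ };

/* Hypothetical summary of what the block layer knows about a queue. */
struct queue_info {
	unsigned int nr_hw_queues;
	bool zoned;          /* models blk_queue_is_zoned(q) */
	bool bfq_available;  /* BFQ compiled in or loadable */
};

/* Choose a default scheduler from device properties alone. */
enum sched pick_default_sched(const struct queue_info *q)
{
	assert(q != NULL);

	/* Zoned devices need the zone write locking in mq-deadline. */
	if (q->zoned)
		return SCHED_MQ_DEADLINE;

	/* Assumption: multi-queue devices keep "none". */
	if (q->nr_hw_queues != 1)
		return SCHED_NONE;

	/* Single queue: BFQ if present, else mq-deadline as before. */
	return q->bfq_available ? SCHED_BFQ : SCHED_MQ_DEADLINE;
}
```

With the zoned check first, a host-managed SMR disk ends up on
mq-deadline regardless of whether any udev rule was installed.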

Kind regards
Uffe
Mark Brown Oct. 4, 2018, 10:13 a.m. UTC | #34
On Thu, Oct 04, 2018 at 10:14:38AM +0200, Linus Walleij wrote:

> And these are millions and millions of deployed
> systems not using udev+systemd we are talking about,
> they are not fringe hobby projects. It's not that I
> personally dislike udev or anything, I kind of like
> it, but these tailored distros simply don't use it
> and they are huge in numbers. They need help to do
> the right thing. Fixing a udev rule doesn't solve
> even half the world's problems I'm afraid.

Further, even those embedded systems that do use udev (some of them do
lean heavily on modern init stuff like that and systemd, especially when
boot time is a priority) they'll still need to get the relevant udev
rule installed somehow.
Bart Van Assche Oct. 4, 2018, 3:10 p.m. UTC | #35
On Thu, 2018-10-04 at 11:13 +0100, Mark Brown wrote:
> On Thu, Oct 04, 2018 at 10:14:38AM +0200, Linus Walleij wrote:
> 
> > And these are millions and millions of deployed
> > systems not using udev+systemd we are talking about,
> > they are not fringe hobby projects. It's not that I
> > personally dislike udev or anything, I kind of like
> > it, but these tailored distros simply don't use it
> > and they are huge in numbers. They need help to do
> > the right thing. Fixing a udev rule doesn't solve
> > even half the world's problems I'm afraid.
> 
> Further, even those embedded systems that do use udev (some of them do
> lean heavily on modern init stuff like that and systemd, especially when
> boot time is a priority) they'll still need to get the relevant udev
> rule installed somehow.

Hi Mark,

Are you aware that the systemd source tree includes a set of udev rules?
See also https://github.com/systemd/systemd/tree/master/rules.

Bart.
Mark Brown Oct. 4, 2018, 3:26 p.m. UTC | #36
On Thu, Oct 04, 2018 at 08:10:57AM -0700, Bart Van Assche wrote:
> On Thu, 2018-10-04 at 11:13 +0100, Mark Brown wrote:

> > Further, even those embedded systems that do use udev (some of them do
> > lean heavily on modern init stuff like that and systemd, especially when
> > boot time is a priority) they'll still need to get the relevant udev
> > rule installed somehow.

> Are you aware that the systemd source tree includes a set of udev rules?
> See also https://github.com/systemd/systemd/tree/master/rules.

Yeah, but then you're back to the situation where someone needs to go
pick up a new version of systemd to get the new rules along with the new
kernel.  It's not insurmountable but it's an obstacle.
Bart Van Assche Oct. 4, 2018, 8:09 p.m. UTC | #37
On Thu, 2018-10-04 at 20:25 +0100, Alan Cox wrote:
> > I agree with Jens that it's best to leave it to the Linux distributors to
> > select a default I/O scheduler.
> 
> That assumes such a thing exists. The kernel knows what devices it is
> dealing with. The kernel 'default' ought to be 'whatever is usually best
> for this device'. A distro cannot just pick a correct single default
> because NVME and USB sticks are both normal and rather different in needs.

Which I/O scheduler works best also depends which workload the user will run.
BFQ has significant advantages for interactive workloads like video replay
with concurrent background I/O but probably slows down kernel builds. That's
why I'm not sure whether the kernel should select the default I/O scheduler.

Bart.
Paolo Valente Oct. 4, 2018, 8:19 p.m. UTC | #38
> Il giorno 04 ott 2018, alle ore 21:25, Alan Cox <gnomes@lxorguk.ukuu.org.uk> ha scritto:
> 
>> I agree with Jens that it's best to leave it to the Linux distributors to
>> select a default I/O scheduler.
> 
> That assumes such a thing exists.

Well, as of now the default is more or less in the Schrödinger's cat state :)

Metaphors apart, I do agree with your point.  As for me, what you
point out is one of the core issues at stake here.

Thanks,
Paolo

> The kernel knows what devices it is
> dealing with. The kernel 'default' ought to be 'whatever is usually best
> for this device'. A distro cannot just pick a correct single default
> because NVME and USB sticks are both normal and rather different in needs.
> 
> Alan
Paolo Valente Oct. 4, 2018, 8:39 p.m. UTC | #39
> Il giorno 04 ott 2018, alle ore 22:09, Bart Van Assche <bvanassche@acm.org> ha scritto:
> 
> On Thu, 2018-10-04 at 20:25 +0100, Alan Cox wrote:
>>> I agree with Jens that it's best to leave it to the Linux distributors to
>>> select a default I/O scheduler.
>> 
>> That assumes such a thing exists. The kernel knows what devices it is
>> dealing with. The kernel 'default' ought to be 'whatever is usually best
>> for this device'. A distro cannot just pick a correct single default
>> because NVME and USB sticks are both normal and rather different in needs.
> 
> Which I/O scheduler works best also depends which workload the user will run.
> BFQ has significant advantages for interactive workloads like video replay
> with concurrent background I/O but probably slows down kernel builds.

No, kernel build is, for evident reasons, one of the workloads I cared
most about.  Actually, I tried to focus on all my main
kernel-development tasks, such as also git checkout, git merge, git
grep, ...

According to my test results, with BFQ these tasks are at least as
fast as, or, in most system configurations, much faster than with the
other schedulers.  Of course, at the same time the system also remains
responsive with BFQ.

You can repeat these tests using one of my first scripts in the S
suite: kern_dev_tasks_vs_rw.sh (usually, the older the tests, the more
hypertrophied the names I gave :) ).

I stopped sharing also my kernel-build results years ago, because I
went on obtaining the same, identical good results for years, and I'm
aware that I tend to show and say too much stuff.

Thanks,
Paolo

> That's
> why I'm not sure whether the kernel should select the default I/O scheduler.
> 
> Bart.
Bart Van Assche Oct. 4, 2018, 10:42 p.m. UTC | #40
On Thu, 2018-10-04 at 22:39 +0200, Paolo Valente wrote:
> No, kernel build is, for evident reasons, one of the workloads I cared
> most about.  Actually, I tried to focus on all my main
> kernel-development tasks, such as also git checkout, git merge, git
> grep, ...
> 
> According to my test results, with BFQ these tasks are at least as
> fast as, or, in most system configurations, much faster than with the
> other schedulers.  Of course, at the same time the system also remains
> responsive with BFQ.
> 
> You can repeat these tests using one of my first scripts in the S
> suite: kern_dev_tasks_vs_rw.sh (usually, the older the tests, the more
> hypertrophied the names I gave :) ).
> 
> I stopped sharing also my kernel-build results years ago, because I
> went on obtaining the same, identical good results for years, and I'm
> aware that I tend to show and say too much stuff.

On my test setup building the kernel is slightly slower when using the BFQ
scheduler compared to using scheduler "none" (kernel 4.18.12, NVMe SSD,
single CPU with 6 cores, hyperthreading disabled). I am aware that the
proposal at the start of this thread was to make BFQ the default for devices
with a single hardware queue and not for devices like NVMe SSDs that support
multiple hardware queues.

What I think is missing is measurement results for BFQ on a system with
multiple CPU sockets and against a fast storage medium. Eliminating
the host lock from the SCSI core yielded a significant performance
improvement for such storage devices. Since the BFQ scheduler locks and
unlocks bfqd->lock for every dispatch operation it is very likely that BFQ
will slow down I/O for fast storage devices, even if their driver only
creates a single hardware queue.

Bart.
Christoph Hellwig Oct. 5, 2018, 6:24 a.m. UTC | #41
On Wed, Oct 03, 2018 at 08:15:24AM -0700, Bart Van Assche wrote:
> On Wed, 2018-10-03 at 08:01 -0700, Christoph Hellwig wrote:
> > On Wed, Oct 03, 2018 at 07:58:52AM -0700, Bart Van Assche wrote:
> > > Has the work with the T10 committee to standardize the SCSI equivalent of anonymous
> > > writes already started?
> > 
> > No, and I don't know of anyone who wants to do that in the short term.
> 
> That's unfortunate. I think having such a command available in the SCSI
> command set would be a step forward.

I'm not saying it doesn't make sense, only that I don't know of any short
term plans.
Artem Bityutskiy Oct. 5, 2018, 6:24 a.m. UTC | #42
On Thu, 2018-10-04 at 13:09 -0700, Bart Van Assche wrote:
> On Thu, 2018-10-04 at 20:25 +0100, Alan Cox wrote:
> > > I agree with Jens that it's best to leave it to the Linux distributors to
> > > select a default I/O scheduler.
> > 
> > That assumes such a thing exists. The kernel knows what devices it is
> > dealing with. The kernel 'default' ought to be 'whatever is usually best
> > for this device'. A distro cannot just pick a correct single default
> > because NVME and USB sticks are both normal and rather different in needs.
> 
> Which I/O scheduler works best also depends which workload the user will run.
> BFQ has significant advantages for interactive workloads like video replay
> with concurrent background I/O but probably slows down kernel builds. That's
> why I'm not sure whether the kernel should select the default I/O scheduler.

What's wrong with this simple hierarchy?

1. Block core selects the default scheduler.
2. Driver can overrule it early.
3. Userspace can overrule the default later.

Everyone is happy.

Good defaults in block core are great. Those defaults + #3 may cover
99% of the population.

1% of the population can use #2. See, Linus wants "bfq" for ubiblock.
Why wouldn't we let him work with the UBI community, show that bfq is
best for ubiblock, and just let the UBI community overrule the block
core's default?

If some day in the future there is a very good reason, we can even make
this a module parameter, and people could just boot with
'ubiblock.iosched=bfq'.
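
The three-level hierarchy can be sketched as a small precedence
function. This is an illustrative userspace model in C, not an existing
kernel interface: struct sched_choice and effective_sched() are
hypothetical names, and a NULL pointer is assumed to mean "no override
at this level".

```c
#include <assert.h>
#include <stddef.h>

/* One scheduler name per level of the hierarchy (hypothetical layout). */
struct sched_choice {
	const char *core_default;     /* 1. block core default, never NULL */
	const char *driver_override;  /* 2. set early by the driver, or NULL */
	const char *user_override;    /* 3. set later by userspace, or NULL */
};

/* Later levels win: userspace over driver over block core. */
const char *effective_sched(const struct sched_choice *c)
{
	assert(c != NULL && c->core_default != NULL);

	if (c->user_override)
		return c->user_override;
	if (c->driver_override)
		return c->driver_override;
	return c->core_default;
}
```

In the ubiblock example the driver would fill in "bfq" at level 2,
while an administrator writing another name at level 3 still has the
last word.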
Pavel Machek Oct. 5, 2018, 8:04 a.m. UTC | #43
Hi!

> I talked to Pavel a bit back and it turns out he has a
> usecase for BFQ as well and I bet he also would like it
> as default scheduler for that system (Pavel tell us more,
> I don't remember what it was!)

I'm not sure I remember clearly, either.

IIRC I was working with ionice on spinning disks, and it had no
effect. I switched to BFQ and suddenly ionice was effective.

Best regards,
									Pavel
Jan Kara Oct. 5, 2018, 9:16 a.m. UTC | #44
On Thu 04-10-18 15:42:52, Bart Van Assche wrote:
> On Thu, 2018-10-04 at 22:39 +0200, Paolo Valente wrote:
> > No, kernel build is, for evident reasons, one of the workloads I cared
> > most about.  Actually, I tried to focus on all my main
> > kernel-development tasks, such as also git checkout, git merge, git
> > grep, ...
> > 
> > According to my test results, with BFQ these tasks are at least as
> > fast as, or, in most system configurations, much faster than with the
> > other schedulers.  Of course, at the same time the system also remains
> > responsive with BFQ.
> > 
> > You can repeat these tests using one of my first scripts in the S
> > suite: kern_dev_tasks_vs_rw.sh (usually, the older the tests, the more
> > hypertrophied the names I gave :) ).
> > 
> > I stopped sharing also my kernel-build results years ago, because I
> > went on obtaining the same, identical good results for years, and I'm
> > aware that I tend to show and say too much stuff.
> 
> On my test setup building the kernel is slightly slower when using the BFQ
> scheduler compared to using scheduler "none" (kernel 4.18.12, NVMe SSD,
> single CPU with 6 cores, hyperthreading disabled). I am aware that the
> proposal at the start of this thread was to make BFQ the default for devices
> with a single hardware queue and not for devices like NVMe SSDs that support
> multiple hardware queues.
> 
> What I think is missing is measurement results for BFQ on a system with
> multiple CPU sockets and against a fast storage medium. Eliminating
> the host lock from the SCSI core yielded a significant performance
> improvement for such storage devices. Since the BFQ scheduler locks and
> unlocks bfqd->lock for every dispatch operation it is very likely that BFQ
> will slow down I/O for fast storage devices, even if their driver only
> creates a single hardware queue.

Well, I'm not sure why that is missing. I don't think anyone proposed to
default to BFQ for such setup? Neither was anyone claiming that BFQ is
better in such situation... The proposal has been: Default to BFQ for slow
storage, leave it to deadline-mq otherwise.

								Honza
Paolo Valente Oct. 5, 2018, 9:28 a.m. UTC | #45
> Il giorno 05 ott 2018, alle ore 00:42, Bart Van Assche <bvanassche@acm.org> ha scritto:
> 
> On Thu, 2018-10-04 at 22:39 +0200, Paolo Valente wrote:
>> No, kernel build is, for evident reasons, one of the workloads I cared
>> most about.  Actually, I tried to focus on all my main
>> kernel-development tasks, such as also git checkout, git merge, git
>> grep, ...
>> 
>> According to my test results, with BFQ these tasks are at least as
>> fast as, or, in most system configurations, much faster than with the
>> other schedulers.  Of course, at the same time the system also remains
>> responsive with BFQ.
>> 
>> You can repeat these tests using one of my first scripts in the S
>> suite: kern_dev_tasks_vs_rw.sh (usually, the older the tests, the more
>> hypertrophied the names I gave :) ).
>> 
>> I stopped sharing also my kernel-build results years ago, because I
>> went on obtaining the same, identical good results for years, and I'm
>> aware that I tend to show and say too much stuff.
> 
> On my test setup building the kernel is slightly slower when using the BFQ
> scheduler compared to using scheduler "none" (kernel 4.18.12, NVMe SSD,
> single CPU with 6 cores, hyperthreading disabled). I am aware that the
> proposal at the start of this thread was to make BFQ the default for devices
> with a single hardware queue and not for devices like NVMe SSDs that support
> multiple hardware queues.
> 

I miss your point: as you yourself note, the proposal is limited to
single-queue devices, exactly because BFQ is not ready for
multiple-queue devices yet.

> What I think is missing is measurement results for BFQ on a system with
> multiple CPU sockets and against a fast storage medium.

It is not missing.  As I happened to report in previous threads, we
made a script to measure that too [1], using fio and null block.

I have reported the results we obtained, for three classes of
processors, in the in-kernel BFQ documentation [2].

In particular, BFQ reached 400KIOPS with the fastest CPU mentioned in
that document (Intel i7-4850HQ).

So, since the speed of that single-socket commodity CPU is most likely
lower than the total speed of a multi-socket system, we have that, on
such a system and with BFQ, you should be conservatively ok with
single-queue devices in the range 300-500 KIOPS.

[1] https://github.com/Algodev-github/IOSpeed
[2] https://www.kernel.org/doc/Documentation/block/bfq-iosched.txt

>  

> Eliminating
> the host lock from the SCSI core yielded a significant performance
> improvement for such storage devices. Since the BFQ scheduler locks and
> unlocks bfqd->lock for every dispatch operation it is very likely that BFQ
> will slow down I/O for fast storage devices, even if their driver only
> creates a single hardware queue.
> 

One of the main motivations behind NVMe, and blk-mq itself, is that it
is hard to reach the above IOPS, and more, with a single I/O queue as
bottleneck.

So, I wouldn't expect that systems
- equipped with single-queue drives reaching more than 500 KIOPS
- using SATA or some other non-NVMe as protocol
- so fast to push these drives to their maximum speeds
constitute more than a negligible percentage of devices.

So, by sticking to mq-deadline, we would sacrifice 99% of systems, to
make sure, basically, that those very few systems on steroids reach
maximum throughput with random I/O (while however still suffering from
responsiveness problems).  I think it makes much more sense to have as
default what is best for 99% of the single-queue systems, with those
super systems properly reconfigured by their users.  For sure, other
defaults are to be changed too, to get the most out of those systems.

Thanks,
Paolo


> Bart.
Pavel Machek Oct. 5, 2018, 9:49 a.m. UTC | #46
Hi!

> > On another hand, the users of embedded devices, mentioned by Linus,
> > should already know what scheduler to choose because dealing with
> > embedded world assumes the person can decide this on their own, or with
> > the help of abovementioned udev scripts and/or Documentation/ as a
> > reference point.
> >
> > So I see no obstacles here, and the choice to rely on udev by default
> > sounds reasonable.
> >
> 
> I am sorry but I do not agree with this.
> 
> There are several historical precedents where we have
> concluded that just "have the kernel do the right thing
> by default" is the way to go.

Kernel should do the right thing by default, I agree with Linus W.

Having reasonable defaults is useful; yes, my desktop has udev etc,
but I still want reasonable scheduler when doing fsck in
init=/bin/bash.

Plus, I update kernels more often than distros, and I run various
embedded stuff.

Kernel should just provide reasonable defaults.

And yes, we have "ionice" command and yes, it would be nice if it
worked by default...

									Pavel
Bart Van Assche Oct. 6, 2018, 3:12 a.m. UTC | #47
On 10/5/18 2:16 AM, Jan Kara wrote:
> On Thu 04-10-18 15:42:52, Bart Van Assche wrote:
>> What I think is missing is measurement results for BFQ on a system with
>> multiple CPU sockets and against a fast storage medium. Eliminating
>> the host lock from the SCSI core yielded a significant performance
>> improvement for such storage devices. Since the BFQ scheduler locks and
>> unlocks bfqd->lock for every dispatch operation it is very likely that BFQ
>> will slow down I/O for fast storage devices, even if their driver only
>> creates a single hardware queue.
> 
> Well, I'm not sure why that is missing. I don't think anyone proposed to
> default to BFQ for such setup? Neither was anyone claiming that BFQ is
> better in such situation... The proposal has been: Default to BFQ for slow
> storage, leave it to deadline-mq otherwise.

Hi Jan,

How do you define slow storage? The proposal at the start of this thread 
was to make BFQ the default for all block devices that create a single 
hardware queue. That includes all SATA storage since scsi-mq only 
creates a single hardware queue when using the SATA protocol. The 
proposal to make BFQ the default for systems with a single hard disk 
probably makes sense but I am not sure that making BFQ the default for 
systems equipped with one or more (SATA) SSDs is also a good idea. 
Especially for multi-socket systems since BFQ reintroduces a queue-wide 
lock. As you know no queue-wide locking happens during I/O in the 
scsi-mq core nor in the blk-mq core.

Bart.
Paolo Valente Oct. 6, 2018, 6:46 a.m. UTC | #48
> Il giorno 06 ott 2018, alle ore 05:12, Bart Van Assche <bvanassche@acm.org> ha scritto:
> 
> On 10/5/18 2:16 AM, Jan Kara wrote:
>> On Thu 04-10-18 15:42:52, Bart Van Assche wrote:
>>> What I think is missing is measurement results for BFQ on a system with
>>> multiple CPU sockets and against a fast storage medium. Eliminating
>>> the host lock from the SCSI core yielded a significant performance
>>> improvement for such storage devices. Since the BFQ scheduler locks and
>>> unlocks bfqd->lock for every dispatch operation it is very likely that BFQ
>>> will slow down I/O for fast storage devices, even if their driver only
>>> creates a single hardware queue.
>> Well, I'm not sure why that is missing. I don't think anyone proposed to
>> default to BFQ for such setup? Neither was anyone claiming that BFQ is
>> better in such situation... The proposal has been: Default to BFQ for slow
>> storage, leave it to deadline-mq otherwise.
> 
> Hi Jan,
> 
> How do you define slow storage? The proposal at the start of this thread
> was to make BFQ the default for all block devices that create a single
> hardware queue. That includes all SATA storage since scsi-mq only creates
> a single hardware queue when using the SATA protocol. The proposal to make
> BFQ the default for systems with a single hard disk probably makes sense
> but I am not sure that making BFQ the default for systems equipped with
> one or more (SATA) SSDs is also a good idea. Especially for multi-socket
> systems since BFQ reintroduces a queue-wide lock.

No, BFQ has no queue-wide lock.  The very first change made to BFQ for
porting it to blk-mq was to remove the queue lock.  Guided by Jens, I
replaced that lock with the exact, same scheduler lock used in
mq-deadline.

Thanks,
Paolo

> As you know no queue-wide locking happens during I/O in the scsi-mq core nor in the blk-mq core.
> 
> Bart.
Bart Van Assche Oct. 6, 2018, 4:20 p.m. UTC | #49
On 10/5/18 11:46 PM, Paolo Valente wrote:
>> Il giorno 06 ott 2018, alle ore 05:12, Bart Van Assche <bvanassche@acm.org> ha scritto:
>> On 10/5/18 2:16 AM, Jan Kara wrote:
>>> On Thu 04-10-18 15:42:52, Bart Van Assche wrote:
>>>> What I think is missing is measurement results for BFQ on a system with
>>>> multiple CPU sockets and against a fast storage medium. Eliminating
>>>> the host lock from the SCSI core yielded a significant performance
>>>> improvement for such storage devices. Since the BFQ scheduler locks and
>>>> unlocks bfqd->lock for every dispatch operation it is very likely that BFQ
>>>> will slow down I/O for fast storage devices, even if their driver only
>>>> creates a single hardware queue.
>>> Well, I'm not sure why that is missing. I don't think anyone proposed to
>>> default to BFQ for such setup? Neither was anyone claiming that BFQ is
>>> better in such situation... The proposal has been: Default to BFQ for slow
>>> storage, leave it to deadline-mq otherwise.
>>
>> How do you define slow storage? The proposal at the start of this thread
>> was to make BFQ the default for all block devices that create a single
>> hardware queue. That includes all SATA storage since scsi-mq only creates
>> a single hardware queue when using the SATA protocol. The proposal to make
>> BFQ the default for systems with a single hard disk probably makes sense
>> but I am not sure that making BFQ the default for systems equipped with
>> one or more (SATA) SSDs is also a good idea. Especially for multi-socket
>> systems since BFQ reintroduces a queue-wide lock.
> 
> No, BFQ has no queue-wide lock.  The very first change made to BFQ for
> porting it to blk-mq was to remove the queue lock.  Guided by Jens, I
> replaced that lock with the exact, same scheduler lock used in
> mq-deadline.

It's easy to see that both mq-deadline and BFQ define a queue-wide lock. 
For mq-deadline it's deadline_data.lock. For BFQ it's bfq_data.lock. That
last lock serializes all bfq_dispatch_request() calls and hence reduces 
concurrency while processing I/O requests. From bfq_dispatch_request():

static struct request *bfq_dispatch_request(struct blk_mq_hw_ctx *hctx)
{
	struct bfq_data *bfqd = hctx->queue->elevator->elevator_data;
	[ ... ]
	spin_lock_irq(&bfqd->lock);
	[ ... ]
}

I think the above makes it very clear that bfqd->lock is queue-wide.

It is easy to understand why both I/O schedulers need a queue-wide lock: 
the only way to avoid race conditions when considering all pending I/O 
requests for scheduling decisions is to use a lock that covers all 
pending requests and hence that is queue-wide.
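
The serialization described above can be illustrated with a small userspace
analogy. This is a sketch under stated assumptions, not kernel code: struct
sched_data, struct dispatcher, dispatch_request() and run_dispatch_demo() are
hypothetical names, and a pthread mutex stands in for the spinlock taken in
bfq_dispatch_request(). Several dispatching threads share one scheduler
instance, so every dispatch passes through the same lock.

```c
#include <assert.h>
#include <pthread.h>
#include <stddef.h>

#define NR_DISPATCHERS 4     /* hypothetical number of concurrent dispatchers */
#define NR_REQUESTS    10000 /* requests pre-loaded into the scheduler */

/* Stand-in for bfq_data / deadline_data: one lock covers all pending requests. */
struct sched_data {
	pthread_mutex_t lock;
	int pending;             /* requests still queued in the scheduler */
};

struct dispatcher {
	struct sched_data *sd;   /* all dispatchers share the same scheduler data */
	int local_count;         /* requests this dispatcher obtained */
};

/* Same shape as the quoted function: lock, pick one request, unlock. */
static int dispatch_request(struct dispatcher *d)
{
	int got = 0;

	pthread_mutex_lock(&d->sd->lock);
	if (d->sd->pending > 0) {
		d->sd->pending--;
		got = 1;
	}
	pthread_mutex_unlock(&d->sd->lock);
	return got;
}

static void *dispatcher_thread(void *arg)
{
	struct dispatcher *d = arg;

	while (dispatch_request(d))
		d->local_count++;
	return NULL;
}

/* Returns the total number of requests dispatched, or -1 if no thread started. */
static int run_dispatch_demo(void)
{
	struct sched_data sd = { .pending = NR_REQUESTS };
	struct dispatcher disp[NR_DISPATCHERS];
	pthread_t tid[NR_DISPATCHERS];
	int started = 0, total = 0;

	pthread_mutex_init(&sd.lock, NULL);
	for (int i = 0; i < NR_DISPATCHERS; i++) {
		disp[i].sd = &sd;
		disp[i].local_count = 0;
		if (pthread_create(&tid[i], NULL, dispatcher_thread, &disp[i]) != 0)
			break;
		started++;
	}
	for (int i = 0; i < started; i++) {
		pthread_join(tid[i], NULL);
		total += disp[i].local_count;
	}
	pthread_mutex_destroy(&sd.lock);
	if (started == 0)
		return -1;
	assert(sd.pending == 0); /* every request was handed out exactly once */
	return total;
}
```

Because the lock covers the whole pending set, no request is dispatched twice,
but only one dispatcher at a time can make a scheduling decision. That is the
trade-off under discussion.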

Bart.
Paolo Valente Oct. 6, 2018, 4:46 p.m. UTC | #50
> On 6 Oct 2018, at 18:20, Bart Van Assche <bvanassche@acm.org> wrote:
> 
> On 10/5/18 11:46 PM, Paolo Valente wrote:
>>> On 6 Oct 2018, at 05:12, Bart Van Assche <bvanassche@acm.org> wrote:
>>> On 10/5/18 2:16 AM, Jan Kara wrote:
>>>> On Thu 04-10-18 15:42:52, Bart Van Assche wrote:
>>>>> What I think is missing is measurement results for BFQ on a system with
>>>>> multiple CPU sockets and against a fast storage medium. Eliminating
>>>>> the host lock from the SCSI core yielded a significant performance
>>>>> improvement for such storage devices. Since the BFQ scheduler locks and
>>>>> unlocks bfqd->lock for every dispatch operation it is very likely that BFQ
>>>>> will slow down I/O for fast storage devices, even if their driver only
>>>>> creates a single hardware queue.
>>>> Well, I'm not sure why that is missing. I don't think anyone proposed to
>>>> default to BFQ for such setup? Neither was anyone claiming that BFQ is
>>>> better in such situation... The proposal has been: Default to BFQ for slow
>>>> storage, leave it to deadline-mq otherwise.
>>> 
>>> How do you define slow storage? The proposal at the start of this thread
>>> was to make BFQ the default for all block devices that create a single
>>> hardware queue. That includes all SATA storage since scsi-mq only creates
>>> a single hardware queue when using the SATA protocol. The proposal to make
>>> BFQ the default for systems with a single hard disk probably makes sense
>>> but I am not sure that making BFQ the default for systems equipped with
>>> one or more (SATA) SSDs is also a good idea. Especially for multi-socket
>>> systems since BFQ reintroduces a queue-wide lock.
>> No, BFQ has no queue-wide lock.  The very first change made to BFQ for
>> porting it to blk-mq was to remove the queue lock.  Guided by Jens, I
>> replaced that lock with the exact, same scheduler lock used in
>> mq-deadline.
> 
> It's easy to see that both mq-deadline and BFQ define a queue-wide lock. For mq-deadline it's deadline_data.lock. For BFQ it's bfq_data.lock. That last lock serializes all bfq_dispatch_request() calls and hence reduces concurrency while processing I/O requests. From bfq_dispatch_request():
> 
> static struct request *bfq_dispatch_request(struct blk_mq_hw_ctx *hctx)
> {
> 	struct bfq_data *bfqd = hctx->queue->elevator->elevator_data;
> 	[ ... ]
> 	spin_lock_irq(&bfqd->lock);
> 	[ ... ]
> }
> 
> I think the above makes it very clear that bfqd->lock is queue-wide.
> 
> It is easy to understand why both I/O schedulers need a queue-wide lock: the only way to avoid race conditions when considering all pending I/O requests for scheduling decisions is to use a lock that covers all pending requests and hence that is queue-wide.
> 

Absolutely true.  Queue lock is evidently a very general concept, and
a lock on a scheduler is, in the end, a lock on its internal queue(s).
But the queue lock removed by blk-mq is not that small per-scheduler
lock, but the big, single-request-queue lock.  The effects of the
latter are probably almost one order of magnitude higher than those of
a scheduler lock, even with a non-trivial scheduler like BFQ.

As a simple concrete proof of this fact, consider the numbers that I
already gave you, and that you can re-obtain in five minutes: on a
laptop, BFQ may support up to 400 KIOPS.  Probably, even just with noop
as I/O scheduler, the same PC cannot process so many IOPS with legacy
blk (because of the single-request-queue lock).

To sum up, in your argument you mixed two different locks.

Anyway, you are going very deep in this issue.  This takes you very
close to what I'm currently working on (still in a design phase):
increasing the parallel efficiency of BFQ, mainly by reducing the
duration of the pieces of BFQ executed under its scheduler lock.

But the goal of such a non-trivial improvement is to go from the
current 400 KIOPS to more than one million IOPS.  This is an
improvement that will most likely provide no benefits for probably 99%
of the systems with single-queue devices.  Those systems simply do not go
beyond 300 KIOPS.

So, I'm trying to first devote my limited single-person bandwidth
(sorry, I didn't resist the temptation to joke on this growing
discussion on single-something issues :) ) to improvements that make
BFQ better within its current hardware scope.

Thanks,
Paolo

> Bart.

Patch
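
The selection order that the patch below introduces can be summarized as a
pure function. This is only a sketch: enum default_sched and
pick_default_sched() are hypothetical names, and the two boolean parameters
stand for whether elevator_get() would return a scheduler for "bfq" and
"mq-deadline" respectively. The assertion encodes the assumption that a queue
always has at least one hardware queue.

```c
#include <assert.h>
#include <stdbool.h>

/* Hypothetical result type, used only for this illustration. */
enum default_sched {
	DEF_NONE,
	DEF_BFQ,
	DEF_MQ_DEADLINE,
};

/*
 * Mirrors the order in the patch's comment:
 *  - "none" for multiqueue devices (nr_hw_queues != 1)
 *  - "bfq", if available, for single queue devices
 *  - "mq-deadline" if "bfq" is not available
 *  - "none" as last resort
 */
static enum default_sched pick_default_sched(unsigned int nr_hw_queues,
					     bool bfq_available,
					     bool mq_deadline_available)
{
	assert(nr_hw_queues > 0); /* assumption: at least one hardware queue */

	if (nr_hw_queues != 1)
		return DEF_NONE;
	if (bfq_available)
		return DEF_BFQ;
	if (mq_deadline_available)
		return DEF_MQ_DEADLINE;
	return DEF_NONE;
}
```

For example, pick_default_sched(1, false, true) yields DEF_MQ_DEADLINE, which
matches the behavior before the patch for a single queue device.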

diff --git a/block/elevator.c b/block/elevator.c
index e18ac68626e3..e5a2c39eee7b 100644
--- a/block/elevator.c
+++ b/block/elevator.c
@@ -948,13 +948,15 @@  int elevator_switch_mq(struct request_queue *q,
 }
 
 /*
- * For blk-mq devices, we default to using mq-deadline, if available, for single
- * queue devices.  If deadline isn't available OR we have multiple queues,
- * default to "none".
+ * For blk-mq devices, we default to using:
+ * - "none" for multiqueue devices (nr_hw_queues != 1)
+ * - "bfq", if available, for single queue devices
+ * - "mq-deadline" if "bfq" is not available for single queue devices
+ * - "none" for single queue devices as well as last resort
  */
 int elevator_init_mq(struct request_queue *q)
 {
-	struct elevator_type *e;
+	struct elevator_type *e = NULL;
 	int err = 0;
 
 	if (q->nr_hw_queues != 1)
@@ -968,9 +970,14 @@  int elevator_init_mq(struct request_queue *q)
 	if (unlikely(q->elevator))
 		goto out_unlock;
 
-	e = elevator_get(q, "mq-deadline", false);
-	if (!e)
-		goto out_unlock;
+	if (IS_ENABLED(CONFIG_IOSCHED_BFQ))
+		e = elevator_get(q, "bfq", false);
+
+	if (!e) {
+		e = elevator_get(q, "mq-deadline", false);
+		if (!e)
+			goto out_unlock;
+	}
 
 	err = blk_mq_init_sched(q, e);
 	if (err)