Message ID | 20181002124329.21248-1-linus.walleij@linaro.org (mailing list archive) |
---|---|
State | New, archived |
Series | block: BFQ default for single queue devices |
On 10/2/18 6:43 AM, Linus Walleij wrote: > This sets BFQ as the default scheduler for single queue > block devices (nr_hw_queues == 1) if it is available. This > affects notably MMC/SD-cards but notably also UBI and > the loopback device. > > I have been running it for a while without any negative > effects on my pet systems and I want some wider testing > so let's throw it out there and see what people say. > Admittedly my use cases are limited. > > I talked to Pavel a bit back and it turns out he has a > usecase for BFQ as well and I bet he also would like it > as default scheduler for that system (Pavel tell us more, > I don't remember what it was!) > > Intuitively I could understand that maybe we want to > leave the loop device (possibly others? nbd? rbd?) as > "none", as it is probably relying on a scheduler on the > device below it, so I'm open to passing in a scheduler hint > from the respective subsystem in say struct blk_mq_tag_set. > However that makes for a bit of syntactic dissonance > with the struct member ".nr_hw_queues" (I wonder how > the loop device can have 1 "hardware queue"?) so > maybe we should in that case also rename that struct > member to ".nr_queues" fair and square before we start > making adjustments for treating queues differently whether > they are in hardware or actually not. I think this should just be done with udev rules, and I'd prefer if the distros would lead the way on this, as they are the ones that will most likely see the most bug reports on a change like this.
On Tue, Oct 2, 2018 at 4:31 PM Jens Axboe <axboe@kernel.dk> wrote:
> On 10/2/18 6:43 AM, Linus Walleij wrote:
> > This sets BFQ as the default scheduler for single queue
> > block devices (nr_hw_queues == 1) if it is available. This
> > affects notably MMC/SD-cards but notably also UBI and
> > the loopback device.
>
> I think this should just be done with udev rules, and I'd
> prefer if the distros would lead the way on this, as they
> are the ones that will most likely see the most bug reports
> on a change like this.

AFAICT there is no sysfs property that states how many hw queues the device has. And what we want to do is activate BFQ when there is one HW queue.

Should I make a patch to add a nr_hw_queues sysfs file for this purpose in that case? That will be a slightly misleading file for loop or networked devices.

udev is a way to do this with desktop/server distros that have "standard" (as they think about it) userspace. They can even do it from their initrd/initramfs to mount root using BFQ I guess (quick handover from e.g. UEFI).

However this is not a very good fit for embedded systems, as they tend to be minimal and do not use udev (e.g. Android, OpenWRT, busybox derivatives...), so they don't do udev rules. I guess they can in theory do other scripts, but they will mount root before anything like that can happen, since they don't use initrd/initramfs.

What I want to achieve is to mount my rootfs with BFQ, but that is not possible on embedded systems that do not use initramfs, e.g. with a rootfs on MMC/SD or UBI.

Yours,
Linus Walleij
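For the no-udev case, the "other scripts" route could be as small as a helper that writes a scheduler name into the queue's sysfs attribute once userspace is up. The following is a minimal sketch, not something proposed in the thread: `active_scheduler()` and `set_scheduler()` are hypothetical helper names, and the sketch assumes the usual sysfs format in which the active scheduler is shown in square brackets (e.g. `mq-deadline [bfq] none`). It does nothing for the root mount itself, which happens before any such helper can run.

```c
#include <assert.h>
#include <stddef.h>
#include <stdio.h>
#include <string.h>

/*
 * Hypothetical helper: extract the active scheduler from the contents of
 * a queue/scheduler sysfs file, where the active entry is in brackets.
 * Returns 0 on success, -1 if no bracketed entry fits into 'out'.
 */
int active_scheduler(const char *line, char *out, size_t outlen)
{
    const char *open, *close;
    size_t len;

    assert(line != NULL && out != NULL);

    open = strchr(line, '[');
    close = open ? strchr(open, ']') : NULL;
    if (!open || !close)
        return -1;

    len = (size_t)(close - open - 1);
    if (len + 1 > outlen)
        return -1;

    memcpy(out, open + 1, len);
    out[len] = '\0';
    return 0;
}

/*
 * Hypothetical helper: request a scheduler by writing its name to the
 * given sysfs path. Returns 0 on success, -1 on any I/O error.
 */
int set_scheduler(const char *path, const char *name)
{
    FILE *f = fopen(path, "w");
    int ok;

    if (!f)
        return -1;
    ok = fputs(name, f) >= 0;
    /* Buffered writes may only fail at close time, so check that too. */
    if (fclose(f) != 0)
        ok = 0;
    return ok ? 0 : -1;
}
```

On a target this might be called as `set_scheduler("/sys/block/mmcblk0/queue/scheduler", "bfq")`, where `mmcblk0` is only an example device name.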
Linus,

On Tuesday, 2 October 2018, 14:43:29 CEST, Linus Walleij wrote:
> This sets BFQ as the default scheduler for single queue
> block devices (nr_hw_queues == 1) if it is available. This
> affects notably MMC/SD-cards but notably also UBI and
> the loopback device.

Did you notice a difference for UBI? Strictly speaking it affects only ubiblock, the read-only block device on top of a UBI volume.

Thanks,
//richard
> On 2 Oct 2018, at 16:31, Jens Axboe <axboe@kernel.dk> wrote:
>
> On 10/2/18 6:43 AM, Linus Walleij wrote:
>> This sets BFQ as the default scheduler for single queue
>> block devices (nr_hw_queues == 1) if it is available. This
>> affects notably MMC/SD-cards but notably also UBI and
>> the loopback device.
>>
>> I have been running it for a while without any negative
>> effects on my pet systems and I want some wider testing
>> so let's throw it out there and see what people say.
>> Admittedly my use cases are limited.
>>
>> I talked to Pavel a bit back and it turns out he has a
>> usecase for BFQ as well and I bet he also would like it
>> as default scheduler for that system (Pavel tell us more,
>> I don't remember what it was!)
>>
>> Intuitively I could understand that maybe we want to
>> leave the loop device (possibly others? nbd? rbd?) as
>> "none", as it is probably relying on a scheduler on the
>> device below it, so I'm open to passing in a scheduler hint
>> from the respective subsystem in say struct blk_mq_tag_set.
>> However that makes for a bit of syntactic dissonance
>> with the struct member ".nr_hw_queues" (I wonder how
>> the loop device can have 1 "hardware queue"?) so
>> maybe we should in that case also rename that struct
>> member to ".nr_queues" fair and square before we start
>> making adjustments for treating queues differently whether
>> they are in hardware or actually not.
>
> I think this should just be done with udev rules, and I'd
> prefer if the distros would lead the way on this, as they
> are the ones that will most likely see the most bug reports
> on a change like this.
>

Hi Jens,
I see your point, but I doubt this is the way to go, because of the following flaws.

As Linus Torvalds also complained [1], people feel lost among I/O-scheduler options. Actual differences across I/O schedulers are basically obscure to non-experts.
In this respect, Linux-kernel 'users' are way more than a few top-level distros that can afford a strong performance team and that, based on the input of such a team, might venture light-heartedly to change a critical component like an I/O scheduler. Plus, as Linus Walleij pointed out, some users simply are not distros that use udev.

So, probably 99% of Linux-kernel users will just stick to the default I/O scheduler, mq-deadline, assuming that the algorithm by which that scheduler was chosen was not "pick the scheduler with the longest name", but "pick the best scheduler for most cases". The problem is that, for single-queue devices with a speed below 400/500 KIOPS, the default scheduler is apparently incomparably worse than bfq in terms of responsiveness and latency for time-sensitive applications [2], and in terms of throughput reached while controlling I/O [3]. And, in all other tests run so far, by any entity or group I'm aware of, bfq performs basically on par with or better than mq-deadline.

So, I do understand your need for conservativeness, but, after so much evidence on single-queue devices, and so many years! :), what's the point in keeping Linux worse for virtually everybody, by default?

Thanks,
Paolo

[1] https://lkml.org/lkml/2017/2/21/791
[2] http://algo.ing.unimo.it/people/paolo/disk_sched/results.php
[3] https://lwn.net/Articles/763603/

> --
> Jens Axboe
On Wed, Oct 3, 2018 at 8:29 AM Paolo Valente <paolo.valente@linaro.org> wrote: > So, I do understand your need for conservativeness, but, after so much > evidence on single-queue devices, and so many years! :), what's the > point in keeping Linux worse for virtually everybody, by default? I understand if we need to ease things in as well, I don't intend this change for the current merge window or anything, since v4.19 will notably have this patch: commit d5038a13eca72fb216c07eb717169092e92284f1 Author: Johannes Thumshirn <jthumshirn@suse.de> Date: Wed Jul 4 10:53:56 2018 +0200 scsi: core: switch to scsi-mq by default It has been more than one year since we tried to change the default from legacy to multi queue in SCSI with commit c279bd9e406 ("scsi: default to scsi-mq"). But due to issues with suspend/resume and performance problems it had been reverted again with commit cbe7dfa26eee ("Revert "scsi: default to scsi-mq""). In the meantime there have been a substantial amount of performance improvements and suspend/resume got fixed as well, thus we can re-enable scsi-mq without a significant performance penalty. Signed-off-by: Johannes Thumshirn <jthumshirn@suse.de> Reviewed-by: Hannes Reinecke <hare@suse.com> Reviewed-by: Ming Lei <ming.lei@redhat.com> Acked-by: John Garry <john.garry@huawei.com> Signed-off-by: Martin K. Petersen <martin.petersen@oracle.com> I guess that patch can be a bit scary by itself. But IIUC it all went fine this time! But hey, if that works, that means $SUBJECT patch will enable BFQ on all libata devices and any SCSI that is single queue as well, not just "obscure" stuff like MMC/SD and UBI, and that is indeed a massive crowd of legacy devices. But we're talking v4.21 here. Johannes, you might be interested in $SUBJECT patch. It'd be nice to hear what SUSE people have to add, since they are pretty proactive in this area. Yours, Linus Walleij
On Wed, 2018-10-03 at 08:29 +0200, Paolo Valente wrote:
> So, I do understand your need for conservativeness, but, after so much
> evidence on single-queue devices, and so many years! :), what's the
> point in keeping Linux worse for virtually everybody, by default?

Sounds like we just need a mechanism for the device (ubi block in this case) to select the I/O scheduler. I doubt enhancing the default scheduler selection logic in 'elevator.c' is the right answer. Just give the driver authority to override the defaults.
On Wed, Oct 3, 2018 at 9:05 AM Artem Bityutskiy <dedekind1@gmail.com> wrote: > On Wed, 2018-10-03 at 08:29 +0200, Paolo Valente wrote: > > So, I do understand your need for conservativeness, but, after so much > > evidence on single-queue devices, and so many years! :), what's the > > point in keeping Linux worse for virtually everybody, by default? > > Sounds like what we just need a mechanism for the device (ubi block in > this case) to select the I/O scheduler. I doubt enhancing the default > scheduler selection logic in 'elevator.c' is the right answer. Just > give the driver authority to override the defaults. This might be true in the wider sense (like for what scheduler to select for an NVME device with N channels) but $SUBJECT is just trying to select BFQ (if available) for devices with one and only one hardware queue. That is AFAICT the only reasonable choice for anything with just one hardware queue as things stand right now. I have a slight reservation for the weird outliers like loopdev, which has "one hardware queue" (.nr_hw_queues == 1) though this makes no sense at all. So I would like to know what people think about that. Maybe we should have .nr_queues and .nr_hw_queues where the former is the number of logical queues and the latter the actual number of hardware queues. Yours, Linus Walleij
On 2018/10/03 16:18, Linus Walleij wrote: > On Wed, Oct 3, 2018 at 9:05 AM Artem Bityutskiy <dedekind1@gmail.com> wrote: >> On Wed, 2018-10-03 at 08:29 +0200, Paolo Valente wrote: >>> So, I do understand your need for conservativeness, but, after so much >>> evidence on single-queue devices, and so many years! :), what's the >>> point in keeping Linux worse for virtually everybody, by default? >> >> Sounds like what we just need a mechanism for the device (ubi block in >> this case) to select the I/O scheduler. I doubt enhancing the default >> scheduler selection logic in 'elevator.c' is the right answer. Just >> give the driver authority to override the defaults. > > This might be true in the wider sense (like for what scheduler to > select for an NVME device with N channels) but $SUBJECT is just > trying to select BFQ (if available) for devices with one and only one > hardware queue. > > That is AFAICT the only reasonable choice for anything with just > one hardware queue as things stand right now. > > I have a slight reservation for the weird outliers like loopdev, which > has "one hardware queue" (.nr_hw_queues == 1) though this > makes no sense at all. So I would like to know what people think > about that. Maybe we should have .nr_queues and .nr_hw_queues > where the former is the number of logical queues and the latter > the actual number of hardware queues. There is another class of outliers: host-managed SMR disks (SATA and SCSI, definitely single hw queue). For these, using mq-deadline is mandatory in many cases in order to guarantee sequential write command delivery to the device driver. Having the default changed to bfq, which as far as I know is not SMR friendly (can sequential writes within a single zone be reordered ?) is asking for troubles (unaligned write errors showing up). 
A while back, we already had this discussion with Jens and Christoph on the list to allow device drivers to set a sensible default I/O scheduler for devices with "special needs" (e.g. host-managed SMR). At the time, the conclusion was that udev (or something similar in userland) is better suited to set a correct scheduler.

Of note also is that host-managed-like sequential zone devices are also likely to show up soon with the work being done in the NVMe standard on the new "Zoned namespace" feature proposal. These devices will also require a scheduler like mq-deadline guaranteeing per-zone in-order delivery of sequential write requests.

Looking only at the number of queues of the device is not enough to choose the best (most reasonable/appropriate) scheduler.
On Wed, Oct 3, 2018 at 9:42 AM Damien Le Moal <Damien.LeMoal@wdc.com> wrote:

> There is another class of outliers: host-managed SMR disks (SATA and SCSI,
> definitely single hw queue). For these, using mq-deadline is mandatory in many
> cases in order to guarantee sequential write command delivery to the device
> driver. Having the default changed to bfq, which as far as I know is not SMR
> friendly (can sequential writes within a single zone be reordered ?) is asking
> for troubles (unaligned write errors showing up).

Ah, that is interesting.

Which device driver files are we talking about here, specifically? I'd like to take a look.

I guess what you say is not that you are looking for the deadline scheduling per se (as in deadline scheduling is nice), what you want is the zone locking semantics in that scheduler, is that right?

I.e. this business:
blk_queue_is_zoned(q)
blk_req_zone_write_lock(rq);
blk_req_zone_write_unlock(rq);
and mq-deadline solves this with a spinlock.

I will augment the patch to enforce mq-deadline if blk_queue_is_zoned(q) is true, as it is clear that any device with that characteristic must use mq-deadline.

Paolo might be interested in looking into whether BFQ could also handle zoned devices in the future, I have no idea of how hard that would be.

The zoned business seems a bit fragile. Should it even be allowed to select any other scheduler than deadline on these devices? Presenting all compiled-in schedulers in /sys/block/<device>/queue/scheduler sounds like just giving sysadmins too much rope.

Yours,
Linus Walleij
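The selection order being discussed, with the proposed zoned-device exception added, can be modelled in user space as follows. This is only a sketch under stated assumptions, not the augmented patch: `struct queue_model`, `enum elv_choice` and `default_elevator()` are hypothetical names, the boolean fields stand in for `blk_queue_is_zoned(q)` and for whether `elevator_get()` would succeed, and falling back to "none" for a zoned device without mq-deadline is an assumption made here.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>

/* Hypothetical result type for the model below. */
enum elv_choice { ELV_NONE, ELV_MQ_DEADLINE, ELV_BFQ };

/* Hypothetical user-space stand-in for the queue properties that matter. */
struct queue_model {
    unsigned int nr_hw_queues;
    bool zoned;            /* models blk_queue_is_zoned(q) */
    bool have_bfq;         /* models elevator_get(q, "bfq", ...) succeeding */
    bool have_mq_deadline; /* models elevator_get(q, "mq-deadline", ...) */
};

enum elv_choice default_elevator(const struct queue_model *q)
{
    assert(q != NULL);

    /* Multiqueue devices keep "none", as before the patch. */
    if (q->nr_hw_queues != 1)
        return ELV_NONE;

    /* Zoned devices need the zone write locking in mq-deadline. */
    if (q->zoned)
        return q->have_mq_deadline ? ELV_MQ_DEADLINE : ELV_NONE;

    /* Single queue: prefer bfq, then mq-deadline, then "none". */
    if (q->have_bfq)
        return ELV_BFQ;
    if (q->have_mq_deadline)
        return ELV_MQ_DEADLINE;
    return ELV_NONE;
}
```

The zoned check sits before the bfq check so that a kernel with both schedulers built in still picks mq-deadline for a host-managed SMR disk.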
Linus,

On 2018/10/03 17:28, Linus Walleij wrote:
> On Wed, Oct 3, 2018 at 9:42 AM Damien Le Moal <Damien.LeMoal@wdc.com> wrote:
>
>> There is another class of outliers: host-managed SMR disks (SATA and SCSI,
>> definitely single hw queue). For these, using mq-deadline is mandatory in many
>> cases in order to guarantee sequential write command delivery to the device
>> driver. Having the default changed to bfq, which as far as I know is not SMR
>> friendly (can sequential writes within a single zone be reordered ?) is asking
>> for troubles (unaligned write errors showing up).
>
> Ah, that is interesting.
>
> Which device driver files are we talking about here, specifically?
> I'd like to take a look.

Currently, sd.c (SCSI disk) as well as null_blk can expose host-managed zoned block devices.

> I guess what you say is not that you are looking for the deadline
> scheduling per se (as in deadline scheduling is nice), what you want is
> the zone locking semantics in that scheduler, is that right?

Yes, correct. The scheduling policy in itself does not really matter, but it should not deviate from the mandatory HM write policy: "within a sequential write required zone, writes must be issued sequentially". That could somewhat impact the scheduler code itself if said scheduler thinks that not dispatching sequential writes in sequence is a good idea :) No sane scheduler would do that though (at least on HDDs), so the impact on the scheduler code itself is reduced.

> I.e. this business:
> blk_queue_is_zoned(q)
> blk_req_zone_write_lock(rq);
> blk_req_zone_write_unlock(rq);
> and mq-deadline solves this with a spinlock.

Yes. These are the helper functions handling the zone write locking, which simplify the scheduler's task of limiting the number of in-flight write requests to at most one per zone at any time.
This is the trick to avoid write reordering stack-wide, since most of the time reordering happens not because of the scheduler itself, but because of the blk-mq (or legacy) path around it (e.g. requeue due to resource shortage, multiple contexts running the queues, etc).

> I will augment the patch to enforce mq-deadline
> if blk_queue_is_zoned(q) is true, as it is clear that
> any device with that characteristic must use mq-deadline.
>
> Paolo might be interested in looking into whether BFQ could
> also handle zoned devices in the future, I have no idea of how
> hard that would be.

It was rather easy with deadline, but the scheduler code was simple to start with. Basically, the only thing needed is, on dispatch, to skip any write request to a zone that is already locked (i.e. a write is already ongoing). For reads, there are no constraints so nothing needs to be changed. Zone unlocking must be done on completion of the write request, so the scheduler completion method needs to change a little too.

> The zoned business seems a bit fragile. Should it even be
> allowed to select any other scheduler than deadline on these
> devices? Presenting all compiled in schedulers in
> /sys/block/<device>/queue/scheduler sounds like just giving
> sysadmins too much rope.

Yes, that is debatable. But the above "one write per zone" trick can also be handled by an application, which makes using any other scheduler OK. Look at the recent SMR support code in fio from Bart. Some of the tests (in t/zbd) for some I/O patterns are just fine with any scheduler. In fact any pattern is OK because fio is SMR aware and never issues more than one write per zone. That is however true if and only if the I/O sizes are small enough not to cause the kernel to generate multiple BIOs for each I/O call. Otherwise, deadline & mq-deadline become necessary.

But I agree, it is a little fragile. Application developers and sysadmins really need to know what will be running on the disk to make the right choice.
And knowing that is not necessarily straightforward. Best regards.
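The "one write per zone" rule described in this reply (skip writes to a locked zone on dispatch, unlock on completion, leave reads alone) can be illustrated with a small user-space model. This is a sketch only, not the mq-deadline code: `NR_ZONES`, `struct req`, `struct zoned_sched`, `dispatch_next()` and `complete_req()` are hypothetical names, the FIFO is a plain array, and the real scheduler uses the `blk_req_zone_write_lock()`/`blk_req_zone_write_unlock()` helpers under a spinlock.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>

#define NR_ZONES 4 /* hypothetical, tiny device */

struct req {
    bool is_write;
    unsigned int zone;
    bool dispatched;
};

struct zoned_sched {
    bool zone_locked[NR_ZONES]; /* true while a write to the zone is in flight */
};

/*
 * Pick the first not-yet-dispatched request in FIFO order, skipping any
 * write whose target zone already has a write in flight. Reads are never
 * held back. Returns the index of the dispatched request, or -1 if
 * nothing can be dispatched right now.
 */
int dispatch_next(struct zoned_sched *s, struct req *fifo, size_t n)
{
    for (size_t i = 0; i < n; i++) {
        struct req *r = &fifo[i];

        if (r->dispatched)
            continue;
        if (r->is_write) {
            assert(r->zone < NR_ZONES);
            if (s->zone_locked[r->zone])
                continue; /* zone busy: keep this write queued */
            s->zone_locked[r->zone] = true;
        }
        r->dispatched = true;
        return (int)i;
    }
    return -1;
}

/* On completion of a write, release its zone so the next write can go. */
void complete_req(struct zoned_sched *s, const struct req *r)
{
    if (r->is_write)
        s->zone_locked[r->zone] = false;
}
```

Because the scan is in FIFO order and a locked zone holds back every later write to it, writes within one zone leave the model in submission order, while reads and writes to other zones can pass them.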
Hi.

On 03.10.2018 08:29, Paolo Valente wrote:
> As also Linus Torvalds complained [1], people feel lost among
> I/O-scheduler options. Actual differences across I/O schedulers are
> basically obscure to non experts. In this respect, Linux-kernel
> 'users' are way more than a few top-level distros that can afford a
> strong performance team, and that, basing on the input of such a team,
> might venture light-heartedly to change a critical component like an
> I/O scheduler. Plus, as Linus Walleij pointed out, some users simply
> are not distros that use udev.

I feel a contradiction in this counter-argument. On the one hand, there are lots of, let's call them, home users, who use major distributions with udev, so the distribution maintainers can reasonably decide which scheduler to use for which type of device based on udev rules and common sense provided via Documentation/ by linux-block devs. Moreover, most likely, those rules should be similar or the same across all the major distros and available via some (systemd?) upstream.

On the other hand, the users of embedded devices, mentioned by Linus, should already know what scheduler to choose, because dealing with the embedded world assumes the person can decide this on their own, or with the help of the abovementioned udev scripts and/or Documentation/ as a reference point.

So I see no obstacles here, and the choice to rely on udev by default sounds reasonable.

The question that remains is whether it is really important to mount a root partition while already using some specific scheduler. Why can it not be done with "none", for instance?

> So, probably 99% of Linux-kernel users will just stick to the default
> I/O scheduler, mq-deadline, assuming that the algorithm by which that
> scheduler was chosen was not "pick the scheduler with the longest
> name", but "pick the best scheduler for most cases". The problem is
> that, for single-queue devices with a speed below 400/500 KIOPS, the
> default scheduler is apparently incomparably worse than bfq in terms
> of responsiveness and latency for time-sensitive applications [2], and
> in terms of throughput reached while controlling I/O [3]. And, in all
> other tests ran so far, by any entity or group I'm aware of, bfq
> results basically on par with or better than mq-deadline.

And that's why major distributions are likely to default to BFQ via udev. No one argues with BFQ superiority here ☺.

> So, I do understand your need for conservativeness, but, after so much
> evidence on single-queue devices, and so many years! :), what's the
> point in keeping Linux worse for virtually everybody, by default?

From my point of view this is not a conservative approach at all. On the contrary, offloading decisions to userspace aligns pretty well with recent trends like pressure metrics/userspace OOM killer, eBPF etc. The less unnecessary logic the kernel handles, the more flexibility it affords.
On Wed, Oct 03, 2018 at 07:42:15AM +0000, Damien Le Moal wrote:
> Of note also is that host-managed like sequential zone devices are also likely
> to show up soon with the work being done in the NVMe standard on the new "Zoned
> namespace" feature proposal. These devices will also require a scheduler like
> mq-deadline guaranteeing per-zone in-order delivery of sequential write
> requests. Looking only at the number of queues of the device is not enough to
> choose the best (most reasonable/appropriate) scheduler.

We actually have a plan to avoid the need for a non-reordering scheduler there (including a Linux prototype for it). Let's see if it survives the committee.
On Wed 03-10-18 08:53:37, Linus Walleij wrote: > On Wed, Oct 3, 2018 at 8:29 AM Paolo Valente <paolo.valente@linaro.org> wrote: > > > So, I do understand your need for conservativeness, but, after so much > > evidence on single-queue devices, and so many years! :), what's the > > point in keeping Linux worse for virtually everybody, by default? > > I understand if we need to ease things in as well, I don't intend this > change for the current merge window or anything, since v4.19 > will notably have this patch: > > commit d5038a13eca72fb216c07eb717169092e92284f1 > Author: Johannes Thumshirn <jthumshirn@suse.de> > Date: Wed Jul 4 10:53:56 2018 +0200 > > scsi: core: switch to scsi-mq by default > > It has been more than one year since we tried to change the default from > legacy to multi queue in SCSI with commit c279bd9e406 ("scsi: default to > scsi-mq"). But due to issues with suspend/resume and performance problems > it had been reverted again with commit cbe7dfa26eee ("Revert "scsi: default > to scsi-mq""). > > In the meantime there have been a substantial amount of performance > improvements and suspend/resume got fixed as well, thus we can re-enable > scsi-mq without a significant performance penalty. > > Signed-off-by: Johannes Thumshirn <jthumshirn@suse.de> > Reviewed-by: Hannes Reinecke <hare@suse.com> > Reviewed-by: Ming Lei <ming.lei@redhat.com> > Acked-by: John Garry <john.garry@huawei.com> > Signed-off-by: Martin K. Petersen <martin.petersen@oracle.com> > > I guess that patch can be a bit scary by itself. But IIUC it all went > fine this time! > > But hey, if that works, that means $SUBJECT patch will enable BFQ on all > libata devices and any SCSI that is single queue as well, not just > "obscure" stuff like MMC/SD and UBI, and that is > indeed a massive crowd of legacy devices. But we're talking > v4.21 here. > > Johannes, you might be interested in $SUBJECT patch. 
> It'd be nice to hear what SUSE people have to add, since they
> are pretty proactive in this area.

So we do have udev rules in our distro which set the IO scheduler based on device parameters (rotational at least; with blk-mq we might start considering the number of queues as well, plus we have some exceptions like virtio, loop, etc.). So the kernel default doesn't concern us too much as a distro. I personally would consider bfq a safer default for single-queue devices (loop probably needs an exception) but I don't feel too strongly about it.

Honza
On Wed, Oct 03, 2018 at 01:49:25PM +0200, Oleksandr Natalenko wrote: > On another hand, the users of embedded devices, mentioned by Linus, should > already know what scheduler to choose because dealing with embedded world > assumes the person can decide this on their own, or with the help of > abovementioned udev scripts and/or Documentation/ as a reference point. That's not an entirely realistic assessment of a lot of practical embedded development - while people *can* go in and tweak things to their heart's content and some will have the time to do that there's a lot of small teams pulling together entire systems who rely fairly heavily on defaults, focusing most of their effort on the bits of code they directly wrote. You get things like people taking a copy of an embedded distro at some point and then only updating components that they specifically want to update like the new kernel with the drivers for the SoC in the new product. > So I see no obstacles here, and the choice to rely on udev by default sounds > reasonable. There's still a good number of users where there's a big discoverability problem here I fear. We have this regularly with the arm64 fixups for emulating old locking constructs that were removed from the architecture (useful for running old arm binaries on arm64 systems), that's got a Kconfig option but also requires enabling at runtime. I've had to help several users who were completely frustrated trying to get their old binaries working having upgraded to a kernel with the option, turned it on in Kconfig and then being unaware that there was also this hoop userspace had to jump through. This is less severe as it's only a performance thing but still potentially annoying.
On Wed, 2018-10-03 at 05:51 -0700, Christoph Hellwig wrote:
> On Wed, Oct 03, 2018 at 07:42:15AM +0000, Damien Le Moal wrote:
> > Of note also is that host-managed like sequential zone devices are also likely
> > to show up soon with the work being done in the NVMe standard on the new "Zoned
> > namespace" feature proposal. These devices will also require a scheduler like
> > mq-deadline guaranteeing per-zone in-order delivery of sequential write
> > requests. Looking only at the number of queues of the device is not enough to
> > choose the best (most reasonable/appropriate) scheduler.
>
> We actually have a plan to avoid the need for a non-reordering scheduler
> there (including a Linux prototype for it). Let's see if it survives the
> committee.

Has the work with the T10 committee to standardize the SCSI equivalent of anonymous writes already started?

Thanks,

Bart.
On Wed, Oct 03, 2018 at 07:58:52AM -0700, Bart Van Assche wrote: > Has the work with the T10 committee to standardize the SCSI equivalent of anonymous > writes already started? No, and I don't know of anyone who wants to do that in the short term.
On Wed, 2018-10-03 at 08:01 -0700, Christoph Hellwig wrote: > On Wed, Oct 03, 2018 at 07:58:52AM -0700, Bart Van Assche wrote: > > Has the work with the T10 committee to standardize the SCSI equivalent of anonymous > > writes already started? > > No, and I don't know of anyone who wants to do that in the short term. That's unfortunate. I think having such a command available in the SCSI command set would be a step forward. Bart.
> On 2 Oct 2018, at 14:43, Linus Walleij <linus.walleij@linaro.org> wrote:
>
> This sets BFQ as the default scheduler for single queue
> block devices (nr_hw_queues == 1) if it is available. This
> affects notably MMC/SD-cards but notably also UBI and
> the loopback device.
>
> I have been running it for a while without any negative
> effects on my pet systems and I want some wider testing
> so let's throw it out there and see what people say.
> Admittedly my use cases are limited.
>
> I talked to Pavel a bit back and it turns out he has a
> usecase for BFQ as well and I bet he also would like it
> as default scheduler for that system (Pavel tell us more,
> I don't remember what it was!)
>
> Intuitively I could understand that maybe we want to
> leave the loop device

Actually, I've tested loop devices too. And, also with these virtual devices, switching to bfq radically improves figures of merit such as responsiveness and latency for soft real-time applications.

Thanks,
Paolo

> (possibly others? nbd? rbd?) as
> "none", as it is probably relying on a scheduler on the
> device below it, so I'm open to passing in a scheduler hint
> from the respective subsystem in say struct blk_mq_tag_set.
> However that makes for a bit of syntactic dissonance
> with the struct member ".nr_hw_queues" (I wonder how
> the loop device can have 1 "hardware queue"?) so
> maybe we should in that case also rename that struct
> member to ".nr_queues" fair and square before we start
> making adjustments for treating queues differently whether
> they are in hardware or actually not.
>
> Cc: Pavel Machek <pavel@ucw.cz>
> Cc: Paolo Valente <paolo.valente@linaro.org>
> Cc: Jens Axboe <axboe@kernel.dk>
> Cc: Ulf Hansson <ulf.hansson@linaro.org>
> Cc: Richard Weinberger <richard@nod.at>
> Cc: Artem Bityutskiy <dedekind1@gmail.com>
> Cc: Adrian Hunter <adrian.hunter@intel.com>
> Signed-off-by: Linus Walleij <linus.walleij@linaro.org>
> ---
>  block/elevator.c | 21 ++++++++++++++-------
>  1 file changed, 14 insertions(+), 7 deletions(-)
>
> diff --git a/block/elevator.c b/block/elevator.c
> index e18ac68626e3..e5a2c39eee7b 100644
> --- a/block/elevator.c
> +++ b/block/elevator.c
> @@ -948,13 +948,15 @@ int elevator_switch_mq(struct request_queue *q,
>  }
>
>  /*
> - * For blk-mq devices, we default to using mq-deadline, if available, for single
> - * queue devices. If deadline isn't available OR we have multiple queues,
> - * default to "none".
> + * For blk-mq devices, we default to using:
> + * - "none" for multiqueue devices (nr_hw_queues != 1)
> + * - "bfq", if available, for single queue devices
> + * - "mq-deadline" if "bfq" is not available for single queue devices
> + * - "none" for single queue devices as well as last resort
>   */
>  int elevator_init_mq(struct request_queue *q)
>  {
> -        struct elevator_type *e;
> +        struct elevator_type *e = NULL;
>          int err = 0;
>
>          if (q->nr_hw_queues != 1)
> @@ -968,9 +970,14 @@ int elevator_init_mq(struct request_queue *q)
>          if (unlikely(q->elevator))
>                  goto out_unlock;
>
> -        e = elevator_get(q, "mq-deadline", false);
> -        if (!e)
> -                goto out_unlock;
> +        if (IS_ENABLED(CONFIG_IOSCHED_BFQ))
> +                e = elevator_get(q, "bfq", false);
> +
> +        if (!e) {
> +                e = elevator_get(q, "mq-deadline", false);
> +                if (!e)
> +                        goto out_unlock;
> +        }
>
>          err = blk_mq_init_sched(q, e);
>          if (err)
> --
> 2.17.1
>
> On 3 Oct 2018, at 09:42, Damien Le Moal <damien.lemoal@wdc.com> wrote:
>
> On 2018/10/03 16:18, Linus Walleij wrote:
>> On Wed, Oct 3, 2018 at 9:05 AM Artem Bityutskiy <dedekind1@gmail.com> wrote:
>>> On Wed, 2018-10-03 at 08:29 +0200, Paolo Valente wrote:
>>>> So, I do understand your need for conservativeness, but, after so much
>>>> evidence on single-queue devices, and so many years! :), what's the
>>>> point in keeping Linux worse for virtually everybody, by default?
>>>
>>> Sounds like what we just need a mechanism for the device (ubi block in
>>> this case) to select the I/O scheduler. I doubt enhancing the default
>>> scheduler selection logic in 'elevator.c' is the right answer. Just
>>> give the driver authority to override the defaults.
>>
>> This might be true in the wider sense (like for what scheduler to
>> select for an NVME device with N channels) but $SUBJECT is just
>> trying to select BFQ (if available) for devices with one and only one
>> hardware queue.
>>
>> That is AFAICT the only reasonable choice for anything with just
>> one hardware queue as things stand right now.
>>
>> I have a slight reservation for the weird outliers like loopdev, which
>> has "one hardware queue" (.nr_hw_queues == 1) though this
>> makes no sense at all. So I would like to know what people think
>> about that. Maybe we should have .nr_queues and .nr_hw_queues
>> where the former is the number of logical queues and the latter
>> the actual number of hardware queues.
>
> There is another class of outliers: host-managed SMR disks (SATA and SCSI,
> definitely single hw queue). For these, using mq-deadline is mandatory in many
> cases in order to guarantee sequential write command delivery to the device
> driver. Having the default changed to bfq, which as far as I know is not SMR
> friendly (can sequential writes within a single zone be reordered ?) is asking
> for troubles (unaligned write errors showing up).
>

Hi Damien,
actually I have followed the threads on SMR devices, and have already
looked into this. I'm sorry for not having mentioned it in my first
reply. My plan is to simply port this feature from mq-deadline to bfq.
It should be really straightforward, especially after the testing you
did through mq-deadline. Even if I'm missing some less trivial hidden
issue, I guess it won't be impossible to address. If it may be useful
for the outcome of this thread, I'm willing to raise the priority of
this change to bfq.

> A while back, we already had this discussion with Jens and Christoph on the list
> to allow device drivers to set a sensible default I/O scheduler for devices with
> "special needs" (e.g. host-managed SMR). At the time, the conclusion was that
> udev (or something alike in userland) is better suited to set a correct scheduler.
>
> Of note also is that host-managed like sequential zone devices are also likely
> to show up soon with the work being done in the NVMe standard on the new "Zoned
> namespace" feature proposal. These devices will also require a scheduler like
> mq-deadline guaranteeing per-zone in-order delivery of sequential write
> requests. Looking only at the number of queues of the device is not enough to
> choose the best (most reasonable/appropriate) scheduler.
>

Until bfq simply handles SMR devices too.

Thanks,
Paolo

> --
> Damien Le Moal
> Western Digital Research
> On 3 Oct 2018, at 10:28, Linus Walleij <linus.walleij@linaro.org> wrote:
>
> On Wed, Oct 3, 2018 at 9:42 AM Damien Le Moal <Damien.LeMoal@wdc.com> wrote:
>
>> There is another class of outliers: host-managed SMR disks (SATA and SCSI,
>> definitely single hw queue). For these, using mq-deadline is mandatory in many
>> cases in order to guarantee sequential write command delivery to the device
>> driver. Having the default changed to bfq, which as far as I know is not SMR
>> friendly (can sequential writes within a single zone be reordered?) is asking
>> for trouble (unaligned write errors showing up).
>
> Ah, that is interesting.
>
> Which device driver files are we talking about here, specifically?
> I'd like to take a look.
>
> I guess what you say is not that you are looking for the deadline
> scheduling per se (as in deadline scheduling is nice), what you want is
> the zone locking semantics in that scheduler, is that right?
>
> I.e. this business:
> blk_queue_is_zoned(q)
> blk_req_zone_write_lock(rq);
> blk_req_zone_write_unlock(rq);
> and mq-deadline solves this with a spinlock.
>
> I will augment the patch to enforce mq-deadline
> if blk_queue_is_zoned(q) is true, as it is clear that
> any device with that characteristic must use mq-deadline.
>
> Paolo might be interested in looking into whether BFQ could
> also handle zoned devices in the future, I have no idea of how
> hard that would be.
>

Absolutely, as I already wrote in my reply to Damien.

In the meantime, Linus, augmenting your patch as you propose seems
a clean and effective solution to me.

Thanks,
Paolo

> The zoned business seems a bit fragile. Should it even be
> allowed to select any other scheduler than deadline on these
> devices? Presenting all compiled in schedulers in
> /sysblock/device/queue/scheduler sounds like just giving
> sysadmins too much rope.
>
> Yours,
> Linus Walleij
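As an aside on the "zone locking semantics" mentioned above, the idea can be pictured with a toy userspace model: at most one write per zone is in flight, and a second write to the same zone is held back until the first one completes. This is a sketch under stated assumptions, not the kernel implementation. struct zone_locks, zone_write_trylock() and zone_write_unlock() are invented names, the model covers only 64 zones, and it leaves out the spinlock that serializes access in a real scheduler.

```c
#include <stdbool.h>
#include <stdint.h>

#define MODEL_NR_ZONES 64u /* arbitrary size for the toy model */

/* One bit per zone: set while a write to that zone is in flight. */
struct zone_locks {
	uint64_t bits;
};

/*
 * Try to take the write lock of a zone. A false return means the
 * scheduler would keep the request queued instead of dispatching it.
 */
bool zone_write_trylock(struct zone_locks *zl, unsigned int zone)
{
	uint64_t mask;

	if (zone >= MODEL_NR_ZONES)
		return false;
	mask = UINT64_C(1) << zone;
	if (zl->bits & mask)
		return false; /* another write to this zone is pending */
	zl->bits |= mask;
	return true;
}

/* Release the zone once its write has completed. */
void zone_write_unlock(struct zone_locks *zl, unsigned int zone)
{
	if (zone < MODEL_NR_ZONES)
		zl->bits &= ~(UINT64_C(1) << zone);
}
```

Writes to different zones do not block each other in this model; only a second write to an already locked zone waits.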
On Wed, 2018-10-03 at 08:29 +0200, Paolo Valente wrote: > [1] https://lkml.org/lkml/2017/2/21/791 > [2] http://algo.ing.unimo.it/people/paolo/disk_sched/results.php > [3] https://lwn.net/Articles/763603/ From [2]: "BFQ loses about 18% with only random readers, because the number of IOPS becomes so high that the execution time and parallel efficiency of the schedulers becomes relevant." Since the number of I/O patterns for which results are available on [2] is limited and since the number of devices for which test results are available on [2] is limited (e.g. RAID is missing), there might be other cases in which configuring BFQ as the default would introduce a regression. I agree with Jens that it's best to leave it to the Linux distributors to select a default I/O scheduler. Bart.
> Il giorno 03 ott 2018, alle ore 13:49, Oleksandr Natalenko <oleksandr@natalenko.name> ha scritto: > > Hi. > > On 03.10.2018 08:29, Paolo Valente wrote: >> As also Linus Torvalds complained [1], people feel lost among >> I/O-scheduler options. Actual differences across I/O schedulers are >> basically obscure to non experts. In this respect, Linux-kernel >> 'users' are way more than a few top-level distros that can afford a >> strong performance team, and that, basing on the input of such a team, >> might venture light-heartedly to change a critical component like an >> I/O scheduler. Plus, as Linus Walleij pointed out, some users simply >> are not distros that use udev. > > I feel a contradiction in this counter-argument. On one hand, there are lots of, let's call them, home users, that use major distributions with udev, so the distribution maintainers can reasonably decide which scheduler to use for which type of device based on the udev rule and common sense provided via Documentation/ by linux-block devs. Moreover, most likely, those rules should be similar or the same across all the major distros and available via some (systemd?) upstream. > Let me basically repeat Mark's answer here, with my words. Unfortunately, facts mismatch with your optimistic view: after so many years and concordant test results, only very few distributions switched to bfq, no major distribution did (AFAIK). As I already wrote, the reason is the one pointed out by Torvalds [1]. Do you want a simple example? Take the last sentence in Jan's email in this thread: "I *personally would* consider bfq a safer default ... but *I don't feel too strongly* about it." And he is definitely a storage expert. The problem, in particular, is that bfq is a complex beast, fighting against a jungle of I/O issues. You have to be really into bfq, even to just know all of its features! 
> On another hand, the users of embedded devices, mentioned by Linus, should already know what scheduler to choose because dealing with embedded world assumes the person can decide this on their own, or with the help of abovementioned udev scripts and/or Documentation/ as a reference point. > Same situation for embedded devices, if not even worse. Again for the same reasons above. In the end, it is hard even for a kernel expert to be an in-depth expert of every possible complex component. > So I see no obstacles here, and the choice to rely on udev by default sounds reasonable. > > The question that remain is whether it is really important to mount a root partition while already using some specific scheduler? Why it cannot be done with "none", for instance? > >> So, probably 99% of Linux-kernel users will just stick to the default >> I/O scheduler, mq-deadline, assuming that the algorithm by which that >> scheduler was chosen was not "pick the scheduler with the longest >> name", but "pick the best scheduler for most cases". The problem is >> that, for single-queue devices with a speed below 400/500 KIOPS, the >> default scheduler is apparently incomparably worse than bfq in terms >> of responsiveness and latency for time-sensitive applications [2], and >> in terms of throughput reached while controlling I/O [3]. And, in all >> other tests ran so far, by any entity or group I'm aware of, bfq >> results basically on par with or better than mq-deadline. > > And that's why major distributions are likely to default to BFQ via udev. No one argues with BFQ superiority here ☺. > >> So, I do understand your need for conservativeness, but, after so much >> evidence on single-queue devices, and so many years! :), what's the >> point in keeping Linux worse for virtually everybody, by default? > > From my point of view this is not a conservative approach at all. 
> On the contrary, offloading decisions to userspace aligns pretty well with recent trends like pressure metrics/userspace OOM killer, eBPF etc. The less unnecessary logic the kernel handles, the more flexibility it affords.
>

To not answer too seriously here, let me answer with a quote that is
still missing a clear paternity: "Everything should be made as simple
as possible, but not simpler." :)

Thanks,
Paolo

> --
> Oleksandr Natalenko (post-factum)
On Wed, 2018-10-03 at 17:55 +0200, Paolo Valente wrote: > The problem, in particular, is that bfq is a complex beast, fighting > against a jungle of I/O issues. You have to be really into bfq, even > to just know all of its features! This is a problem by itself. I don't know anyone who wants to have to deal with I/O scheduler tunables. Bart.
> On 3 Oct 2018, at 17:54, Bart Van Assche <bvanassche@acm.org> wrote:
>
> On Wed, 2018-10-03 at 08:29 +0200, Paolo Valente wrote:
>> [1] https://lkml.org/lkml/2017/2/21/791
>> [2] http://algo.ing.unimo.it/people/paolo/disk_sched/results.php
>> [3] https://lwn.net/Articles/763603/
>
> From [2]: "BFQ loses about 18% with only random readers, because the number
> of IOPS becomes so high that the execution time and parallel efficiency of
> the schedulers becomes relevant." Since the number of I/O patterns for which
> results are available on [2] is limited and since the number of devices for
> which test results are available on [2] is limited (e.g. RAID is missing),
> there might be other cases in which configuring BFQ as the default would
> introduce a regression.
>

From [3]: none with throttling loses 80% of the throughput when used
to control I/O. On any drive. And this is really only one example among a ton.

In addition, the test you mention, designed by me, was meant exactly
to find and show the worst breaking point of BFQ. If your main
workload of interest is really made only of tens of parallel threads
doing only sync random I/O, and you care only about throughput,
without any concern for your system becoming so unresponsive as to be
unusable during the test, then, yes, mq-deadline is a better option
for you.

So, are you really sure the balance is in favor of mq-deadline?

Thanks,
Paolo

> I agree with Jens that it's best to leave it to the Linux distributors to
> select a default I/O scheduler.
>
> Bart.
> On 3 Oct 2018, at 18:00, Bart Van Assche <bvanassche@acm.org> wrote:
>
> On Wed, 2018-10-03 at 17:55 +0200, Paolo Valente wrote:
>> The problem, in particular, is that bfq is a complex beast, fighting
>> against a jungle of I/O issues. You have to be really into bfq, even
>> to just know all of its features!
>
> This is a problem by itself. I don't know anyone who wants to have to deal
> with I/O scheduler tunables.
>

In fact, I designed and am constantly improving bfq, exactly so that
you don't have to touch any tunable.

Thanks,
Paolo

> Bart.
>
> Il giorno 03 ott 2018, alle ore 18:02, Paolo Valente <paolo.valente@linaro.org> ha scritto: > > > >> Il giorno 03 ott 2018, alle ore 17:54, Bart Van Assche <bvanassche@acm.org> ha scritto: >> >> On Wed, 2018-10-03 at 08:29 +0200, Paolo Valente wrote: >>> [1] https://lkml.org/lkml/2017/2/21/791 >>> [2] http://algo.ing.unimo.it/people/paolo/disk_sched/results.php >>> [3] https://lwn.net/Articles/763603/ >> >> From [2]: "BFQ loses about 18% with only random readers, because the number >> of IOPS becomes so high that the execution time and parallel efficiency of >> the schedulers becomes relevant." Since the number of I/O patterns for which >> results are available on [2] is limited and since the number of devices for >> which test results are available on [2] is limited (e.g. RAID is missing), >> there might be other cases in which configuring BFQ as the default would >> introduce a regression. >> > > From [3]: none with throttling loses 80% of the throughput when used > to control I/O. On any drive. And this is really only one example among a ton. > I forgot to add that the same 80% loss happens with mq-deadline plus throttling, sorry. In addition, mq-deadline suffers from much more than a 18% loss of throughput, w.r.t. bfq, exactly in the same figure you cited, if there are random writes too. > In addition, the test you mention, designed by me, was meant exactly > to find and show the worst breaking point of BFQ. If your main > workload of interest is really made only of tens of parallel thread > doing only sync random I/O, and you care only about throughput, > without any concern for your system becoming so unresponsive to be > unusable during the test, then, yes, mq-deadline is a better option > for you. > Some more detail on this. The fact that bfq reaches a lower throughput than none in this test is actually still puzzling me, because the process rate of I/O with bfq is one order of magnitude higher than the IOPS of this device. 
So, I still don't understand why, with bfq, the queue of the device
does not get as full as with none, and thus why the throughput with
bfq is not the same as with none. To further test this issue, I
replaced sync I/O with async I/O (with a very high depth). And,
nonsensically (for me), throughput dropped with both bfq and none! I
already meant to report this issue, after investigating it more.
Anyway, this is a different story w.r.t. this thread.

Thanks,
Paolo

> So, are you really sure the balance is in favor of mq-deadline?
>
> Thanks,
> Paolo
>
>> I agree with Jens that it's best to leave it to the Linux distributors to
>> select a default I/O scheduler.
>>
>> Bart.
On Wed, Oct 3, 2018 at 11:53 AM, Paolo Valente <paolo.valente@linaro.org> wrote: > > >> Il giorno 03 ott 2018, alle ore 10:28, Linus Walleij <linus.walleij@linaro.org> ha scritto: >> >> On Wed, Oct 3, 2018 at 9:42 AM Damien Le Moal <Damien.LeMoal@wdc.com> wrote: >> >>> There is another class of outliers: host-managed SMR disks (SATA and SCSI, >>> definitely single hw queue). For these, using mq-deadline is mandatory in many >>> cases in order to guarantee sequential write command delivery to the device >>> driver. Having the default changed to bfq, which as far as I know is not SMR >>> friendly (can sequential writes within a single zone be reordered ?) is asking >>> for troubles (unaligned write errors showing up). >> >> Ah, that is interesting. >> >> Which device driver files are we talking about here, specifically? >> I'd like to take a look. >> >> I guess what you say is not that you are looking for the deadline >> scheduling per se (as in deadline scheduling is nice), what you want is >> the zone locking semantics in that scheduler, is that right? >> >> I.e. this business: >> blk_queue_is_zoned(q) >> blk_req_zone_write_lock(rq); >> blk_req_zone_write_unlock(rq); >> and mq-deadline solves this with a spinlock. >> >> I will augment the patch to enforce mq-deadline >> if blk_queue_is_zoned(q) is true, as it is clear that >> any device with that characteristic must use mq-deadline. >> >> Paoly might be interested in looking into whether BFQ could >> also handle zoned devices in the future, I have no idea of how >> hard that would be. >> > > Absolutely, as I already wrote in my reply to Damien. > > In the meantime, Linus, augmenting your patch as you propose seems > a clean and effective solution to me. > > Thanks, > Paolo > >> The zoned business seems a bit fragile. Should it even be >> allowed to select any other scheduler than deadline on these >> devices? 
Presenting all compiled in schedulers in >> /sysblock/device/queue/scheduler sounds like just giving >> sysadmins too much rope. >> >> Yours, >> Linus Walleij > Right now, users of host-managed SMR drives should be using "deadline" or "mq-deadline", to avoid out-of-order writes in sequential-only zones. I'm running into a situation right now on a test system (Fedora 28, 4.18.7 kernel) where I copied test data onto an F2FS filesystem, but I accidentally forgot to add my "udev rule" file: # cat /etc/udev/rules.d/99-zoned-block-devices.rules ACTION=="add|change", KERNEL=="sd[a-z]", ATTRS{queue/zoned}=="host-managed", ATTR{queue/scheduler}="deadline" ...and now, I see these messages when that specific SMR drive is mounted: kernel: F2FS-fs (sdc): IO Block Size: 4 KB kernel: F2FS-fs (sdc): Found nat_bits in checkpoint kernel: F2FS-fs (sdc): Mounted with checkpoint version = 212216ab kernel: mpt3sas_cm0: log_info(0x31080000): originator(PL), code(0x08), sub_code(0x0000) kernel: mpt3sas_cm0: log_info(0x31080000): originator(PL), code(0x08), sub_code(0x0000) kernel: scsi_io_completion: 20 callbacks suppressed kernel: sd 7:0:0:0: [sdb] tag#0 FAILED Result: hostbyte=DID_OK driverbyte=DRIVER_SENSE kernel: sd 7:0:0:0: [sdb] tag#0 Sense Key : Aborted Command [current] kernel: sd 7:0:0:0: [sdb] tag#0 Add. Sense: No additional sense information kernel: sd 7:0:0:0: [sdb] tag#0 CDB: Write(16) 8a 00 00 00 00 00 3d d4 ec 99 00 00 00 80 00 00 I was also running into problems with creating new directories on this F2FS filesystem. However, "fsck.f2fs" reports no problems. So at this point, I created a new F2FS filesystem on a second SMR drive, and am currently copying the data from the "bad" F2FS filesystem to the "good" one. I wouldn't call zoned block devices "fragile"; they simply have I/O rules that didn't previously exist: all writes to sequential-only zones must be sequential. And one of the things that schedulers do is reorder writes. 
After 4.16, sd stopped being the "gatekeeper" of ensuring sequential writes, but the only "zoned-aware" schedulers were deadline and mq-deadline. Since my test system defaulted to "cfq", I ran into problems. So I welcome any changes that make it impossible for the user to "accidentally use the wrong scheduler". At least this time, I didn't "brick" my test system's BIOS, like I did back in May of this year [1]. Thanks, Bryan [1] https://www.spinics.net/lists/linux-block/msg26798.html
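To make the "all writes to sequential-only zones must be sequential" rule concrete, here is a small userspace model of a single zone with a write pointer. It is a sketch only: struct seq_zone and seq_zone_write() are hypothetical names, and a false return merely stands in for the kind of failed write shown in the log above. The point is that two writes which succeed in submission order fail as soon as something reorders them.

```c
#include <stdbool.h>
#include <stdint.h>

/* Toy model of one sequential-write-required zone. */
struct seq_zone {
	uint64_t start; /* first sector of the zone */
	uint64_t len;   /* zone length in sectors */
	uint64_t wp;    /* write pointer: the only sector a write may start at */
};

/*
 * Accept a write only if it starts exactly at the write pointer and
 * fits in the rest of the zone. Returns false to model a rejected
 * (unaligned) write; the write pointer is left untouched in that case.
 */
bool seq_zone_write(struct seq_zone *z, uint64_t sector, uint64_t nr_sectors)
{
	if (sector != z->wp)
		return false; /* out of order */
	if (nr_sectors > z->start + z->len - z->wp)
		return false; /* would cross the end of the zone */
	z->wp += nr_sectors;
	return true;
}
```

With a 256-sector zone, writing sectors 0..127 and then 128..255 is accepted; dispatching the second request first is rejected, which is the failure mode a scheduler that is not zone-aware can trigger.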
On Wed 03-10-18 17:55:41, Paolo Valente wrote: > > On 03.10.2018 08:29, Paolo Valente wrote: > >> As also Linus Torvalds complained [1], people feel lost among > >> I/O-scheduler options. Actual differences across I/O schedulers are > >> basically obscure to non experts. In this respect, Linux-kernel > >> 'users' are way more than a few top-level distros that can afford a > >> strong performance team, and that, basing on the input of such a team, > >> might venture light-heartedly to change a critical component like an > >> I/O scheduler. Plus, as Linus Walleij pointed out, some users simply > >> are not distros that use udev. > > > > I feel a contradiction in this counter-argument. On one hand, there are lots of, let's call them, home users, that use major distributions with udev, so the distribution maintainers can reasonably decide which scheduler to use for which type of device based on the udev rule and common sense provided via Documentation/ by linux-block devs. Moreover, most likely, those rules should be similar or the same across all the major distros and available via some (systemd?) upstream. > > > > Let me basically repeat Mark's answer here, with my words. > > Unfortunately, facts mismatch with your optimistic view: after so many > years and concordant test results, only very few distributions > switched to bfq, no major distribution did (AFAIK). As I already > wrote, the reason is the one pointed out by Torvalds [1]. Do you want > a simple example? Take the last sentence in Jan's email in this > thread: "I *personally would* consider bfq a safer default ... but *I > don't feel too strongly* about it." And he is definitely a storage > expert. Yeah, but let me add that currently all our released kernels still use legacy block stack for SCSI by default and thus CFQ/deadline. 
And once we feel scsi-mq + BFQ is comparable enough for rotating disks (which may be after your latest changes, Andreas will be running some larger evaluation), we are going to switch to that instead of scsi + CFQ. So it's not like for us it is a question between deadline-mq and BFQ, it is rather between scsi + CFQ vs scsi-mq + BFQ. Honza
On Wed, Oct 03, 2018 at 03:25:54PM +0200, Jan Kara wrote: > On Wed 03-10-18 08:53:37, Linus Walleij wrote: > > On Wed, Oct 3, 2018 at 8:29 AM Paolo Valente <paolo.valente@linaro.org> wrote: > > > > > So, I do understand your need for conservativeness, but, after so much > > > evidence on single-queue devices, and so many years! :), what's the > > > point in keeping Linux worse for virtually everybody, by default? > > > > I understand if we need to ease things in as well, I don't intend this > > change for the current merge window or anything, since v4.19 > > will notably have this patch: > > > > commit d5038a13eca72fb216c07eb717169092e92284f1 > > Author: Johannes Thumshirn <jthumshirn@suse.de> > > Date: Wed Jul 4 10:53:56 2018 +0200 > > > > scsi: core: switch to scsi-mq by default > > > > It has been more than one year since we tried to change the default from > > legacy to multi queue in SCSI with commit c279bd9e406 ("scsi: default to > > scsi-mq"). But due to issues with suspend/resume and performance problems > > it had been reverted again with commit cbe7dfa26eee ("Revert "scsi: default > > to scsi-mq""). > > > > In the meantime there have been a substantial amount of performance > > improvements and suspend/resume got fixed as well, thus we can re-enable > > scsi-mq without a significant performance penalty. > > > > Signed-off-by: Johannes Thumshirn <jthumshirn@suse.de> > > Reviewed-by: Hannes Reinecke <hare@suse.com> > > Reviewed-by: Ming Lei <ming.lei@redhat.com> > > Acked-by: John Garry <john.garry@huawei.com> > > Signed-off-by: Martin K. Petersen <martin.petersen@oracle.com> > > > > I guess that patch can be a bit scary by itself. But IIUC it all went > > fine this time! > > > > But hey, if that works, that means $SUBJECT patch will enable BFQ on all > > libata devices and any SCSI that is single queue as well, not just > > "obscure" stuff like MMC/SD and UBI, and that is > > indeed a massive crowd of legacy devices. But we're talking > > v4.21 here. 
> >
> > Johannes, you might be interested in $SUBJECT patch.
> > It'd be nice to hear what SUSE people have to add, since they
> > are pretty proactive in this area.
>
> So we do have udev rules in our distro which set the IO scheduler based
> on device parameters (rotational at least, with blk-mq we might start
> considering number of queues as well, plus we have some exceptions like
> virtio, loop, etc.). So the kernel default doesn't concern us too much as a
> distro.
>
> I personally would consider bfq a safer default for single-queue devices
> (loop probably needs exception) but I don't feel too strongly about it.

[Full quote for context]

What about resurrecting CONFIG_DEFAULT_IOSCHED for MQ as well and
leave it default to mq-deadline but give bfq, kyber and none as a
choice as well?

The question is shall we only do it for single queue devices or for
native MQ devices as well if we go down that road?

I understand the embedded folks will want a different interface than
udev, but from the non-embedded point of view I'm with Jens and Jan
here, let udev do the job.

Johannes
On Wed, Oct 3, 2018 at 7:34 PM Bryan Gurney <bgurney@redhat.com> wrote:

> Right now, users of host-managed SMR drives should be using "deadline"
> or "mq-deadline", to avoid out-of-order writes in sequential-only
> zones.
>
> I'm running into a situation right now on a test system (Fedora 28,
> 4.18.7 kernel) where I copied test data onto an F2FS filesystem, but I
> accidentally forgot to add my "udev rule" file:

This should be fixed after
d5038a13eca7 scsi: core: switch to scsi-mq by default
right? Since mq uses mq-deadline by default.

I'm making sure to preserve mq-deadline on zoned devices in my v2 of
this patch.

Yours,
Linus Walleij
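The v2 patch is not part of this thread, so the following is only a guess at what "preserve mq-deadline on zoned devices" could look like, again as a userspace model rather than kernel code. struct sq_device and pick_sched_zoned_aware() are hypothetical names. Two choices here are assumptions and not something the thread settles: the zoned check is made before the queue-count check, and a zoned device without mq-deadline falls back to "none".

```c
#include <stdbool.h>

/* Hypothetical description of a device for the purpose of this model. */
struct sq_device {
	unsigned int nr_hw_queues;
	bool zoned;           /* stands in for blk_queue_is_zoned(q) */
	bool has_bfq;         /* assumed: "bfq" can be obtained */
	bool has_mq_deadline; /* assumed: "mq-deadline" can be obtained */
};

const char *pick_sched_zoned_aware(const struct sq_device *d)
{
	/* Assumption: zoned devices never get bfq, whatever the queue count. */
	if (d->zoned)
		return d->has_mq_deadline ? "mq-deadline" : "none";
	if (d->nr_hw_queues != 1)
		return "none";
	if (d->has_bfq)
		return "bfq";
	return d->has_mq_deadline ? "mq-deadline" : "none";
}
```

Whether a zoned multiqueue device should also be forced to mq-deadline is exactly the sort of policy question the replies in this thread leave open.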
On Thu, Oct 04, 2018 at 09:45:35AM +0200, Johannes Thumshirn wrote: > On Wed, Oct 03, 2018 at 03:25:54PM +0200, Jan Kara wrote: > > On Wed 03-10-18 08:53:37, Linus Walleij wrote: > > > On Wed, Oct 3, 2018 at 8:29 AM Paolo Valente <paolo.valente@linaro.org> wrote: > > > > > > > So, I do understand your need for conservativeness, but, after so much > > > > evidence on single-queue devices, and so many years! :), what's the > > > > point in keeping Linux worse for virtually everybody, by default? > > > > > > I understand if we need to ease things in as well, I don't intend this > > > change for the current merge window or anything, since v4.19 > > > will notably have this patch: > > > > > > commit d5038a13eca72fb216c07eb717169092e92284f1 > > > Author: Johannes Thumshirn <jthumshirn@suse.de> > > > Date: Wed Jul 4 10:53:56 2018 +0200 > > > > > > scsi: core: switch to scsi-mq by default > > > > > > It has been more than one year since we tried to change the default from > > > legacy to multi queue in SCSI with commit c279bd9e406 ("scsi: default to > > > scsi-mq"). But due to issues with suspend/resume and performance problems > > > it had been reverted again with commit cbe7dfa26eee ("Revert "scsi: default > > > to scsi-mq""). > > > > > > In the meantime there have been a substantial amount of performance > > > improvements and suspend/resume got fixed as well, thus we can re-enable > > > scsi-mq without a significant performance penalty. > > > > > > Signed-off-by: Johannes Thumshirn <jthumshirn@suse.de> > > > Reviewed-by: Hannes Reinecke <hare@suse.com> > > > Reviewed-by: Ming Lei <ming.lei@redhat.com> > > > Acked-by: John Garry <john.garry@huawei.com> > > > Signed-off-by: Martin K. Petersen <martin.petersen@oracle.com> > > > > > > I guess that patch can be a bit scary by itself. But IIUC it all went > > > fine this time! 
> > > > > > But hey, if that works, that means $SUBJECT patch will enable BFQ on all > > > libata devices and any SCSI that is single queue as well, not just > > > "obscure" stuff like MMC/SD and UBI, and that is > > > indeed a massive crowd of legacy devices. But we're talking > > > v4.21 here. > > > > > > Johannes, you might be interested in $SUBJECT patch. > > > It'd be nice to hear what SUSE people have to add, since they > > > are pretty proactive in this area. > > > > So we do have a udev rules in our distro which sets the IO scheduler based > > on device parameters (rotational at least, with blk-mq we might start > > considering number of queues as well, plus we have some exceptions like > > virtio, loop, etc.). So the kernel default doesn't concern us too much as a > > distro. > > > > I personally would consider bfq a safer default for single-queue devices > > (loop probably needs exception) but I don't feel too strongly about it. > > [Full quote for context] > > What about resurrecting CONFIG_DEFAULT_IOSCHED for MQ as well and > leave it default to mq-deadline but give bfq, kyber and none as a > choice as well? I second this -- introduction of a CONFIG_DEFAULT_MQ_IOSCHED. Having a default I/O scheduler kernel config option for MQ allows to build a kernel suitable for specific use w/o userspace dependencies. (But it still allows to reconfigure things via userspace.) > The question is shall we only do it for single queue devices or for > native MQ devices as well if we go down that road? Good question. I am not yet sure about this. I'd start with using the default for single queue devices. Andreas > I understand the embedded floks will want a different interface than > udev, but from the non-embedded point of view I'm with Jens and Jan > here, let udev do the job. > > Johannes > -- > Johannes Thumshirn Storage > jthumshirn@suse.de +49 911 74053 689 > SUSE LINUX GmbH, Maxfeldstr. 
5, 90409 Nürnberg > GF: Felix Imendörffer, Jane Smithard, Graham Norton > HRB 21284 (AG Nürnberg) > Key fingerprint = EC38 9CAB C2C4 F25D 8600 D0D0 0393 969D 2D76 0850
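A build-time default along the lines of the proposed CONFIG_DEFAULT_MQ_IOSCHED could be modelled roughly as follows. This is a userspace sketch, not a Kconfig patch: MODEL_DEFAULT_MQ_IOSCHED is a made-up macro standing in for the proposed option, default_mq_sched() is a hypothetical helper, and applying the default to single queue devices only follows Andreas' "I'd start with" suggestion rather than any agreed design.

```c
#include <stddef.h>
#include <string.h>

/* Hypothetical build-time choice; a kernel would take this from Kconfig. */
#ifndef MODEL_DEFAULT_MQ_IOSCHED
#define MODEL_DEFAULT_MQ_IOSCHED "mq-deadline"
#endif

/*
 * Return the configured default if it is among the schedulers that
 * are actually available, otherwise "none". Multiqueue devices are
 * left at "none", as in the current code.
 */
const char *default_mq_sched(const char *const *available, size_t n,
			     unsigned int nr_hw_queues)
{
	size_t i;

	if (nr_hw_queues != 1)
		return "none";
	for (i = 0; i < n; i++) {
		if (strcmp(available[i], MODEL_DEFAULT_MQ_IOSCHED) == 0)
			return available[i];
	}
	return "none";
}
```

Userspace could still override the result at runtime, which matches the remark that a config option "still allows to reconfigure things via userspace".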
On Wed, Oct 3, 2018 at 1:49 PM Oleksandr Natalenko <oleksandr@natalenko.name> wrote:

> On another hand, the users of embedded devices, mentioned by Linus,
> should already know what scheduler to choose because dealing with
> embedded world assumes the person can decide this on their own, or with
> the help of abovementioned udev scripts and/or Documentation/ as a
> reference point.
>
> So I see no obstacles here, and the choice to rely on udev by default
> sounds reasonable.

I am sorry but I do not agree with this.

There are several historical precedents where we have concluded
that just "have the kernel do the right thing by default" is the way
to go.

Example 1: pluggable CPU schedulers.
The reasoning was that users or distros have no clue what scheduler
they want, only scheduler developers do. We drove it to the point
where we have one and one scheduler only, not different flavors.
(Special usecases have special scheduling classes inside the one
scheduler instead.)

Example 2: Automatic process group scheduling
The reasoning was that daemons such as systemd would be better
at placing processes/tasks into the right control groups to manage
their resources, so this would be a userspace policy handled by the
udev/systemd complex. We did not do that. Instead the kernel does
autogrouping per-session, indeed it is a Kconfig option but even
e.g. Fedora has this enabled by default. (commit 5091faa449ee)

As pointed out elsewhere: these defaults make it easy for custom
builds not using udev+systemd to get a system up and running
with sensible defaults.

Simple embedded systems use BusyBox's mdev (I wouldn't trust it
to make any complex decisions). OpenWRT has ubox+ubus+uci, also
extremely lightweight, Android has its own init system that I
don't manage to keep track of anymore. Instead of running all over
the map and fixing these userspaces to do the right thing, it makes
sense to make the right thing the default.
And these are millions and millions of deployed systems not using udev+systemd we are talking about, they are not fringe hobby projects. It's not that I personally dislike udev or anything, I kind of like it, but these tailored distros simply don't use it and they are huge in numbers. They need help to do the right thing. Fixing a udev rule doesn't solve even half the world's problems I'm afraid. Yours, Linus Walleij
On 3 October 2018 at 19:34, Bryan Gurney <bgurney@redhat.com> wrote: > On Wed, Oct 3, 2018 at 11:53 AM, Paolo Valente <paolo.valente@linaro.org> wrote: >> >> >>> Il giorno 03 ott 2018, alle ore 10:28, Linus Walleij <linus.walleij@linaro.org> ha scritto: >>> >>> On Wed, Oct 3, 2018 at 9:42 AM Damien Le Moal <Damien.LeMoal@wdc.com> wrote: >>> >>>> There is another class of outliers: host-managed SMR disks (SATA and SCSI, >>>> definitely single hw queue). For these, using mq-deadline is mandatory in many >>>> cases in order to guarantee sequential write command delivery to the device >>>> driver. Having the default changed to bfq, which as far as I know is not SMR >>>> friendly (can sequential writes within a single zone be reordered ?) is asking >>>> for troubles (unaligned write errors showing up). >>> >>> Ah, that is interesting. >>> >>> Which device driver files are we talking about here, specifically? >>> I'd like to take a look. >>> >>> I guess what you say is not that you are looking for the deadline >>> scheduling per se (as in deadline scheduling is nice), what you want is >>> the zone locking semantics in that scheduler, is that right? >>> >>> I.e. this business: >>> blk_queue_is_zoned(q) >>> blk_req_zone_write_lock(rq); >>> blk_req_zone_write_unlock(rq); >>> and mq-deadline solves this with a spinlock. >>> >>> I will augment the patch to enforce mq-deadline >>> if blk_queue_is_zoned(q) is true, as it is clear that >>> any device with that characteristic must use mq-deadline. >>> >>> Paoly might be interested in looking into whether BFQ could >>> also handle zoned devices in the future, I have no idea of how >>> hard that would be. >>> >> >> Absolutely, as I already wrote in my reply to Damien. >> >> In the meantime, Linus, augmenting your patch as you propose seems >> a clean and effective solution to me. >> >> Thanks, >> Paolo >> >>> The zoned business seems a bit fragile. 
Should it even be >>> allowed to select any other scheduler than deadline on these >>> devices? Presenting all compiled in schedulers in >>> /sysblock/device/queue/scheduler sounds like just giving >>> sysadmins too much rope. >>> >>> Yours, >>> Linus Walleij >> > > Right now, users of host-managed SMR drives should be using "deadline" > or "mq-deadline", to avoid out-of-order writes in sequential-only > zones. > > I'm running into a situation right now on a test system (Fedora 28, > 4.18.7 kernel) where I copied test data onto an F2FS filesystem, but I > accidentally forgot to add my "udev rule" file: > > # cat /etc/udev/rules.d/99-zoned-block-devices.rules > ACTION=="add|change", KERNEL=="sd[a-z]", > ATTRS{queue/zoned}=="host-managed", ATTR{queue/scheduler}="deadline" > > ...and now, I see these messages when that specific SMR drive is mounted: > > kernel: F2FS-fs (sdc): IO Block Size: 4 KB > kernel: F2FS-fs (sdc): Found nat_bits in checkpoint > kernel: F2FS-fs (sdc): Mounted with checkpoint version = 212216ab > kernel: mpt3sas_cm0: log_info(0x31080000): originator(PL), code(0x08), > sub_code(0x0000) > kernel: mpt3sas_cm0: log_info(0x31080000): originator(PL), code(0x08), > sub_code(0x0000) > kernel: scsi_io_completion: 20 callbacks suppressed > kernel: sd 7:0:0:0: [sdb] tag#0 FAILED Result: hostbyte=DID_OK > driverbyte=DRIVER_SENSE > kernel: sd 7:0:0:0: [sdb] tag#0 Sense Key : Aborted Command [current] > kernel: sd 7:0:0:0: [sdb] tag#0 Add. Sense: No additional sense information > kernel: sd 7:0:0:0: [sdb] tag#0 CDB: Write(16) 8a 00 00 00 00 00 3d d4 > ec 99 00 00 00 80 00 00 > > I was also running into problems with creating new directories on this > F2FS filesystem. However, "fsck.f2fs" reports no problems. So at > this point, I created a new F2FS filesystem on a second SMR drive, and > am currently copying the data from the "bad" F2FS filesystem to the > "good" one. 
>
> I wouldn't call zoned block devices "fragile"; they simply have I/O
> rules that didn't previously exist: all writes to sequential-only
> zones must be sequential. And one of the things that schedulers do is
> reorder writes. After 4.16, sd stopped being the "gatekeeper" of
> ensuring sequential writes, but the only "zoned-aware" schedulers were
> deadline and mq-deadline. Since my test system defaulted to "cfq", I
> ran into problems.
>
> So I welcome any changes that make it impossible for the user to
> "accidentally use the wrong scheduler".

I fully agree.

>
> At least this time, I didn't "brick" my test system's BIOS, like I did
> back in May of this year [1].

It sounds to me that the kernel isn't doing its job. In particular, the kernel has the information it needs to select the proper I/O scheduler (the block layer could just check BLK_ZONE_TYPE_SEQWRITE_REQ/ZBC_ZONE_TYPE_SEQWRITE_REQ). Instead it relies on userspace to do the right thing, which can't be right.

Kind regards
Uffe
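The idea of making it impossible to "accidentally use the wrong scheduler" on a host-managed zoned device could look roughly like the check below. This is only a userspace sketch, not kernel code: the function names `sched_is_zone_aware()` and `model_sched_allowed()` are hypothetical, and the list of zone-aware schedulers is taken from the statement above that only "deadline" and "mq-deadline" qualified at the time.

```c
#include <assert.h>
#include <stdbool.h>
#include <string.h>

/*
 * Hypothetical helper: true if the named scheduler preserves the order
 * of sequential writes within a zone. The two names come from the
 * thread; a real kernel would use a scheduler feature flag instead.
 */
bool sched_is_zone_aware(const char *name)
{
	return strcmp(name, "deadline") == 0 ||
	       strcmp(name, "mq-deadline") == 0;
}

/*
 * Hypothetical policy check: may this scheduler be selected for a
 * device? Non-zoned devices accept anything; host-managed zoned
 * devices only accept zone-aware schedulers.
 */
bool model_sched_allowed(bool host_managed_zoned, const char *name)
{
	assert(name != NULL);

	if (!host_managed_zoned)
		return true;

	return sched_is_zone_aware(name);
}
```

With such a check in the scheduler-switch path, writing "bfq" or "cfq" to the scheduler attribute of a host-managed SMR drive would be rejected instead of silently allowing reordered writes.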
On Thu, Oct 04, 2018 at 10:14:38AM +0200, Linus Walleij wrote: > And these are millions and millions of deployed > systems not using udev+systemd we are talking about, > they are not fringe hobby projects. It's not that I > personally dislike udev or anything, I kind of like > it, but these tailored distros simply don't use it > and they are huge in numbers. They need help to do > the right thing. Fixing a udev rule doesn't solve > even half the world's problems I'm afraid. Further, even those embedded systems that do use udev (some of them do lean heavily on modern init stuff like that and systemd, especially when boot time is a priority) they'll still need to get the relevant udev rule installed somehow.
On Thu, 2018-10-04 at 11:13 +0100, Mark Brown wrote: > On Thu, Oct 04, 2018 at 10:14:38AM +0200, Linus Walleij wrote: > > > And these are millions and millions of deployed > > systems not using udev+systemd we are talking about, > > they are not fringe hobby projects. It's not that I > > personally dislike udev or anything, I kind of like > > it, but these tailored distros simply don't use it > > and they are huge in numbers. They need help to do > > the right thing. Fixing a udev rule doesn't solve > > even half the world's problems I'm afraid. > > Further, even those embedded systems that do use udev (some of them do > lean heavily on modern init stuff like that and systemd, especially when > boot time is a priority) they'll still need to get the relevant udev > rule installed somehow. Hi Mark, Are you aware that the systemd source tree includes a set of udev rules? See also https://github.com/systemd/systemd/tree/master/rules. Bart.
On Thu, Oct 04, 2018 at 08:10:57AM -0700, Bart Van Assche wrote: > On Thu, 2018-10-04 at 11:13 +0100, Mark Brown wrote: > > Further, even those embedded systems that do use udev (some of them do > > lean heavily on modern init stuff like that and systemd, especially when > > boot time is a priority) they'll still need to get the relevant udev > > rule installed somehow. > Are you aware that the systemd source tree includes a set of udev rules? > See also https://github.com/systemd/systemd/tree/master/rules. Yeah, but then you're back to the situation where someone needs to go pick up a new version of systemd to get the new rules along with the new kernel. It's not insurmountable but it's an obstacle.
On Thu, 2018-10-04 at 20:25 +0100, Alan Cox wrote: > > I agree with Jens that it's best to leave it to the Linux distributors to > > select a default I/O scheduler. > > That assumes such a thing exists. The kernel knows what devices it is > dealing with. The kernel 'default' ought to be 'whatever is usually best > for this device'. A distro cannot just pick a correct single default > because NVME and USB sticks are both normal and rather different in needs. Which I/O scheduler works best also depends which workload the user will run. BFQ has significant advantages for interactive workloads like video replay with concurrent background I/O but probably slows down kernel builds. That's why I'm not sure whether the kernel should select the default I/O scheduler. Bart.
> On 4 Oct 2018, at 21:25, Alan Cox <gnomes@lxorguk.ukuu.org.uk> wrote:
>
>> I agree with Jens that it's best to leave it to the Linux distributors to
>> select a default I/O scheduler.
>
> That assumes such a thing exists.

Well, as of now the default is more or less in the Schrödinger's cat state :)

Metaphors apart, I do agree with your point. As for me, what you point out is one of the core issues at stake here.

Thanks,
Paolo

> The kernel knows what devices it is
> dealing with. The kernel 'default' ought to be 'whatever is usually best
> for this device'. A distro cannot just pick a correct single default
> because NVME and USB sticks are both normal and rather different in needs.
>
> Alan
> On 4 Oct 2018, at 22:09, Bart Van Assche <bvanassche@acm.org> wrote:
>
> On Thu, 2018-10-04 at 20:25 +0100, Alan Cox wrote:
>>> I agree with Jens that it's best to leave it to the Linux distributors to
>>> select a default I/O scheduler.
>>
>> That assumes such a thing exists. The kernel knows what devices it is
>> dealing with. The kernel 'default' ought to be 'whatever is usually best
>> for this device'. A distro cannot just pick a correct single default
>> because NVME and USB sticks are both normal and rather different in needs.
>
> Which I/O scheduler works best also depends which workload the user will run.
> BFQ has significant advantages for interactive workloads like video replay
> with concurrent background I/O but probably slows down kernel builds.

No, kernel build is, for evident reasons, one of the workloads I cared most about. Actually, I tried to focus on all my main kernel-development tasks, such as also git checkout, git merge, git grep, ...

According to my test results, with BFQ these tasks are at least as fast as, or, in most system configurations, much faster than with the other schedulers. Of course, at the same time the system also remains responsive with BFQ.

You can repeat these tests using one of my first scripts in the S suite: kern_dev_tasks_vs_rw.sh (usually, the older the tests, the more hypertrophied the names I gave :) ).

I stopped sharing also my kernel-build results years ago, because I went on obtaining the same, identical good results for years, and I'm aware that I tend to show and say too much stuff.

Thanks,
Paolo

> That's
> why I'm not sure whether the kernel should select the default I/O scheduler.
>
> Bart.
On Thu, 2018-10-04 at 22:39 +0200, Paolo Valente wrote: > No, kernel build is, for evident reasons, one of the workloads I cared > most about. Actually, I tried to focus on all my main > kernel-development tasks, such as also git checkout, git merge, git > grep, ... > > According to my test results, with BFQ these tasks are at least as > fast as, or, in most system configurations, much faster than with the > other schedulers. Of course, at the same time the system also remains > responsive with BFQ. > > You can repeat these tests using one of my first scripts in the S > suite: kern_dev_tasks_vs_rw.sh (usually, the older the tests, the more > hypertrophied the names I gave :) ). > > I stopped sharing also my kernel-build results years ago, because I > went on obtaining the same, identical good results for years, and I'm > aware that I tend to show and say too much stuff. On my test setup building the kernel is slightly slower when using the BFQ scheduler compared to using scheduler "none" (kernel 4.18.12, NVMe SSD, single CPU with 6 cores, hyperthreading disabled). I am aware that the proposal at the start of this thread was to make BFQ the default for devices with a single hardware queue and not for devices like NVMe SSDs that support multiple hardware queues. What I think is missing is measurement results for BFQ on a system with multiple CPU sockets and against a fast storage medium. Eliminating the host lock from the SCSI core yielded a significant performance improvement for such storage devices. Since the BFQ scheduler locks and unlocks bfqd->lock for every dispatch operation it is very likely that BFQ will slow down I/O for fast storage devices, even if their driver only creates a single hardware queue. Bart.
On Wed, Oct 03, 2018 at 08:15:24AM -0700, Bart Van Assche wrote: > On Wed, 2018-10-03 at 08:01 -0700, Christoph Hellwig wrote: > > On Wed, Oct 03, 2018 at 07:58:52AM -0700, Bart Van Assche wrote: > > > Has the work with the T10 committee to standardize the SCSI equivalent of anonymous > > > writes already started? > > > > No, and I don't know of anyone who wants to do that in the short term. > > That's unfortunate. I think having such a command available in the SCSI > command set would be a step forward. I'm not saying it doesn't make sense, only that I don't know of any short term plans.
On Thu, 2018-10-04 at 13:09 -0700, Bart Van Assche wrote:
> On Thu, 2018-10-04 at 20:25 +0100, Alan Cox wrote:
> > > I agree with Jens that it's best to leave it to the Linux distributors to
> > > select a default I/O scheduler.
> >
> > That assumes such a thing exists. The kernel knows what devices it is
> > dealing with. The kernel 'default' ought to be 'whatever is usually best
> > for this device'. A distro cannot just pick a correct single default
> > because NVME and USB sticks are both normal and rather different in needs.
>
> Which I/O scheduler works best also depends which workload the user will run.
> BFQ has significant advantages for interactive workloads like video replay
> with concurrent background I/O but probably slows down kernel builds. That's
> why I'm not sure whether the kernel should select the default I/O scheduler.

What's wrong with this simple hierarchy?

1. Block core selects the default scheduler.
2. Driver can overrule it early.
3. Userspace can overrule the default later.

Everyone is happy. Good defaults in block core are great. Those defaults + #3 may cover 99% of the population. 1% of the population can use #2.

See, Linus wants "bfq" for ubiblock. Why wouldn't we let him work with the UBI community, show that bfq is best for ubiblock, and just let the UBI community overrule the block core's default?

If some day in the future there is a very good reason, we can even make this a module parameter, and people could just boot with 'ubiblock.iosched=bfq'.
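The three-level hierarchy above can be modelled in a few lines of userspace C. This is a sketch under assumptions, not an existing kernel interface: `struct sched_policy` and `effective_sched()` are hypothetical names, and a NULL pointer is used here to mean "this level has no opinion".

```c
#include <assert.h>
#include <stddef.h>

/*
 * Hypothetical three-level policy. NULL means "no opinion at this
 * level"; the block core level must always provide a default.
 */
struct sched_policy {
	const char *core_default;    /* 1. chosen by the block core */
	const char *driver_override; /* 2. set early by a driver, e.g. ubiblock */
	const char *user_override;   /* 3. set later from userspace */
};

/* The latest level that expressed an opinion wins. */
const char *effective_sched(const struct sched_policy *p)
{
	assert(p != NULL && p->core_default != NULL);

	if (p->user_override)
		return p->user_override;
	if (p->driver_override)
		return p->driver_override;
	return p->core_default;
}
```

In this model a driver that has shown "bfq" to be best for its devices sets `driver_override`, and an administrator who disagrees can still set `user_override` afterwards.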
Hi! > I talked to Pavel a bit back and it turns out he has a > usecase for BFQ as well and I bet he also would like it > as default scheduler for that system (Pavel tell us more, > I don't remember what it was!) I'm not sure I remember clearly, either. IIRC I was working with ionice on spinning disks, and it had no effect. I switched to BFQ and suddenly ionice was effective. Best regards, Pavel
On Thu 04-10-18 15:42:52, Bart Van Assche wrote: > On Thu, 2018-10-04 at 22:39 +0200, Paolo Valente wrote: > > No, kernel build is, for evident reasons, one of the workloads I cared > > most about. Actually, I tried to focus on all my main > > kernel-development tasks, such as also git checkout, git merge, git > > grep, ... > > > > According to my test results, with BFQ these tasks are at least as > > fast as, or, in most system configurations, much faster than with the > > other schedulers. Of course, at the same time the system also remains > > responsive with BFQ. > > > > You can repeat these tests using one of my first scripts in the S > > suite: kern_dev_tasks_vs_rw.sh (usually, the older the tests, the more > > hypertrophied the names I gave :) ). > > > > I stopped sharing also my kernel-build results years ago, because I > > went on obtaining the same, identical good results for years, and I'm > > aware that I tend to show and say too much stuff. > > On my test setup building the kernel is slightly slower when using the BFQ > scheduler compared to using scheduler "none" (kernel 4.18.12, NVMe SSD, > single CPU with 6 cores, hyperthreading disabled). I am aware that the > proposal at the start of this thread was to make BFQ the default for devices > with a single hardware queue and not for devices like NVMe SSDs that support > multiple hardware queues. > > What I think is missing is measurement results for BFQ on a system with > multiple CPU sockets and against a fast storage medium. Eliminating > the host lock from the SCSI core yielded a significant performance > improvement for such storage devices. Since the BFQ scheduler locks and > unlocks bfqd->lock for every dispatch operation it is very likely that BFQ > will slow down I/O for fast storage devices, even if their driver only > creates a single hardware queue. Well, I'm not sure why that is missing. I don't think anyone proposed to default to BFQ for such setup? 
Neither was anyone claiming that BFQ is better in such situation... The proposal has been: Default to BFQ for slow storage, leave it to deadline-mq otherwise. Honza
> Il giorno 05 ott 2018, alle ore 00:42, Bart Van Assche <bvanassche@acm.org> ha scritto: > > On Thu, 2018-10-04 at 22:39 +0200, Paolo Valente wrote: >> No, kernel build is, for evident reasons, one of the workloads I cared >> most about. Actually, I tried to focus on all my main >> kernel-development tasks, such as also git checkout, git merge, git >> grep, ... >> >> According to my test results, with BFQ these tasks are at least as >> fast as, or, in most system configurations, much faster than with the >> other schedulers. Of course, at the same time the system also remains >> responsive with BFQ. >> >> You can repeat these tests using one of my first scripts in the S >> suite: kern_dev_tasks_vs_rw.sh (usually, the older the tests, the more >> hypertrophied the names I gave :) ). >> >> I stopped sharing also my kernel-build results years ago, because I >> went on obtaining the same, identical good results for years, and I'm >> aware that I tend to show and say too much stuff. > > On my test setup building the kernel is slightly slower when using the BFQ > scheduler compared to using scheduler "none" (kernel 4.18.12, NVMe SSD, > single CPU with 6 cores, hyperthreading disabled). I am aware that the > proposal at the start of this thread was to make BFQ the default for devices > with a single hardware queue and not for devices like NVMe SSDs that support > multiple hardware queues. > I miss your point: as you yourself note, the proposal is limited to single-queue devices, exactly because BFQ is not ready for multiple-queue devices yet. > What I think is missing is measurement results for BFQ on a system with > multiple CPU sockets and against a fast storage medium. It is not missing. As I happened to report in previous threads, we made a script to measure that too [1], using fio and null block. I have reported the results we obtained, for three classes of processors, in the in-kernel BFQ documentation [2]. 
In particular, BFQ reached 400 KIOPS with the fastest CPU mentioned in that document (Intel i7-4850HQ). So, since the speed of that single-socket commodity CPU is most likely lower than the total speed of a multi-socket system, on such a system and with BFQ you should conservatively be OK with single-queue devices in the range 300-500 KIOPS.

[1] https://github.com/Algodev-github/IOSpeed
[2] https://www.kernel.org/doc/Documentation/block/bfq-iosched.txt

>
> Eliminating
> the host lock from the SCSI core yielded a significant performance
> improvement for such storage devices. Since the BFQ scheduler locks and
> unlocks bfqd->lock for every dispatch operation it is very likely that BFQ
> will slow down I/O for fast storage devices, even if their driver only
> creates a single hardware queue.
>

One of the main motivations behind NVMe, and blk-mq itself, is that it is hard to reach the above IOPS, and more, with a single I/O queue as bottleneck.

So, I wouldn't expect that systems
- equipped with single-queue drives reaching more than 500 KIOPS
- using SATA or some other non-NVMe protocol
- fast enough to push these drives to their maximum speeds
constitute more than a negligible percentage of devices.

So, by sticking to mq-deadline, we would sacrifice 99% of systems, to make sure, basically, that those very few systems on steroids reach maximum throughput with random I/O (while however still suffering from responsiveness problems). I think it makes much more sense to have as default what is best for 99% of the single-queue systems, with those super systems properly reconfigured by their users. For sure, other defaults are to be changed too, to get the most out of those systems.

Thanks,
Paolo

> Bart.
Hi! > On another hand, the users of embedded devices, mentioned by Linus, > > should already know what scheduler to choose because dealing with > > embedded world assumes the person can decide this on their own, or with > > the help of abovementioned udev scripts and/or Documentation/ as a > > reference point. > > > > So I see no obstacles here, and the choice to rely on udev by default > > sounds reasonable. > > > > I am sorry but I do not agree with this. > > There are several historical precedents where we have > concluded that just "have the kernel do the right thing > by default" is the way to go. Kernel should do the right thing by default, I agree with Linus W. Having reasonable defaults is useful; yes, my desktop has udev etc, but I still want reasonable scheduler when doing fsck in init=/bin/bash. Plus, I update kernels more often than distros, and I run various embedded stuff. Kernel should just provide reasonable defaults. And yes, we have "ionice" command and yes, it would be nice if it worked by default... Pavel
On 10/5/18 2:16 AM, Jan Kara wrote: > On Thu 04-10-18 15:42:52, Bart Van Assche wrote: >> What I think is missing is measurement results for BFQ on a system with >> multiple CPU sockets and against a fast storage medium. Eliminating >> the host lock from the SCSI core yielded a significant performance >> improvement for such storage devices. Since the BFQ scheduler locks and >> unlocks bfqd->lock for every dispatch operation it is very likely that BFQ >> will slow down I/O for fast storage devices, even if their driver only >> creates a single hardware queue. > > Well, I'm not sure why that is missing. I don't think anyone proposed to > default to BFQ for such setup? Neither was anyone claiming that BFQ is > better in such situation... The proposal has been: Default to BFQ for slow > storage, leave it to deadline-mq otherwise. Hi Jan, How do you define slow storage? The proposal at the start of this thread was to make BFQ the default for all block devices that create a single hardware queue. That includes all SATA storage since scsi-mq only creates a single hardware queue when using the SATA protocol. The proposal to make BFQ the default for systems with a single hard disk probably makes sense but I am not sure that making BFQ the default for systems equipped with one or more (SATA) SSDs is also a good idea. Especially for multi-socket systems since BFQ reintroduces a queue-wide lock. As you know no queue-wide locking happens during I/O in the scsi-mq core nor in the blk-mq core. Bart.
> Il giorno 06 ott 2018, alle ore 05:12, Bart Van Assche <bvanassche@acm.org> ha scritto: > > On 10/5/18 2:16 AM, Jan Kara wrote: >> On Thu 04-10-18 15:42:52, Bart Van Assche wrote: >>> What I think is missing is measurement results for BFQ on a system with >>> multiple CPU sockets and against a fast storage medium. Eliminating >>> the host lock from the SCSI core yielded a significant performance >>> improvement for such storage devices. Since the BFQ scheduler locks and >>> unlocks bfqd->lock for every dispatch operation it is very likely that BFQ >>> will slow down I/O for fast storage devices, even if their driver only >>> creates a single hardware queue. >> Well, I'm not sure why that is missing. I don't think anyone proposed to >> default to BFQ for such setup? Neither was anyone claiming that BFQ is >> better in such situation... The proposal has been: Default to BFQ for slow >> storage, leave it to deadline-mq otherwise. > > Hi Jan, > > How do you define slow storage? The proposal at the start of this thread was to make BFQ the default for all block devices that create a single hardware queue. That includes all SATA storage since scsi-mq only creates a single hardware queue when using the SATA protocol. The proposal to make BFQ the default for systems with a single hard disk probably makes sense but I am not sure that making BFQ the default for systems equipped with one or more (SATA) SSDs is also a good idea. Especially for multi-socket systems since BFQ reintroduces a queue-wide lock. No, BFQ has no queue-wide lock. The very first change made to BFQ for porting it to blk-mq was to remove the queue lock. Guided by Jens, I replaced that lock with the exact, same scheduler lock used in mq-deadline. Thanks, Paolo > As you know no queue-wide locking happens during I/O in the scsi-mq core nor in the blk-mq core. > > Bart.
On 10/5/18 11:46 PM, Paolo Valente wrote: >> Il giorno 06 ott 2018, alle ore 05:12, Bart Van Assche <bvanassche@acm.org> ha scritto: >> On 10/5/18 2:16 AM, Jan Kara wrote: >>> On Thu 04-10-18 15:42:52, Bart Van Assche wrote: >>>> What I think is missing is measurement results for BFQ on a system with >>>> multiple CPU sockets and against a fast storage medium. Eliminating >>>> the host lock from the SCSI core yielded a significant performance >>>> improvement for such storage devices. Since the BFQ scheduler locks and >>>> unlocks bfqd->lock for every dispatch operation it is very likely that BFQ >>>> will slow down I/O for fast storage devices, even if their driver only >>>> creates a single hardware queue. >>> Well, I'm not sure why that is missing. I don't think anyone proposed to >>> default to BFQ for such setup? Neither was anyone claiming that BFQ is >>> better in such situation... The proposal has been: Default to BFQ for slow >>> storage, leave it to deadline-mq otherwise. >> >> How do you define slow storage? The proposal at the start of this thread >> was to make BFQ the default for all block devices that create a single >> hardware queue. That includes all SATA storage since scsi-mq only creates >> a single hardware queue when using the SATA protocol. The proposal to make >> BFQ the default for systems with a single hard disk probably makes sense >> but I am not sure that making BFQ the default for systems equipped with >> one or more (SATA) SSDs is also a good idea. Especially for multi-socket >> systems since BFQ reintroduces a queue-wide lock. > > No, BFQ has no queue-wide lock. The very first change made to BFQ for > porting it to blk-mq was to remove the queue lock. Guided by Jens, I > replaced that lock with the exact, same scheduler lock used in > mq-deadline. It's easy to see that both mq-deadline and BFQ define a queue-wide lock. For mq-deadline its deadline_data.lock. For BFQ it's bfq_data.lock. 
That last lock serializes all bfq_dispatch_request() calls and hence reduces concurrency while processing I/O requests. From bfq_dispatch_request():

static struct request *bfq_dispatch_request(struct blk_mq_hw_ctx *hctx)
{
	struct bfq_data *bfqd = hctx->queue->elevator->elevator_data;
	[ ... ]
	spin_lock_irq(&bfqd->lock);
	[ ... ]
}

I think the above makes it very clear that bfqd->lock is queue-wide.

It is easy to understand why both I/O schedulers need a queue-wide lock: the only way to avoid race conditions when considering all pending I/O requests for scheduling decisions is to use a lock that covers all pending requests and hence that is queue-wide.

Bart.
> Il giorno 06 ott 2018, alle ore 18:20, Bart Van Assche <bvanassche@acm.org> ha scritto: > > On 10/5/18 11:46 PM, Paolo Valente wrote: >>> Il giorno 06 ott 2018, alle ore 05:12, Bart Van Assche <bvanassche@acm.org> ha scritto: >>> On 10/5/18 2:16 AM, Jan Kara wrote: >>>> On Thu 04-10-18 15:42:52, Bart Van Assche wrote: >>>>> What I think is missing is measurement results for BFQ on a system with >>>>> multiple CPU sockets and against a fast storage medium. Eliminating >>>>> the host lock from the SCSI core yielded a significant performance >>>>> improvement for such storage devices. Since the BFQ scheduler locks and >>>>> unlocks bfqd->lock for every dispatch operation it is very likely that BFQ >>>>> will slow down I/O for fast storage devices, even if their driver only >>>>> creates a single hardware queue. >>>> Well, I'm not sure why that is missing. I don't think anyone proposed to >>>> default to BFQ for such setup? Neither was anyone claiming that BFQ is >>>> better in such situation... The proposal has been: Default to BFQ for slow >>>> storage, leave it to deadline-mq otherwise. >>> >>> How do you define slow storage? The proposal at the start of this thread >>> was to make BFQ the default for all block devices that create a single >>> hardware queue. That includes all SATA storage since scsi-mq only creates >>> a single hardware queue when using the SATA protocol. The proposal to make >> BFQ the default for systems with a single hard disk probably makes sense >>> but I am not sure that making BFQ the default for systems equipped with >>> one or more (SATA) SSDs is also a good idea. Especially for multi-socket >>> systems since BFQ reintroduces a queue-wide lock. >> No, BFQ has no queue-wide lock. The very first change made to BFQ for >> porting it to blk-mq was to remove the queue lock. Guided by Jens, I >> replaced that lock with the exact, same scheduler lock used in >> mq-deadline. 
> > It's easy to see that both mq-deadline and BFQ define a queue-wide lock. For mq-deadline its deadline_data.lock. For BFQ it's bfq_data.lock. That last lock serializes all bfq_dispatch_request() calls and hence reduces concurrency while processing I/O requests. From bfq_dispatch_request(): > > static struct request *bfq_dispatch_request(struct blk_mq_hw_ctx *hctx) > { > struct bfq_data *bfqd = hctx->queue->elevator->elevator_data; > [ ... ] > spin_lock_irq(&bfqd->lock); > [ ... ] > } > > I think the above makes it very clear that bfqd->lock is queue-wide. > > It is easy to understand why both I/O schedulers need a queue-wide lock: the only way to avoid race conditions when considering all pending I/O requests for scheduling decisions is to use a lock that covers all pending requests and hence that is queue-wide. > Absolutely true. Queue lock is evidently a very general concept, and a lock on a scheduler is, in the end, a lock on its internal queue(s). But the queue lock removed by blk-mq is not that small per-scheduler lock, but the big, single-request-queue lock. The effects of the latter are probably almost one order of magnitude higher than those of a scheduler lock, even with a non-trivial scheduler like BFQ. As a simple concrete proof of this fact, consider the numbers that I already gave you, and that you can re-obtain in five minutes: on a laptop, BFQ may support up to 400KIOPS. Probably, even just with noop as I/O scheduler, the same PC cannot process so many IOPS with legacy blk (because of the single-request-queue lock). To sum up, in your argument you mixed two different locks. Anyway, you are going very deep in this issue. This takes you very close to what I'm currently working on (still in a design phase): increasing the parallel efficiency of BFQ, mainly by reducing the duration of the pieces of BFQ executed under its scheduler lock. But the goal of such a non-trivial improvement is to go from the current 400 KIOPS to more than one million of IOPS. 
This is an improvement that will most likely provide no benefits for probably 99% of the systems with single-queue devices. Those systems simply do not go beyond 300 KIOPS. So, I'm trying to first devote my limited single-person bandwidth (sorry, I couldn't resist the temptation to joke on this growing discussion on single-something issues :) ) to improvements that make BFQ better within its current hardware scope.

Thanks,
Paolo

> Bart.
diff --git a/block/elevator.c b/block/elevator.c
index e18ac68626e3..e5a2c39eee7b 100644
--- a/block/elevator.c
+++ b/block/elevator.c
@@ -948,13 +948,15 @@ int elevator_switch_mq(struct request_queue *q,
 }
 
 /*
- * For blk-mq devices, we default to using mq-deadline, if available, for single
- * queue devices. If deadline isn't available OR we have multiple queues,
- * default to "none".
+ * For blk-mq devices, we default to using:
+ * - "none" for multiqueue devices (nr_hw_queues != 1)
+ * - "bfq", if available, for single queue devices
+ * - "mq-deadline" if "bfq" is not available for single queue devices
+ * - "none" for single queue devices as well as last resort
  */
 int elevator_init_mq(struct request_queue *q)
 {
-	struct elevator_type *e;
+	struct elevator_type *e = NULL;
 	int err = 0;
 
 	if (q->nr_hw_queues != 1)
@@ -968,9 +970,14 @@ int elevator_init_mq(struct request_queue *q)
 	if (unlikely(q->elevator))
 		goto out_unlock;
 
-	e = elevator_get(q, "mq-deadline", false);
-	if (!e)
-		goto out_unlock;
+	if (IS_ENABLED(CONFIG_IOSCHED_BFQ))
+		e = elevator_get(q, "bfq", false);
+
+	if (!e) {
+		e = elevator_get(q, "mq-deadline", false);
+		if (!e)
+			goto out_unlock;
+	}
 
 	err = blk_mq_init_sched(q, e);
 	if (err)
This sets BFQ as the default scheduler for single queue
block devices (nr_hw_queues == 1) if it is available. This
affects notably MMC/SD-cards but notably also UBI and
the loopback device.

I have been running it for a while without any negative
effects on my pet systems and I want some wider testing
so let's throw it out there and see what people say.
Admittedly my use cases are limited.

I talked to Pavel a bit back and it turns out he has a
usecase for BFQ as well and I bet he also would like it
as default scheduler for that system (Pavel tell us more,
I don't remember what it was!)

Intuitively I could understand that maybe we want to
leave the loop device (possibly others? nbd? rbd?) as
"none", as it is probably relying on a scheduler on the
device below it, so I'm open to passing in a scheduler hint
from the respective subsystem in say struct blk_mq_tag_set.
However that makes for a bit of syntactic dissonance
with the struct member ".nr_hw_queues" (I wonder how
the loop device can have 1 "hardware queue"?) so
maybe we should in that case also rename that struct
member to ".nr_queues" fair and square before we start
making adjustments for treating queues differently whether
they are in hardware or actually not.

Cc: Pavel Machek <pavel@ucw.cz>
Cc: Paolo Valente <paolo.valente@linaro.org>
Cc: Jens Axboe <axboe@kernel.dk>
Cc: Ulf Hansson <ulf.hansson@linaro.org>
Cc: Richard Weinberger <richard@nod.at>
Cc: Artem Bityutskiy <dedekind1@gmail.com>
Cc: Adrian Hunter <adrian.hunter@intel.com>
Signed-off-by: Linus Walleij <linus.walleij@linaro.org>
---
 block/elevator.c | 21 ++++++++++++++-------
 1 file changed, 14 insertions(+), 7 deletions(-)
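The selection order that the patch describes for elevator_init_mq() can be restated as a small decision function. The sketch below is a userspace model only: `enum model_sched` and `model_elevator_init_mq()` are hypothetical names, and the two boolean parameters stand in for whether elevator_get() would find "bfq" and "mq-deadline" respectively.

```c
#include <assert.h>
#include <stdbool.h>

/* Hypothetical stand-ins for the scheduler names used in the patch. */
enum model_sched { MODEL_NONE, MODEL_MQ_DEADLINE, MODEL_BFQ };

/*
 * Models the default chosen by the patched elevator_init_mq():
 * - "none" for multiqueue devices (nr_hw_queues != 1)
 * - "bfq", if available, for single queue devices
 * - "mq-deadline" if "bfq" is not available
 * - "none" as the last resort
 */
enum model_sched model_elevator_init_mq(unsigned int nr_hw_queues,
					bool have_bfq,
					bool have_mq_deadline)
{
	assert(nr_hw_queues > 0);

	if (nr_hw_queues != 1)
		return MODEL_NONE;
	if (have_bfq)
		return MODEL_BFQ;
	if (have_mq_deadline)
		return MODEL_MQ_DEADLINE;
	return MODEL_NONE;
}
```

Note that this model, like the patch as posted, does not yet special-case zoned devices or stacked devices such as loop; those refinements were only discussed in the replies.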