Message ID | 20181015141059.26579-1-linus.walleij@linaro.org (mailing list archive)
---|---
State | New, archived
Series | [v2] block: BFQ default for single queue devices
> Il giorno 15 ott 2018, alle ore 16:10, Linus Walleij <linus.walleij@linaro.org> ha scritto: > > This sets BFQ as the default scheduler for single queue > block devices (nr_hw_queues == 1) if it is available. This > affects notably MMC/SD-cards but also UBI and the loopback > device. > > I have been running it for a while without any negative > effects on my pet systems and I want some wider testing > so let's throw it out there and see what people say. > Admittedly my use cases are limited. I need to keep this > patch around for my personal needs anyway. > > We take special care to avoid using BFQ on zoned devices > (in particular SMR, shingled magnetic recording devices) > as these currently require mq-deadline to group writes > together. > > I have opted against introducing any default scheduler > through Kconfig as the mq-deadline enforcement for > zoned devices has to be done at runtime anyways and > too many config options will make things confusing. > > My argument for setting a default policy in the kernel > as opposed to user space is the "reasonable defaults" > type, analogous to how we have one default CPU scheduling > policy (CFS) that make most sense for most tasks, and > how automatic process group scheduling happens in most > distributions without userspace involvement. The BFQ > scheduling policy makes most sense for single hardware > queue devices and many embedded systems will not have > the clever userspace tools (such as udev) to make an > educated choice of scheduling policy. Defaults should be > those that make most sense for the hardware. > > Cc: Pavel Machek <pavel@ucw.cz> > Cc: Paolo Valente <paolo.valente@linaro.org> > Cc: Jens Axboe <axboe@kernel.dk> > Cc: Ulf Hansson <ulf.hansson@linaro.org> > Cc: Richard Weinberger <richard@nod.at> > Cc: Adrian Hunter <adrian.hunter@intel.com> > Cc: Bart Van Assche <bvanassche@acm.org> > Cc: Jan Kara <jack@suse.cz> > Cc: Artem Bityutskiy <dedekind1@gmail.com> > Cc: Christoph Hellwig <hch@infradead.org> > Cc: Alan Cox <gnomes@lxorguk.ukuu.org.uk> > Cc: Mark Brown <broonie@kernel.org> > Cc: Damien Le Moal <Damien.LeMoal@wdc.com> > Cc: Johannes Thumshirn <jthumshirn@suse.de> > Cc: Oleksandr Natalenko <oleksandr@natalenko.name> > Cc: Jonathan Corbet <corbet@lwn.net> > Signed-off-by: Linus Walleij <linus.walleij@linaro.org> Unless someone reports (hopefully reproducible) regressions with common single-queue hardware, then Acked-by: Paolo Valente <paolo.valente@linaro.org> Thanks, Paolo > --- > ChangeLog v1->v2: > - Add a quirk so that devices with zoned writes are forced > to use the deadline scheduler, this is necessary since only > that scheduler supports zoned writes. > - There is a summary article in LWN for subscribers: > https://lwn.net/Articles/767987/ > --- > block/elevator.c | 22 ++++++++++++++++++---- > 1 file changed, 18 insertions(+), 4 deletions(-) > > diff --git a/block/elevator.c b/block/elevator.c > index 8fdcd64ae12e..6e6048ca3471 100644 > --- a/block/elevator.c > +++ b/block/elevator.c > @@ -948,13 +948,16 @@ int elevator_switch_mq(struct request_queue *q, > } > > /* > - * For blk-mq devices, we default to using mq-deadline, if available, for single > - * queue devices. If deadline isn't available OR we have multiple queues, > - * default to "none". 
> + * For blk-mq devices, we default to using: > + * - "none" for multiqueue devices (nr_hw_queues != 1) > + * - "bfq", if available, for single queue devices > + * - "mq-deadline" if "bfq" is not available for single queue devices > + * - "none" for single queue devices as well as last resort > */ > int elevator_init_mq(struct request_queue *q) > { > struct elevator_type *e; > + const char *policy; > int err = 0; > > if (q->nr_hw_queues != 1) > @@ -968,7 +971,18 @@ int elevator_init_mq(struct request_queue *q) > if (unlikely(q->elevator)) > goto out_unlock; > > - e = elevator_get(q, "mq-deadline", false); > + /* > + * Zoned devices must use a deadline scheduler because currently > + * that is the only scheduler respecting zoned writes. > + */ > + if (blk_queue_is_zoned(q)) > + policy = "mq-deadline"; > + else if (IS_ENABLED(CONFIG_IOSCHED_BFQ)) > + policy = "bfq"; > + else > + policy = "mq-deadline"; > + > + e = elevator_get(q, policy, false); > if (!e) > goto out_unlock; > > -- > 2.17.2 >
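A note for anyone testing the patch: which scheduler a queue actually ended up with can be inspected, and overridden at runtime, through the standard sysfs interface. A minimal illustration (the device name mmcblk0 is just an example, and the list printed depends on which schedulers the kernel was built with):

    $ cat /sys/block/mmcblk0/queue/scheduler
    mq-deadline kyber [bfq] none
    # echo mq-deadline > /sys/block/mmcblk0/queue/scheduler

The scheduler shown in brackets is the active one; writing another listed name switches the device over immediately.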
Hi. On 15.10.2018 16:10, Linus Walleij wrote: > This sets BFQ as the default scheduler for single queue > block devices (nr_hw_queues == 1) if it is available. This > affects notably MMC/SD-cards but also UBI and the loopback > device. > > I have been running it for a while without any negative > effects on my pet systems and I want some wider testing > so let's throw it out there and see what people say. > Admittedly my use cases are limited. I need to keep this > patch around for my personal needs anyway. > > We take special care to avoid using BFQ on zoned devices > (in particular SMR, shingled magnetic recording devices) > as these currently require mq-deadline to group writes > together. > > I have opted against introducing any default scheduler > through Kconfig as the mq-deadline enforcement for > zoned devices has to be done at runtime anyways and > too many config options will make things confusing. > > My argument for setting a default policy in the kernel > as opposed to user space is the "reasonable defaults" > type, analogous to how we have one default CPU scheduling > policy (CFS) that make most sense for most tasks, and > how automatic process group scheduling happens in most > distributions without userspace involvement. The BFQ > scheduling policy makes most sense for single hardware > queue devices and many embedded systems will not have > the clever userspace tools (such as udev) to make an > educated choice of scheduling policy. Defaults should be > those that make most sense for the hardware. > > Cc: Pavel Machek <pavel@ucw.cz> > Cc: Paolo Valente <paolo.valente@linaro.org> > Cc: Jens Axboe <axboe@kernel.dk> > Cc: Ulf Hansson <ulf.hansson@linaro.org> > Cc: Richard Weinberger <richard@nod.at> > Cc: Adrian Hunter <adrian.hunter@intel.com> > Cc: Bart Van Assche <bvanassche@acm.org> > Cc: Jan Kara <jack@suse.cz> > Cc: Artem Bityutskiy <dedekind1@gmail.com> > Cc: Christoph Hellwig <hch@infradead.org> > Cc: Alan Cox <gnomes@lxorguk.ukuu.org.uk> > Cc: Mark Brown <broonie@kernel.org> > Cc: Damien Le Moal <Damien.LeMoal@wdc.com> > Cc: Johannes Thumshirn <jthumshirn@suse.de> > Cc: Oleksandr Natalenko <oleksandr@natalenko.name> > Cc: Jonathan Corbet <corbet@lwn.net> > Signed-off-by: Linus Walleij <linus.walleij@linaro.org> > --- > ChangeLog v1->v2: > - Add a quirk so that devices with zoned writes are forced > to use the deadline scheduler, this is necessary since only > that scheduler supports zoned writes. > - There is a summary article in LWN for subscribers: > https://lwn.net/Articles/767987/ > --- > block/elevator.c | 22 ++++++++++++++++++---- > 1 file changed, 18 insertions(+), 4 deletions(-) > > diff --git a/block/elevator.c b/block/elevator.c > index 8fdcd64ae12e..6e6048ca3471 100644 > --- a/block/elevator.c > +++ b/block/elevator.c > @@ -948,13 +948,16 @@ int elevator_switch_mq(struct request_queue *q, > } > > /* > - * For blk-mq devices, we default to using mq-deadline, if available, > for single > - * queue devices. If deadline isn't available OR we have multiple > queues, > - * default to "none". 
> + * For blk-mq devices, we default to using: > + * - "none" for multiqueue devices (nr_hw_queues != 1) > + * - "bfq", if available, for single queue devices > + * - "mq-deadline" if "bfq" is not available for single queue devices > + * - "none" for single queue devices as well as last resort > */ > int elevator_init_mq(struct request_queue *q) > { > struct elevator_type *e; > + const char *policy; > int err = 0; > > if (q->nr_hw_queues != 1) > @@ -968,7 +971,18 @@ int elevator_init_mq(struct request_queue *q) > if (unlikely(q->elevator)) > goto out_unlock; > > - e = elevator_get(q, "mq-deadline", false); > + /* > + * Zoned devices must use a deadline scheduler because currently > + * that is the only scheduler respecting zoned writes. > + */ > + if (blk_queue_is_zoned(q)) > + policy = "mq-deadline"; > + else if (IS_ENABLED(CONFIG_IOSCHED_BFQ)) > + policy = "bfq"; > + else > + policy = "mq-deadline"; If more rules are needed in the future, shall we just add extra ifs, or would it be better to craft some struct/table now, plus a policy search helper? > + > + e = elevator_get(q, policy, false); > if (!e) > goto out_unlock;
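As a concrete illustration of the struct/table idea raised in the question above, a minimal sketch follows. None of this is in the posted patch; the rule table and the elevator_default_policy() helper are hypothetical names, shown only to make the alternative easier to compare against the if-chain:

    struct elv_default_rule {
            bool (*applies)(struct request_queue *q);
            const char *policy;
    };

    static bool elv_rule_zoned(struct request_queue *q)
    {
            return blk_queue_is_zoned(q);
    }

    static bool elv_rule_bfq_built(struct request_queue *q)
    {
            return IS_ENABLED(CONFIG_IOSCHED_BFQ);
    }

    /* First matching rule wins; the terminating entry doubles as the fallback. */
    static const struct elv_default_rule elv_default_rules[] = {
            { elv_rule_zoned,     "mq-deadline" },  /* only zone-aware choice */
            { elv_rule_bfq_built, "bfq" },          /* preferred for single queue */
            { NULL,               "mq-deadline" },
    };

    static const char *elevator_default_policy(struct request_queue *q)
    {
            const struct elv_default_rule *r;

            for (r = elv_default_rules; r->applies; r++)
                    if (r->applies(q))
                            return r->policy;
            return r->policy;
    }

elevator_init_mq() would then call e = elevator_get(q, elevator_default_policy(q), false). Whether the extra indirection is worth it for three rules is exactly the trade-off discussed below.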
On Mon, 2018-10-15 at 16:10 +0200, Linus Walleij wrote: > + * For blk-mq devices, we default to using: > + * - "none" for multiqueue devices (nr_hw_queues != 1) > + * - "bfq", if available, for single queue devices > + * - "mq-deadline" if "bfq" is not available for single queue devices > + * - "none" for single queue devices as well as last resort For SATA SSDs nr_hw_queues == 1 so this patch will also affect these SSDs. Since this patch is an attempt to improve performance, I'd like to see measurement data for one or more recent SATA SSDs before a decision is taken about what to do with this patch. Thanks, Bart.
On 10/15/18 8:10 AM, Linus Walleij wrote: > This sets BFQ as the default scheduler for single queue > block devices (nr_hw_queues == 1) if it is available. This > affects notably MMC/SD-cards but also UBI and the loopback > device. > > I have been running it for a while without any negative > effects on my pet systems and I want some wider testing > so let's throw it out there and see what people say. > Admittedly my use cases are limited. I need to keep this > patch around for my personal needs anyway. > > We take special care to avoid using BFQ on zoned devices > (in particular SMR, shingled magnetic recording devices) > as these currently require mq-deadline to group writes > together. > > I have opted against introducing any default scheduler > through Kconfig as the mq-deadline enforcement for > zoned devices has to be done at runtime anyways and > too many config options will make things confusing. > > My argument for setting a default policy in the kernel > as opposed to user space is the "reasonable defaults" > type, analogous to how we have one default CPU scheduling > policy (CFS) that make most sense for most tasks, and > how automatic process group scheduling happens in most > distributions without userspace involvement. The BFQ > scheduling policy makes most sense for single hardware > queue devices and many embedded systems will not have > the clever userspace tools (such as udev) to make an > educated choice of scheduling policy. Defaults should be > those that make most sense for the hardware. I still don't like this. There are going to be tons of cases where the single queue device is some hw raid setup or similar, where performance is going to be much worse with BFQ than it is with mq-deadline, for instance. That's just one case. This kind of policy does not belong in the kernel, at least not in the current form. If we had some sort of "enable best options for a desktop" then it could fall under that umbrella.
> Il giorno 15 ott 2018, alle ore 17:39, Jens Axboe <axboe@kernel.dk> ha scritto: > > On 10/15/18 8:10 AM, Linus Walleij wrote: >> This sets BFQ as the default scheduler for single queue >> block devices (nr_hw_queues == 1) if it is available. This >> affects notably MMC/SD-cards but also UBI and the loopback >> device. >> >> I have been running it for a while without any negative >> effects on my pet systems and I want some wider testing >> so let's throw it out there and see what people say. >> Admittedly my use cases are limited. I need to keep this >> patch around for my personal needs anyway. >> >> We take special care to avoid using BFQ on zoned devices >> (in particular SMR, shingled magnetic recording devices) >> as these currently require mq-deadline to group writes >> together. >> >> I have opted against introducing any default scheduler >> through Kconfig as the mq-deadline enforcement for >> zoned devices has to be done at runtime anyways and >> too many config options will make things confusing. >> >> My argument for setting a default policy in the kernel >> as opposed to user space is the "reasonable defaults" >> type, analogous to how we have one default CPU scheduling >> policy (CFS) that make most sense for most tasks, and >> how automatic process group scheduling happens in most >> distributions without userspace involvement. The BFQ >> scheduling policy makes most sense for single hardware >> queue devices and many embedded systems will not have >> the clever userspace tools (such as udev) to make an >> educated choice of scheduling policy. Defaults should be >> those that make most sense for the hardware. > > I still don't like this. There are going to be tons of > cases where the single queue device is some hw raid setup > or similar, where performance is going to be much worse with > BFQ than it is with mq-deadline, for instance. That's just > one case. > Hi Jens, in my RAID tests bfq performed as well as in non-RAID tests. Probably you are referring to the fact that, in a RAID configuration, IOPS can become very high. But, if that is the case, then the response to your objections already emerged in the previous thread. Let me sum it up again. I tested bfq on virtually every device in the range from a few hundred IOPS to 50-100 KIOPS. Then, through the public script I already mentioned, I found the maximum number of IOPS that bfq can handle: about 400K with a commodity CPU. In particular, in all my tests with real hardware, bfq's performance - is not even comparable to that of any of the other schedulers, in terms of responsiveness, latency for real-time applications, ability to provide strong bandwidth guarantees, and ability to boost throughput while guaranteeing bandwidths; - is a little worse than the other schedulers for only one test, on only some hardware: total throughput with random reads, where it may lose up to 10-15% of throughput. Of course, the schedulers that reach a higher throughput leave the machine unusable during the test. So I really cannot see a reason why bfq could do worse than any of these other schedulers for some single-queue device (conservatively) below 300 KIOPS. Finally, since, AFAICT, single-queue devices doing 400+ KIOPS are probably less than 1% of all the single-queue storage around (USB drives, HDDs, eMMC, standard SSDs, ...), by sticking to mq-deadline we are sacrificing 99% of the hardware to help 1% of the hardware, for one kind of test case. > This kind of policy does not belong in the kernel, at least > not in the current form. > If we had some sort of "enable best > options for a desktop" then it could fall under that umbrella. > I don't think bfq can be considered a scheduler only for desktops any longer. Thanks, Paolo > -- > Jens Axboe
> Il giorno 15 ott 2018, alle ore 17:02, Bart Van Assche <bvanassche@acm.org> ha scritto: > > On Mon, 2018-10-15 at 16:10 +0200, Linus Walleij wrote: >> + * For blk-mq devices, we default to using: >> + * - "none" for multiqueue devices (nr_hw_queues != 1) >> + * - "bfq", if available, for single queue devices >> + * - "mq-deadline" if "bfq" is not available for single queue devices >> + * - "none" for single queue devices as well as last resort > > For SATA SSDs nr_hw_queues == 1 so this patch will also affect these SSDs. > Since this patch is an attempt to improve performance, I'd like to see > measurement data for one or more recent SATA SSDs before a decision is > taken about what to do with this patch. > Hi Bart, as I just wrote to Jens, I don't think we need this test any longer. To save you one hop, I'll paste my reply to Jens below. Anyway, it is very easy to do the tests you ask for: - take a kernel containing the last bfq commits, such as for-next - do, e.g., git clone https://github.com/Algodev-github/S.git cd S/run_multiple_benchmarks sudo ./run_main_benchmarks.sh "throughput replayed-startup" "bfq none" - compare results Of course, do not do it for multi-queue devices, or for single-queue devices on steroids that do 400-500 KIOPS. I'll see if I can convince someone to repeat these tests with a recent SSD. And here is again my reply to Jens, which I think holds for your repeated objection too. I tested bfq on virtually every device in the range from a few hundred IOPS to 50-100 KIOPS. Then, through the public script I already mentioned, I found the maximum number of IOPS that bfq can handle: about 400K with a commodity CPU. In particular, in all my tests with real hardware, bfq performance - is not even comparable to that of any of the other schedulers, in terms of responsiveness, latency for real-time applications, ability to provide strong bandwidth guarantees, ability to boost throughput while guaranteeing bandwidths; - is a little worse than the other schedulers for only one test, on only some hardware: total throughput with random reads, where it may lose up to 10-15% of throughput. Of course, the schedulers that reach a higher throughput leave the machine unusable during the test. So I really cannot see a reason why bfq could do worse than any of these other schedulers for some single-queue device (conservatively) below 300 KIOPS. Finally, since, AFAICT, single-queue devices doing 400+ KIOPS are probably less than 1% of all the single-queue storage around (USB drives, HDDs, eMMC, standard SSDs, ...), by sticking to mq-deadline we are sacrificing 99% of the hardware to help 1% of the hardware for one kind of test case. Thanks, Paolo > Thanks, > > Bart. >
On 10/15/18 12:26 PM, Paolo Valente wrote: > > >> Il giorno 15 ott 2018, alle ore 17:39, Jens Axboe <axboe@kernel.dk> ha scritto: >> >> On 10/15/18 8:10 AM, Linus Walleij wrote: >>> This sets BFQ as the default scheduler for single queue >>> block devices (nr_hw_queues == 1) if it is available. This >>> affects notably MMC/SD-cards but also UBI and the loopback >>> device. >>> >>> I have been running it for a while without any negative >>> effects on my pet systems and I want some wider testing >>> so let's throw it out there and see what people say. >>> Admittedly my use cases are limited. I need to keep this >>> patch around for my personal needs anyway. >>> >>> We take special care to avoid using BFQ on zoned devices >>> (in particular SMR, shingled magnetic recording devices) >>> as these currently require mq-deadline to group writes >>> together. >>> >>> I have opted against introducing any default scheduler >>> through Kconfig as the mq-deadline enforcement for >>> zoned devices has to be done at runtime anyways and >>> too many config options will make things confusing. >>> >>> My argument for setting a default policy in the kernel >>> as opposed to user space is the "reasonable defaults" >>> type, analogous to how we have one default CPU scheduling >>> policy (CFS) that make most sense for most tasks, and >>> how automatic process group scheduling happens in most >>> distributions without userspace involvement. The BFQ >>> scheduling policy makes most sense for single hardware >>> queue devices and many embedded systems will not have >>> the clever userspace tools (such as udev) to make an >>> educated choice of scheduling policy. Defaults should be >>> those that make most sense for the hardware. >> >> I still don't like this. There are going to be tons of >> cases where the single queue device is some hw raid setup >> or similar, where performance is going to be much worse with >> BFQ than it is with mq-deadline, for instance. That's just >> one case. >> > > Hi Jens, > in my RAID tests bfq performed as well as in non-RAID tests. Probably > you refer to the fact that, in a RAID configuration, IOPS can become > very high. But, if that is the case, then the response to your > objections already emerged in the previous thread. Let me sum it up > again. > > I tested bfq on virtually every device in the range from few hundred > of IOPS to 50-100KIOPS. Then, through the public script I already > mentioned, I found the maximum number of IOPS that bfq can handle: > about 400K with a commodity CPU. > > In particular, in all my tests with real hardware, bfq > - is not even comparable to that of any of the other scheduler, in > terms of responsiveness, latency for real-time applications, ability > to provide strong bandwidth guarantees, ability to boost throughput > while guaranteeing bandwidths; > - is a little worse than the other scheduler for only one test, on > only some hardware: total throughput with random reads, were it may > lose up to 10-15% of throughput. Of course, the scheduler that reach > a higher throughput leave the machine unusable during the test. > > So I really cannot see a reason why bfq could do worse than any of > these other schedulers for some single-queue device (conservatively) > below 300KIOPS. 
> > Finally, since, AFAICT, single-queue devices doing 400+ KIOPS are > probably less than 1% of all the single-queue storage around (USB > drives, HDDs, eMMC, standard SSDs, ...), by sticking to mq-deadline we > are sacrificing 99% of the hardware, to help 1% of the hardware, for > one kind of test cases. I should have been more clear - I'm not worried about IOPS overhead, I'm worried about scheduling decisions that lower performance on (for instance) raid composed of many drives (rotational or otherwise). If you have actual data (on what hardware, and what kind of tests) to disprove that worry, then that's great, and I'd love to see that.
> Il giorno 15 ott 2018, alle ore 21:26, Jens Axboe <axboe@kernel.dk> ha scritto: > > On 10/15/18 12:26 PM, Paolo Valente wrote: >> >> >>> Il giorno 15 ott 2018, alle ore 17:39, Jens Axboe <axboe@kernel.dk> ha scritto: >>> >>> On 10/15/18 8:10 AM, Linus Walleij wrote: >>>> This sets BFQ as the default scheduler for single queue >>>> block devices (nr_hw_queues == 1) if it is available. This >>>> affects notably MMC/SD-cards but also UBI and the loopback >>>> device. >>>> >>>> I have been running it for a while without any negative >>>> effects on my pet systems and I want some wider testing >>>> so let's throw it out there and see what people say. >>>> Admittedly my use cases are limited. I need to keep this >>>> patch around for my personal needs anyway. >>>> >>>> We take special care to avoid using BFQ on zoned devices >>>> (in particular SMR, shingled magnetic recording devices) >>>> as these currently require mq-deadline to group writes >>>> together. >>>> >>>> I have opted against introducing any default scheduler >>>> through Kconfig as the mq-deadline enforcement for >>>> zoned devices has to be done at runtime anyways and >>>> too many config options will make things confusing. >>>> >>>> My argument for setting a default policy in the kernel >>>> as opposed to user space is the "reasonable defaults" >>>> type, analogous to how we have one default CPU scheduling >>>> policy (CFS) that make most sense for most tasks, and >>>> how automatic process group scheduling happens in most >>>> distributions without userspace involvement. The BFQ >>>> scheduling policy makes most sense for single hardware >>>> queue devices and many embedded systems will not have >>>> the clever userspace tools (such as udev) to make an >>>> educated choice of scheduling policy. Defaults should be >>>> those that make most sense for the hardware. >>> >>> I still don't like this. There are going to be tons of >>> cases where the single queue device is some hw raid setup >>> or similar, where performance is going to be much worse with >>> BFQ than it is with mq-deadline, for instance. That's just >>> one case. >>> >> >> Hi Jens, >> in my RAID tests bfq performed as well as in non-RAID tests. Probably >> you refer to the fact that, in a RAID configuration, IOPS can become >> very high. But, if that is the case, then the response to your >> objections already emerged in the previous thread. Let me sum it up >> again. >> >> I tested bfq on virtually every device in the range from few hundred >> of IOPS to 50-100KIOPS. Then, through the public script I already >> mentioned, I found the maximum number of IOPS that bfq can handle: >> about 400K with a commodity CPU. >> >> In particular, in all my tests with real hardware, bfq >> - is not even comparable to that of any of the other scheduler, in >> terms of responsiveness, latency for real-time applications, ability >> to provide strong bandwidth guarantees, ability to boost throughput >> while guaranteeing bandwidths; >> - is a little worse than the other scheduler for only one test, on >> only some hardware: total throughput with random reads, were it may >> lose up to 10-15% of throughput. Of course, the scheduler that reach >> a higher throughput leave the machine unusable during the test. >> >> So I really cannot see a reason why bfq could do worse than any of >> these other schedulers for some single-queue device (conservatively) >> below 300KIOPS. 
>> >> Finally, since, AFAICT, single-queue devices doing 400+ KIOPS are >> probably less than 1% of all the single-queue storage around (USB >> drives, HDDs, eMMC, standard SSDs, ...), by sticking to mq-deadline we >> are sacrificing 99% of the hardware, to help 1% of the hardware, for >> one kind of test cases. > > I should have been more clear - I'm not worried about IOPS overhead, > I'm worried about scheduling decisions that lower performance on > (for instance) raid composed of many drives (rotational or otherwise). > > If you have actual data (on what hardware, and what kind of tests) > to disprove that worry, then that's great, and I'd love to see that. > Here are some old results with a very simple configuration: http://algo.ing.unimo.it/people/paolo/disk_sched/old-results/4.4.0-v7r11/ http://algo.ing.unimo.it/people/paolo/disk_sched/old-results/3.14.0-v7r3/ http://algo.ing.unimo.it/people/paolo/disk_sched/old-results/3.13.0-v7r2/ Then I stopped repeating tests that always yielded the same good results. As for more professional systems, a well-known company doing real-time packet-traffic dumping asked me to modify bfq so as to guarantee lossless data writing also during queries. The involved box had a RAID reaching a few Gbps, and everything worked well. Anyway, if you have specific issues in mind, I can check more deeply. Thanks, Paolo > > -- > Jens Axboe
On 15 October 2018 at 16:10, Linus Walleij <linus.walleij@linaro.org> wrote: > This sets BFQ as the default scheduler for single queue > block devices (nr_hw_queues == 1) if it is available. This > affects notably MMC/SD-cards but also UBI and the loopback > device. > > I have been running it for a while without any negative > effects on my pet systems and I want some wider testing > so let's throw it out there and see what people say. > Admittedly my use cases are limited. I need to keep this > patch around for my personal needs anyway. > > We take special care to avoid using BFQ on zoned devices > (in particular SMR, shingled magnetic recording devices) > as these currently require mq-deadline to group writes > together. > > I have opted against introducing any default scheduler > through Kconfig as the mq-deadline enforcement for > zoned devices has to be done at runtime anyways and > too many config options will make things confusing. > > My argument for setting a default policy in the kernel > as opposed to user space is the "reasonable defaults" > type, analogous to how we have one default CPU scheduling > policy (CFS) that make most sense for most tasks, and > how automatic process group scheduling happens in most > distributions without userspace involvement. The BFQ > scheduling policy makes most sense for single hardware > queue devices and many embedded systems will not have > the clever userspace tools (such as udev) to make an > educated choice of scheduling policy. Defaults should be > those that make most sense for the hardware. As already stated for v1, this makes perfect sense to me, thanks for posting it! I do understand there is some pushback from Bart and Jens, around how to move this forward. However, let's hope they get convinced to try this out. When it comes to potential "performance" regressions, I am sure Paolo is standing-by to help out with BFQ changes, if needed. Moreover, we can always do a simple revert in worst case scenario, especially since the change is really limited. > > Cc: Pavel Machek <pavel@ucw.cz> > Cc: Paolo Valente <paolo.valente@linaro.org> > Cc: Jens Axboe <axboe@kernel.dk> > Cc: Ulf Hansson <ulf.hansson@linaro.org> > Cc: Richard Weinberger <richard@nod.at> > Cc: Adrian Hunter <adrian.hunter@intel.com> > Cc: Bart Van Assche <bvanassche@acm.org> > Cc: Jan Kara <jack@suse.cz> > Cc: Artem Bityutskiy <dedekind1@gmail.com> > Cc: Christoph Hellwig <hch@infradead.org> > Cc: Alan Cox <gnomes@lxorguk.ukuu.org.uk> > Cc: Mark Brown <broonie@kernel.org> > Cc: Damien Le Moal <Damien.LeMoal@wdc.com> > Cc: Johannes Thumshirn <jthumshirn@suse.de> > Cc: Oleksandr Natalenko <oleksandr@natalenko.name> > Cc: Jonathan Corbet <corbet@lwn.net> > Signed-off-by: Linus Walleij <linus.walleij@linaro.org> So FWIW: Reviewed-by: Ulf Hansson <ulf.hansson@linaro.org> Kind regards Uffe > --- > ChangeLog v1->v2: > - Add a quirk so that devices with zoned writes are forced > to use the deadline scheduler, this is necessary since only > that scheduler supports zoned writes. 
> - There is a summary article in LWN for subscribers: > https://lwn.net/Articles/767987/ > --- > block/elevator.c | 22 ++++++++++++++++++---- > 1 file changed, 18 insertions(+), 4 deletions(-) > > diff --git a/block/elevator.c b/block/elevator.c > index 8fdcd64ae12e..6e6048ca3471 100644 > --- a/block/elevator.c > +++ b/block/elevator.c > @@ -948,13 +948,16 @@ int elevator_switch_mq(struct request_queue *q, > } > > /* > - * For blk-mq devices, we default to using mq-deadline, if available, for single > - * queue devices. If deadline isn't available OR we have multiple queues, > - * default to "none". > + * For blk-mq devices, we default to using: > + * - "none" for multiqueue devices (nr_hw_queues != 1) > + * - "bfq", if available, for single queue devices > + * - "mq-deadline" if "bfq" is not available for single queue devices > + * - "none" for single queue devices as well as last resort > */ > int elevator_init_mq(struct request_queue *q) > { > struct elevator_type *e; > + const char *policy; > int err = 0; > > if (q->nr_hw_queues != 1) > @@ -968,7 +971,18 @@ int elevator_init_mq(struct request_queue *q) > if (unlikely(q->elevator)) > goto out_unlock; > > - e = elevator_get(q, "mq-deadline", false); > + /* > + * Zoned devices must use a deadline scheduler because currently > + * that is the only scheduler respecting zoned writes. > + */ > + if (blk_queue_is_zoned(q)) > + policy = "mq-deadline"; > + else if (IS_ENABLED(CONFIG_IOSCHED_BFQ)) > + policy = "bfq"; > + else > + policy = "mq-deadline"; > + > + e = elevator_get(q, policy, false); > if (!e) > goto out_unlock; > > -- > 2.17.2 >
On 10/15/18 5:02 PM, Bart Van Assche wrote: > On Mon, 2018-10-15 at 16:10 +0200, Linus Walleij wrote: >> + * For blk-mq devices, we default to using: >> + * - "none" for multiqueue devices (nr_hw_queues != 1) >> + * - "bfq", if available, for single queue devices >> + * - "mq-deadline" if "bfq" is not available for single queue devices >> + * - "none" for single queue devices as well as last resort > > For SATA SSDs nr_hw_queues == 1 so this patch will also affect these SSDs. > Since this patch is an attempt to improve performance, I'd like to see > measurement data for one or more recent SATA SSDs before a decision is > taken about what to do with this patch. > > Thanks, > > Bart. > Hi, although these tests should be run on single-queue devices, I tried to run them on a high-performance NVMe device. Imho, if results are good on such a "difficult to deal with" multi-queue device, they should be good enough also on a "simpler" single-queue storage device.

Testbed specs:
kernel = 4.18.0 (from bfq dev branch [1], where bfq already contains also the commits that will be available from 4.20)
fs = ext4
drive = Samsung 960 Pro NVMe M.2 512 GB SSD

Device data sheet specs state that under random IO:
* QD 1, thread 1: read = 14 kIOPS, write = 50 kIOPS
* QD 32, thread 4: read = write = 330 kIOPS

What follows is a results summary; on request I can provide the full results. The workload notation (e.g. 5r5w-seq) means: - num_readers (5r) - num_writers (5w) - sequential_io or random_io (-seq)

# replayed gnome-terminal startup time (lower is better)
workload    bfq-mq [s]   none [s]   % gain
--------    ----------   --------   ------
10r-seq     0.3725       2.79       86.65
5r5w-seq    0.9725       5.53       82.41

# throughput (higher is better)
workload    bfq-mq [mb/s]   none [mb/s]   % gain
---------   -------------   -----------   -------
10r-rand    394.806         429.735       -8.128
10r-seq     1387.63         1431.81       -3.086
1r-seq      838.13          798.872       4.914
5r5w-rand   1118.12         1297.46       -13.822
5r5w-seq    1187            1313.8        -9.651

Thanks, Federico

[1] https://github.com/Algodev-github/bfq-mq/commits/bfq-mq
> Il giorno 16 ott 2018, alle ore 18:14, Federico Motta <federico@willer.it> ha scritto: > > On 10/15/18 5:02 PM, Bart Van Assche wrote: >> On Mon, 2018-10-15 at 16:10 +0200, Linus Walleij wrote: >>> + * For blk-mq devices, we default to using: >>> + * - "none" for multiqueue devices (nr_hw_queues != 1) >>> + * - "bfq", if available, for single queue devices >>> + * - "mq-deadline" if "bfq" is not available for single queue devices >>> + * - "none" for single queue devices as well as last resort >> >> For SATA SSDs nr_hw_queues == 1 so this patch will also affect these SSDs. >> Since this patch is an attempt to improve performance, I'd like to see >> measurement data for one or more recent SATA SSDs before a decision is >> taken about what to do with this patch. >> >> Thanks, >> >> Bart. >> > > Hi, > although these tests should be run for single-queue devices, I tried to > run them on an NVMe high-performance device. Imho if results are good > in such a "difficult to deal with" multi-queue device, they should be > good enough also in a "simpler" single-queue storage device.. > > Testbed specs: > kernel = 4.18.0 (from bfq dev branch [1], where bfq already contains > also the commits that will be available from 4.20) > fs = ext4 > drive = ssd samsung 960 pro NVMe m.2 512gb > > Device data sheet specs state that under random IO: > * QD 1 thread 1 > * read = 14 kIOPS > * write = 50 kIOPS > * QD 32 thread 4 > * read = write = 330 kIOPS > > What follows is a results summary; under requests I can give all > results. The workload notation (e.g. 5r5w-seq) means: > - num_readers (5r) > - num_writers (5w) > - sequential_io or random_io (-seq) > > > # replayed gnome-terminal startup time (lower is better) > workload bfq-mq [s] none [s] % gain > -------- ---------- -------- ------ > 10r-seq 0.3725 2.79 86.65 > 5r5w-seq 0.9725 5.53 82.41 > > # throughput (higher is better) > workload bfq-mq [mb/s] none [mb/s] % gain > --------- ------------- ----------- ------- > 10r-rand 394.806 429.735 -8.128 > 10r-seq 1387.63 1431.81 -3.086 > 1r-seq 838.13 798.872 4.914 > 5r5w-rand 1118.12 1297.46 -13.822 > 5r5w-seq 1187 1313.8 -9.651 > A little unexpectedly for me, throughput loss for random I/O is even lower than what I have obtained with my nasty SATA SSD (and reported in my public results). I didn't expect that little loss with sequential parallel reads. Probably, when going multiqueue, there are changes I haven't even thought about (I have never even tested bfq on a multi-queue device). Thanks, Paolo > Thanks, > Federico > > [1] https://github.com/Algodev-github/bfq-mq/commits/bfq-mq
On 10/15/18 1:44 PM, Paolo Valente wrote: > > >> Il giorno 15 ott 2018, alle ore 21:26, Jens Axboe <axboe@kernel.dk> ha scritto: >> >> On 10/15/18 12:26 PM, Paolo Valente wrote: >>> >>> >>>> Il giorno 15 ott 2018, alle ore 17:39, Jens Axboe <axboe@kernel.dk> ha scritto: >>>> >>>> On 10/15/18 8:10 AM, Linus Walleij wrote: >>>>> This sets BFQ as the default scheduler for single queue >>>>> block devices (nr_hw_queues == 1) if it is available. This >>>>> affects notably MMC/SD-cards but also UBI and the loopback >>>>> device. >>>>> >>>>> I have been running it for a while without any negative >>>>> effects on my pet systems and I want some wider testing >>>>> so let's throw it out there and see what people say. >>>>> Admittedly my use cases are limited. I need to keep this >>>>> patch around for my personal needs anyway. >>>>> >>>>> We take special care to avoid using BFQ on zoned devices >>>>> (in particular SMR, shingled magnetic recording devices) >>>>> as these currently require mq-deadline to group writes >>>>> together. >>>>> >>>>> I have opted against introducing any default scheduler >>>>> through Kconfig as the mq-deadline enforcement for >>>>> zoned devices has to be done at runtime anyways and >>>>> too many config options will make things confusing. >>>>> >>>>> My argument for setting a default policy in the kernel >>>>> as opposed to user space is the "reasonable defaults" >>>>> type, analogous to how we have one default CPU scheduling >>>>> policy (CFS) that make most sense for most tasks, and >>>>> how automatic process group scheduling happens in most >>>>> distributions without userspace involvement. The BFQ >>>>> scheduling policy makes most sense for single hardware >>>>> queue devices and many embedded systems will not have >>>>> the clever userspace tools (such as udev) to make an >>>>> educated choice of scheduling policy. Defaults should be >>>>> those that make most sense for the hardware. >>>> >>>> I still don't like this. There are going to be tons of >>>> cases where the single queue device is some hw raid setup >>>> or similar, where performance is going to be much worse with >>>> BFQ than it is with mq-deadline, for instance. That's just >>>> one case. >>>> >>> >>> Hi Jens, >>> in my RAID tests bfq performed as well as in non-RAID tests. Probably >>> you refer to the fact that, in a RAID configuration, IOPS can become >>> very high. But, if that is the case, then the response to your >>> objections already emerged in the previous thread. Let me sum it up >>> again. >>> >>> I tested bfq on virtually every device in the range from few hundred >>> of IOPS to 50-100KIOPS. Then, through the public script I already >>> mentioned, I found the maximum number of IOPS that bfq can handle: >>> about 400K with a commodity CPU. >>> >>> In particular, in all my tests with real hardware, bfq >>> - is not even comparable to that of any of the other scheduler, in >>> terms of responsiveness, latency for real-time applications, ability >>> to provide strong bandwidth guarantees, ability to boost throughput >>> while guaranteeing bandwidths; >>> - is a little worse than the other scheduler for only one test, on >>> only some hardware: total throughput with random reads, were it may >>> lose up to 10-15% of throughput. Of course, the scheduler that reach >>> a higher throughput leave the machine unusable during the test. 
>>> >>> So I really cannot see a reason why bfq could do worse than any of >>> these other schedulers for some single-queue device (conservatively) >>> below 300KIOPS. >>> >>> Finally, since, AFAICT, single-queue devices doing 400+ KIOPS are >>> probably less than 1% of all the single-queue storage around (USB >>> drives, HDDs, eMMC, standard SSDs, ...), by sticking to mq-deadline we >>> are sacrificing 99% of the hardware, to help 1% of the hardware, for >>> one kind of test cases. >> >> I should have been more clear - I'm not worried about IOPS overhead, >> I'm worried about scheduling decisions that lower performance on >> (for instance) raid composed of many drives (rotational or otherwise). >> >> If you have actual data (on what hardware, and what kind of tests) >> to disprove that worry, then that's great, and I'd love to see that. >> > > Here are some old results with a very simple configuration: > http://algo.ing.unimo.it/people/paolo/disk_sched/old-results/4.4.0-v7r11/ > http://algo.ing.unimo.it/people/paolo/disk_sched/old-results/3.14.0-v7r3/ > http://algo.ing.unimo.it/people/paolo/disk_sched/old-results/3.13.0-v7r2/ > > Then I stopped repeating tests that always yielded the same good results. > > As for more professional systems, a well-known company doing > real-time packet-traffic dumping asked me to modify bfq so as to > guarantee lossless data writing also during queries. The involved box > had a RAID reaching a few Gbps, and everything worked well. > > Anyway, if you have specific issues in mind, I can check more deeply. Do you have anything more recent? All of these predate the current code (by a lot), and aren't even mq. I'm mostly just interested in a plain fast NVMe device, and a big box hardware RAID setup with a ton of drives. I do still think that this should be going through the distros; they need to be the ones driving this, as they will ultimately be the ones getting customer reports on regressions. The qual/test cycle they do is useful for this. In mainline, if we make a change like this, we'll figure out if it worked many releases down the line.
> Il giorno 15 ott 2018, alle ore 20:34, Paolo Valente <paolo.valente@linaro.org> ha scritto: > > > >> Il giorno 15 ott 2018, alle ore 17:02, Bart Van Assche <bvanassche@acm.org> ha scritto: >> >> On Mon, 2018-10-15 at 16:10 +0200, Linus Walleij wrote: >>> + * For blk-mq devices, we default to using: >>> + * - "none" for multiqueue devices (nr_hw_queues != 1) >>> + * - "bfq", if available, for single queue devices >>> + * - "mq-deadline" if "bfq" is not available for single queue devices >>> + * - "none" for single queue devices as well as last resort >> >> For SATA SSDs nr_hw_queues == 1 so this patch will also affect these SSDs. >> Since this patch is an attempt to improve performance, I'd like to see >> measurement data for one or more recent SATA SSDs before a decision is >> taken about what to do with this patch. >> > > Hi Bart, > as I just wrote to Jens I don't think we need this test any longer. > To save you one hope, I'll paste my reply to Jens below. > > Anyway, it is very easy to do the tests you ask: > - take a kernel containing the last bfq commits, such as for-next > - do, e.g., > git clone https://github.com/Algodev-github/S.git > cd S/run_multiple_benchmarks > sudo ./run_main_benchmarks.sh "throughput replayed-startup" "bfq none" > - compare results > Two things: 1) By mistake, I put 'none' in the last command line above, but it should be mq-deadline: sudo ./run_main_benchmarks.sh "throughput replayed-startup" "bfq mq-deadline" 2) If you are worried about wearing your device with writes, then just append 'raw' to the last command line. So: sudo ./run_main_benchmarks.sh "throughput replayed-startup" "bfq mq-deadline" raw 'raw' means: "don't even create files for the background traffic, but just read raw sectors". Thanks, Paolo > Of course, do not do it for multi-queue devices or single-queues > devices, on steroids, that do 400-500 KIOPS. > > I'll see if I can convince someone to repeat these tests with a recent > SSD. > > And here is again my reply to Jens, which I think holds for your repeated > objection too. > > I tested bfq on virtually every device in the range from few hundred > of IOPS to 50-100KIOPS. Then, through the public script I already > mentioned, I found the maximum number of IOPS that bfq can handle: > about 400K with a commodity CPU. > > In particular, in all my tests with real hardware, bfq performance > - is not even comparable to that of any of the other scheduler, in > terms of responsiveness, latency for real-time applications, ability > to provide strong bandwidth guarantees, ability to boost throughput > while guaranteeing bandwidths; > - is a little worse than the other schedulers for only one test, on > only some hardware: total throughput with random reads, were it may > lose up to 10-15% of throughput. Of course, the schedulers that reach > a higher throughput leave the machine unusable during the test. > > So I really cannot see a reason why bfq could do worse than any of > these other schedulers for some single-queue device (conservatively) > below 300KIOPS. > > Finally, since, AFAICT, single-queue devices doing 400+ KIOPS are > probably less than 1% of all the single-queue storage around (USB > drives, HDDs, eMMC, standard SSDs, ...), by sticking to mq-deadline we > are sacrificing 99% of the hardware, to help 1% of the hardware for > one kind of test cases. > > Thanks, > Paolo > >> Thanks, >> >> Bart. >> > > -- > You received this message because you are subscribed to the Google Groups "bfq-iosched" group. 
> To unsubscribe from this group and stop receiving emails from it, send an email to bfq-iosched+unsubscribe@googlegroups.com. > For more options, visit https://groups.google.com/d/optout.
On Tue 16-10-18 11:35:59, Jens Axboe wrote: > On 10/15/18 1:44 PM, Paolo Valente wrote: > > Here are some old results with a very simple configuration: > > http://algo.ing.unimo.it/people/paolo/disk_sched/old-results/4.4.0-v7r11/ > > http://algo.ing.unimo.it/people/paolo/disk_sched/old-results/3.14.0-v7r3/ > > http://algo.ing.unimo.it/people/paolo/disk_sched/old-results/3.13.0-v7r2/ > > > > Then I stopped repeating tests that always yielded the same good results. > > > > As for more professional systems, a well-known company doing > > real-time packet-traffic dumping asked me to modify bfq so as to > > guarantee lossless data writing also during queries. The involved box > > had a RAID reaching a few Gbps, and everything worked well. > > > > Anyway, if you have specific issues in mind, I can check more deeply. > > Do you have anything more recent? All of these predate the current > code (by a lot), and isn't even mq. I'm mostly just interested in > plain fast NVMe device, and a big box hardware raid setup with > a ton of drives. > > I do still think that this should be going through the distros, they > need to be the ones driving this, as they will ultimately be the > ones getting customer reports on regressions. The qual/test cycle > they do is useful for this. In mainline, if we make a change like > this, we'll figure out if it worked many releases down the line. Well, the problem with this is that big distro people really don't care much because they already use udev for tuning the IO scheduler. So whatever defaults the kernel is going to pick likely won't be seen by distro customers. Embedded people seem to be driving this effort because they either don't run udev or they feel not all their teams building new products have enough expertise to come up with a proper set of rules... Honza
On 10/17/18 3:05 AM, Jan Kara wrote: > Well, the problem with this is that big distro people really don't care > much because they already use udev for tuning the IO scheduler. So whatever > defaults the kernel is going to pick likely won't be seen by distro > customers. Embedded people seem to be driving this effort because they > either don't run udev or they feel not all their teams building new > products have enough expertise to come up with a proper set of rules... What's missing in this discussion is a definition of "embedded system". Is that a system like a streaming player for TV channels that has neither a keyboard nor a display, or a system that can run multiple apps simultaneously, like a smartphone? I think the difference matters because some embedded devices hardly do any background I/O, nor load any executable code from storage after boot. So at least for some embedded devices the problem discussed in this e-mail thread does not exist. Bart.
On Wed, Oct 17, 2018 at 10:48 AM, Bart Van Assche <bvanassche@acm.org> wrote: > On 10/17/18 3:05 AM, Jan Kara wrote: >> >> Well, the problem with this is that big distro people really don't care >> much because they already use udev for tuning the IO scheduler. So >> whatever >> defaults the kernel is going to pick likely won't be seen by distro >> customers. Embedded people seem to be driving this effort because they >> either don't run udev or they feel not all their teams building new >> products have enough expertise to come up with a proper set of rules... > > > What's missing in this discussion is a definition of "embedded system". Is > that a system like a streaming player for TV channels that neither has a > keyboard nor a display or a system that can run multiple apps simultaneously > like a smartphone? I think the difference matters because some embedded > devices hardly do any background I/O nor load any executable code from > storage after boot. So at least for some embedded devices the problem > discussed in this e-mail thread does not exist. > > Bart. There are high-performance embedded systems on the market (NAS, etc.). I feel strongly about preventing users from running into errors because of an incorrect scheduler default; I encountered that situation three times in my testing with zoned block devices. The switch to SCSI_MQ would resolve that, since mq-deadline is the default, but in my case, I was using Fedora 28, which disables CONFIG_SCSI_MQ_DEFAULT (which is enabled in the 4.18 kernel), so my default scheduler was cfq. Hopefully there aren't any other cases where choosing the "wrong default scheduler" leads to errors. Ideally the default scheduler choice should prevent any errors, leaving it up to the distros to configure a default via other methods to optimize for performance. Thanks, Bryan
On Wed, Oct 17, 2018 at 07:48:33AM -0700, Bart Van Assche wrote: > On 10/17/18 3:05 AM, Jan Kara wrote: > > Well, the problem with this is that big distro people really don't care > > much because they already use udev for tuning the IO scheduler. So whatever > > defaults the kernel is going to pick likely won't be seen by distro > > customers. Embedded people seem to be driving this effort because they > > either don't run udev or they feel not all their teams building new > > products have enough expertise to come up with a proper set of rules... > What's missing in this discussion is a definition of "embedded system". Is > that a system like a streaming player for TV channels that neither has a > keyboard nor a display or a system that can run multiple apps simultaneously > like a smartphone? I think the difference matters because some embedded > devices hardly do any background I/O nor load any executable code from > storage after boot. So at least for some embedded devices the problem > discussed in this e-mail thread does not exist. It's a combination of things - smartphones are definitely part of the target audience, but other things can be affected too; I'd guess your streaming TV player example could have issues if it's got local storage and downloads things in the background, for example. There are definitely systems that never really use storage once they're booted, but there are also things that move data around and/or have interactive apps. Even with some of the things that don't really use storage at runtime, it can be important to help cut down boot times.
On 10/17/18 4:05 AM, Jan Kara wrote: > On Tue 16-10-18 11:35:59, Jens Axboe wrote: >> On 10/15/18 1:44 PM, Paolo Valente wrote: >>> Here are some old results with a very simple configuration: >>> http://algo.ing.unimo.it/people/paolo/disk_sched/old-results/4.4.0-v7r11/ >>> http://algo.ing.unimo.it/people/paolo/disk_sched/old-results/3.14.0-v7r3/ >>> http://algo.ing.unimo.it/people/paolo/disk_sched/old-results/3.13.0-v7r2/ >>> >>> Then I stopped repeating tests that always yielded the same good results. >>> >>> As for more professional systems, a well-known company doing >>> real-time packet-traffic dumping asked me to modify bfq so as to >>> guarantee lossless data writing also during queries. The involved box >>> had a RAID reaching a few Gbps, and everything worked well. >>> >>> Anyway, if you have specific issues in mind, I can check more deeply. >> >> Do you have anything more recent? All of these predate the current >> code (by a lot), and isn't even mq. I'm mostly just interested in >> plain fast NVMe device, and a big box hardware raid setup with >> a ton of drives. >> >> I do still think that this should be going through the distros, they >> need to be the ones driving this, as they will ultimately be the >> ones getting customer reports on regressions. The qual/test cycle >> they do is useful for this. In mainline, if we make a change like >> this, we'll figure out if it worked many releases down the line. > > Well, the problem with this is that big distro people really don't care > much because they already use udev for tuning the IO scheduler. So whatever > defaults the kernel is going to pick likely won't be seen by distro > customers. Embedded people seem to be driving this effort because they > either don't run udev or they feel not all their teams building new > products have enough expertise to come up with a proper set of rules... Which is also the approach that I've been advocating for here, instead of a kernel patch...
On Wed 17-10-18 10:29:22, Jens Axboe wrote: > On 10/17/18 4:05 AM, Jan Kara wrote: > > On Tue 16-10-18 11:35:59, Jens Axboe wrote: > >> On 10/15/18 1:44 PM, Paolo Valente wrote: > >>> Here are some old results with a very simple configuration: > >>> http://algo.ing.unimo.it/people/paolo/disk_sched/old-results/4.4.0-v7r11/ > >>> http://algo.ing.unimo.it/people/paolo/disk_sched/old-results/3.14.0-v7r3/ > >>> http://algo.ing.unimo.it/people/paolo/disk_sched/old-results/3.13.0-v7r2/ > >>> > >>> Then I stopped repeating tests that always yielded the same good results. > >>> > >>> As for more professional systems, a well-known company doing > >>> real-time packet-traffic dumping asked me to modify bfq so as to > >>> guarantee lossless data writing also during queries. The involved box > >>> had a RAID reaching a few Gbps, and everything worked well. > >>> > >>> Anyway, if you have specific issues in mind, I can check more deeply. > >> > >> Do you have anything more recent? All of these predate the current > >> code (by a lot), and isn't even mq. I'm mostly just interested in > >> plain fast NVMe device, and a big box hardware raid setup with > >> a ton of drives. > >> > >> I do still think that this should be going through the distros, they > >> need to be the ones driving this, as they will ultimately be the > >> ones getting customer reports on regressions. The qual/test cycle > >> they do is useful for this. In mainline, if we make a change like > >> this, we'll figure out if it worked many releases down the line. > > > > Well, the problem with this is that big distro people really don't care > > much because they already use udev for tuning the IO scheduler. So whatever > > defaults the kernel is going to pick likely won't be seen by distro > > customers. Embedded people seem to be driving this effort because they > > either don't run udev or they feel not all their teams building new > > products have enough expertise to come up with a proper set of rules... > > Which is also the approach that I've been advocating for here, instead > of a kernel patch... I know you've been advocating the use of udev for IO scheduler selection. But do you want to force everybody to use udev? And for people who build their own (usually small) systems, do you want to force them to think about IO scheduler selection and writing appropriate rules? These are the problems people were mentioning and I'm not sure what is your opinion on this. Honza
On 10/18/18 1:21 AM, Jan Kara wrote: > On Wed 17-10-18 10:29:22, Jens Axboe wrote: >> On 10/17/18 4:05 AM, Jan Kara wrote: >>> On Tue 16-10-18 11:35:59, Jens Axboe wrote: >>>> On 10/15/18 1:44 PM, Paolo Valente wrote: >>>>> Here are some old results with a very simple configuration: >>>>> http://algo.ing.unimo.it/people/paolo/disk_sched/old-results/4.4.0-v7r11/ >>>>> http://algo.ing.unimo.it/people/paolo/disk_sched/old-results/3.14.0-v7r3/ >>>>> http://algo.ing.unimo.it/people/paolo/disk_sched/old-results/3.13.0-v7r2/ >>>>> >>>>> Then I stopped repeating tests that always yielded the same good results. >>>>> >>>>> As for more professional systems, a well-known company doing >>>>> real-time packet-traffic dumping asked me to modify bfq so as to >>>>> guarantee lossless data writing also during queries. The involved box >>>>> had a RAID reaching a few Gbps, and everything worked well. >>>>> >>>>> Anyway, if you have specific issues in mind, I can check more deeply. >>>> >>>> Do you have anything more recent? All of these predate the current >>>> code (by a lot), and isn't even mq. I'm mostly just interested in >>>> plain fast NVMe device, and a big box hardware raid setup with >>>> a ton of drives. >>>> >>>> I do still think that this should be going through the distros, they >>>> need to be the ones driving this, as they will ultimately be the >>>> ones getting customer reports on regressions. The qual/test cycle >>>> they do is useful for this. In mainline, if we make a change like >>>> this, we'll figure out if it worked many releases down the line. >>> >>> Well, the problem with this is that big distro people really don't care >>> much because they already use udev for tuning the IO scheduler. So whatever >>> defaults the kernel is going to pick likely won't be seen by distro >>> customers. Embedded people seem to be driving this effort because they >>> either don't run udev or they feel not all their teams building new >>> products have enough expertise to come up with a proper set of rules... >> >> Which is also the approach that I've been advocating for here, instead >> of a kernel patch... > > I know you've been advocating the use of udev for IO scheduler selection. > But do you want to force everybody to use udev? And for people who build > their own (usually small) systems, do you want to force them to think about > IO scheduler selection and writing appropriate rules? These are the > problems people were mentioning and I'm not sure what is your opinion on > this. I don't want to force everybody to use udev, use whatever you like on your platform. For most people that is udev, for embedded it's something else. As you said, distros already do this via udev. When I've had to do it on my systems, I've added a udev rule to do it. My opinion is that the kernel makes various schedulers available. Deciding which one to use is policy that should go into user space. The default should be something that's solid and works, fancier setups and tuning should be left to user space.
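For readers unfamiliar with the udev approach being advocated here, a rule of roughly this shape is what distributions and individual systems typically use. The file name and the match on sd devices are illustrative choices, not something prescribed in this thread:

    # /etc/udev/rules.d/60-iosched.rules
    # Select BFQ for SCSI/SATA disks as they appear.
    ACTION=="add|change", KERNEL=="sd[a-z]", ATTR{queue/scheduler}="bfq"

The rule writes the sysfs queue/scheduler attribute when the block device is added, which is the same knob one would otherwise poke by hand.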
Hi!

>>> Which is also the approach that I've been advocating for here, instead of a kernel patch...
>>
>> I know you've been advocating the use of udev for IO scheduler selection. But do you want to force everybody to use udev? And for people who build their own (usually small) systems, do you want to force them to think about IO scheduler selection and write appropriate rules? These are the problems people were mentioning, and I'm not sure what your opinion on them is.
>
> I don't want to force everybody to use udev; use whatever you like on your platform. For most people that is udev, for embedded it's something else. As you said, distros already do this via udev. When I've had to do it on my systems, I've added a udev rule to do it.

This is not really helpful.

So you want me and everyone else and everyone on embedded to mess with udev? No, thanks.

There are people booting with init=/bin/bash, too, running fsck. Wouldn't it be nice to use a reasonable scheduler there?

> My opinion is that the kernel makes various schedulers available. Deciding which one to use is policy that should go into user space. The default should be something that's solid and works; fancier setups and tuning should be left to user space.

The kernel should do the reasonable thing by default, and that seems to be easy in this case.

You keep repeating "but someone's super-fast RAID might get slowed down". Those 5 people in the world probably already have their udev rules. Now, let's do the right thing by default for the rest of the world, including you.

									Pavel
On Mon, Oct 15, 2018 at 4:32 PM Oleksandr Natalenko <oleksandr@natalenko.name> wrote:
> On 15.10.2018 16:10, Linus Walleij wrote:
>> +	/*
>> +	 * Zoned devices must use a deadline scheduler because currently
>> +	 * that is the only scheduler respecting zoned writes.
>> +	 */
>> +	if (blk_queue_is_zoned(q))
>> +		policy = "mq-deadline";
>> +	else if (IS_ENABLED(CONFIG_IOSCHED_BFQ))
>> +		policy = "bfq";
>> +	else
>> +		policy = "mq-deadline";
>
> If more rules are needed in the future, shall we just add extra ifs, or would it be better to craft some struct/table now, plus a policy search helper?

Let's do it when it happens. Premature optimization is the root of all evil ;)

Yours,
Linus Walleij
On Wed, Oct 17, 2018 at 4:59 PM Bryan Gurney <bgurney@redhat.com> wrote:
> I feel strongly about preventing users from running into errors because of an incorrect scheduler default, because I encountered that situation three times in my testing with zoned block devices. The switch to SCSI_MQ would resolve that, since mq-deadline is the default, but in my case I was using Fedora 28, which disables CONFIG_SCSI_MQ_DEFAULT (which is enabled in the 4.18 kernel), so my default scheduler was cfq.

I think we should make a patch to the kernel that makes it impossible (even from sysfs) to choose a non-zone-aware scheduler for these devices.

It's a separate topic from the $SUBJECT patch, though. I do take this into account in this version.

Yours,
Linus Walleij
Hi.

On 19.10.2018 10:33, Linus Walleij wrote:
>>> +	/*
>>> +	 * Zoned devices must use a deadline scheduler because currently
>>> +	 * that is the only scheduler respecting zoned writes.
>>> +	 */
>>> +	if (blk_queue_is_zoned(q))
>>> +		policy = "mq-deadline";
>>> +	else if (IS_ENABLED(CONFIG_IOSCHED_BFQ))
>>> +		policy = "bfq";
>>> +	else
>>> +		policy = "mq-deadline";
>>
>> If more rules are needed in the future, shall we just add extra ifs, or would it be better to craft some struct/table now, plus a policy search helper?
>
> Let's do it when it happens. Premature optimization is the root of all evil ;)

I'd say this is a matter of code readability, not optimisation. I do not strongly object to the current approach, though.
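[Editor's note: for what it's worth, the table-driven variant being discussed could look roughly like the sketch below. This is only an illustration of the idea, not code from any posted patch; all identifiers (struct elv_default_rule, elevator_default_policy, and the rule helpers) are invented here.]

/*
 * Sketch of a table-driven default-policy lookup, reusing the helpers
 * that appear in Linus' patch (blk_queue_is_zoned(), IS_ENABLED()).
 * All names are hypothetical.
 */
struct elv_default_rule {
	bool (*applies)(struct request_queue *q);
	const char *sched;
};

static bool elv_rule_zoned(struct request_queue *q)
{
	/* Zoned devices must get a deadline scheduler. */
	return blk_queue_is_zoned(q);
}

static bool elv_rule_bfq_built(struct request_queue *q)
{
	return IS_ENABLED(CONFIG_IOSCHED_BFQ);
}

static const struct elv_default_rule elv_default_rules[] = {
	{ elv_rule_zoned,     "mq-deadline" },
	{ elv_rule_bfq_built, "bfq" },
	{ NULL,               "mq-deadline" },	/* terminating fallback */
};

static const char *elevator_default_policy(struct request_queue *q)
{
	const struct elv_default_rule *r;

	/* First matching rule wins; the NULL entry ends the scan. */
	for (r = elv_default_rules; r->applies; r++)
		if (r->applies(q))
			return r->sched;
	return r->sched;	/* terminating entry is the fallback */
}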
> On 15 Oct 2018, at 20:26, Paolo Valente <paolo.valente@linaro.org> wrote:
>
> ...
>> This kind of policy does not belong in the kernel, at least not in the current form. If we had some sort of "enable best options for a desktop" then it could fall under that umbrella.
>
> I don't think bfq can be considered a scheduler for only desktops any longer.

Hi Jens,
this reply of mine kept bugging me until I understood my mistake.

The fact that I consider bfq good also for servers *does not* imply that having bfq in desktops is to be refused.

As for the option you are hinting at, I also acknowledge that it would be trivial for an admin/developer to know whether a given kernel is meant for a desktop/personal system, while it is more difficult to choose explicitly among the various I/O schedulers available.

So, I apologize for my shortsighted initial reply, and ask if you can elaborate a little more on this. I'm willing to help if I can.

Thanks,
Paolo

> Thanks,
> Paolo
>
>> --
>> Jens Axboe
On Fri, Oct 19, 2018 at 4:42 AM, Linus Walleij <linus.walleij@linaro.org> wrote:
> On Wed, Oct 17, 2018 at 4:59 PM Bryan Gurney <bgurney@redhat.com> wrote:
>> I feel strongly about preventing users from running into errors because of an incorrect scheduler default, because I encountered that situation three times in my testing with zoned block devices. The switch to SCSI_MQ would resolve that, since mq-deadline is the default, but in my case I was using Fedora 28, which disables CONFIG_SCSI_MQ_DEFAULT (which is enabled in the 4.18 kernel), so my default scheduler was cfq.
>
> I think we should make a patch to the kernel that makes it impossible (even from sysfs) to choose a non-zone-aware scheduler for these devices.
>
> It's a separate topic from the $SUBJECT patch, though. I do take this into account in this version.

I like this idea. I don't have enough experience to write this patch myself, but I imagine something like adding "bool is_zoned_aware" to "struct elevator_type" and setting it true only for the schedulers that are currently zoned-device aware (currently deadline on the legacy single-queue path, and mq-deadline on blk-mq).

Thanks,

Bryan
On 19/10/18 15:36, Bryan Gurney wrote:
> I like this idea. I don't have enough experience to write this patch myself, but I imagine something like adding "bool is_zoned_aware" to "struct elevator_type" and setting it true only for the schedulers that are currently zoned-device aware (currently deadline on the legacy single-queue path, and mq-deadline on blk-mq).

I don't think this is needed at the moment: Jens is working on getting rid of the legacy path, which leaves us with mq-deadline only, and Linus' patch has:

+	if (blk_queue_is_zoned(q))
+		policy = "mq-deadline";

which chooses mq-deadline on a zoned device. So nothing to worry about here for now.

All this only if Linus' patch actually gets merged.

Byte,
Johannes
On Fri, Oct 19, 2018 at 9:44 AM, Johannes Thumshirn <jthumshirn@suse.de> wrote:
> On 19/10/18 15:36, Bryan Gurney wrote:
>> I like this idea. I don't have enough experience to write this patch myself, but I imagine something like adding "bool is_zoned_aware" to "struct elevator_type" and setting it true only for the schedulers that are currently zoned-device aware (currently deadline on the legacy single-queue path, and mq-deadline on blk-mq).
>
> I don't think this is needed at the moment: Jens is working on getting rid of the legacy path,

Once the legacy schedulers are gone, the default (prior to Linus' proposed patch) will be mq-deadline, which is zoned-device aware. So the default scheduler will be "safer" for zoned devices.

However, it will still be possible for users (or distro defaults) to select a non-zone-aware scheduler, such as "none", "kyber", or "bfq" (prior to this patch). So there would still be a window for users to encounter the same problems I found, where aborted commands start occurring during otherwise normal filesystem or storage activity, even with drivers that are otherwise compliant with the handling characteristics of zoned block devices.

> which leaves us with mq-deadline only, and Linus' patch has:
>
> +	if (blk_queue_is_zoned(q))
> +		policy = "mq-deadline";
>
> which chooses mq-deadline on a zoned device. So nothing to worry about here for now.
>
> All this only if Linus' patch actually gets merged.

I hope it does get merged. I keep forgetting to save my "zoned devices use deadline" udev rule on my SMR drive test machine in between reinstalls.

Thanks,

Bryan
On 10/19/18 2:22 AM, Pavel Machek wrote:
> Hi!
>
>>>> Which is also the approach that I've been advocating for here, instead of a kernel patch...
>>>
>>> I know you've been advocating the use of udev for IO scheduler selection. But do you want to force everybody to use udev? And for people who build their own (usually small) systems, do you want to force them to think about IO scheduler selection and write appropriate rules? These are the problems people were mentioning, and I'm not sure what your opinion on them is.
>>
>> I don't want to force everybody to use udev; use whatever you like on your platform. For most people that is udev, for embedded it's something else. As you said, distros already do this via udev. When I've had to do it on my systems, I've added a udev rule to do it.
>
> This is not really helpful.
>
> So you want me and everyone else and everyone on embedded to mess with udev? No, thanks.

Did you read what I wrote?

> There are people booting with init=/bin/bash, too, running fsck. Wouldn't it be nice to use a reasonable scheduler there?

I can pretty much guarantee that fsck will run at the same speed, regardless of scheduler. And users generally don't care about ultimate fairness on the device while running fsck...

If you (or someone else) don't want to use udev, use whatever you want. You're doing something heavily customized at that point anyway; surely this isn't a show stopper.

>> My opinion is that the kernel makes various schedulers available. Deciding which one to use is policy that should go into user space. The default should be something that's solid and works; fancier setups and tuning should be left to user space.
>
> The kernel should do the reasonable thing by default, and that seems to be easy in this case.

I agree, we just differ on what we consider the reasonable choice to be.
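[Editor's note: as a concrete illustration of the "use whatever you want" route, the scheduler can be inspected and switched per device at runtime through sysfs, with no udev involved, e.g. from an init script or a rescue shell. The device name and the set of available schedulers shown are illustrative.]

  # cat /sys/block/mmcblk0/queue/scheduler
  [mq-deadline] kyber bfq none
  # echo bfq > /sys/block/mmcblk0/queue/scheduler
  # cat /sys/block/mmcblk0/queue/scheduler
  mq-deadline kyber [bfq] none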
On 10/19/18 2:42 AM, Linus Walleij wrote:
> On Wed, Oct 17, 2018 at 4:59 PM Bryan Gurney <bgurney@redhat.com> wrote:
>> I feel strongly about preventing users from running into errors because of an incorrect scheduler default, because I encountered that situation three times in my testing with zoned block devices. The switch to SCSI_MQ would resolve that, since mq-deadline is the default, but in my case I was using Fedora 28, which disables CONFIG_SCSI_MQ_DEFAULT (which is enabled in the 4.18 kernel), so my default scheduler was cfq.
>
> I think we should make a patch to the kernel that makes it impossible (even from sysfs) to choose a non-zone-aware scheduler for these devices.
>
> It's a separate topic from the $SUBJECT patch, though. I do take this into account in this version.

Yes I agree, and I'd be happy to take such a patch. The only matching we do now is mq-sched for mq-device, and vice versa. And that will be going away in 4.21, when there are no more !mq devices that use scheduling.

If your device is zoned, then you should not be able to switch to a scheduler that doesn't have support for that. The right approach here would be to add a capability flag to the IO schedulers.
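[Editor's note: a sketch of what such a capability flag might look like, purely for illustration; the flag name, the features field, and the helper below are invented here, not taken from any posted patch.]

/*
 * Hypothetical capability bit: a scheduler sets it only if it
 * preserves the sequential write ordering that zoned devices require.
 */
#define ELEVATOR_F_ZONED_WRITES	(1U << 0)

struct elevator_type {
	/* ... existing fields ... */
	unsigned int features;	/* ELEVATOR_F_* capability bits */
};

/*
 * Called from the sysfs "scheduler" store path before switching:
 * reject any scheduler that lacks a capability the queue needs.
 */
static bool elv_supports_queue(struct request_queue *q,
			       struct elevator_type *e)
{
	if (blk_queue_is_zoned(q) &&
	    !(e->features & ELEVATOR_F_ZONED_WRITES))
		return false;
	return true;
}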
On 10/19/18 4:59 AM, Paolo Valente wrote:
>> On 15 Oct 2018, at 20:26, Paolo Valente <paolo.valente@linaro.org> wrote:
>>
>> ...
>>> This kind of policy does not belong in the kernel, at least not in the current form. If we had some sort of "enable best options for a desktop" then it could fall under that umbrella.
>>
>> I don't think bfq can be considered a scheduler for only desktops any longer.
>
> Hi Jens,
> this reply of mine kept bugging me until I understood my mistake.
>
> The fact that I consider bfq good also for servers *does not* imply that having bfq in desktops is to be refused.
>
> As for the option you are hinting at, I also acknowledge that it would be trivial for an admin/developer to know whether a given kernel is meant for a desktop/personal system, while it is more difficult to choose explicitly among the various I/O schedulers available.
>
> So, I apologize for my shortsighted initial reply, and ask if you can elaborate a little more on this. I'm willing to help if I can.

I think I've written about this multiple times now, but for me it really just boils down to sane defaults, and to policy not belonging in the kernel. BFQ is very complicated, about 10K lines of code. I'm not comfortable making that the default right now; as I've mentioned in other replies, I think something like that should be driven by the distros, as they will ultimately be the ones that usually get complaints about behavioral changes that adversely impact performance. This isn't just about running some benchmarks and calling it a day.

Maybe some day we can make it the default on mq for single queue devices, but I just don't think we are there yet in terms of coverage. While I don't work for a distro anymore, I do have my hands dirty with a fairly substantial deployment at work. There we run mq-deadline on single queue devices, and kyber on multiqueue-capable devices.
Hi.

On 16.10.2018 19:35, Jens Axboe wrote:
> Do you have anything more recent? All of these predate the current code (by a lot), and aren't even mq. I'm mostly just interested in a plain fast NVMe device, and a big box hardware raid setup with a ton of drives.
>
> I do still think that this should be going through the distros; they need to be the ones driving this, as they will ultimately be the ones getting customer reports on regressions. The qual/test cycle they do is useful for this. In mainline, if we make a change like this, we'll figure out if it worked many releases down the line.

Here are some benchmarks for a non-RAID setup, obtained with the S suite. They are from a Lenovo T460s with a SAMSUNG MZNTY256HDHP-000L7 SSD, running a v4.19 kernel with all recent BFQ patches applied.

# replayed gnome terminal startup throughput
# Workload        bfq        mq-deadline
0r-raw_seq        13.2617    13.4867
10r-raw_seq       512.507    539.95

# replayed gnome terminal startup time
# Workload        bfq        mq-deadline
0r-raw_seq        0.43       0.4
10r-raw_seq       0.685      4.1625

# replayed lowriter startup throughput
# Workload        bfq        mq-deadline
0r-raw_seq        9.985      10.375
10r-raw_seq       516.62     539.61

# replayed lowriter startup time
# Workload        bfq        mq-deadline
0r-raw_seq        0.4        0.3875
10r-raw_seq       0.535      2.3875

# replayed xterm startup throughput
# Workload        bfq        mq-deadline
0r-raw_seq        5.93833    6.10834
10r-raw_seq       524.447    539.991

# replayed xterm startup time
# Workload        bfq        mq-deadline
0r-raw_seq        0.23       0.23
10r-raw_seq       0.38       1.56

# throughput
# Workload        bfq        mq-deadline
10r-raw_rand      362.446    363.817
10r-raw_seq       537.646    540.609
1r-raw_seq        500.733    502.526

Throughput-wise, BFQ is on par with mq-deadline. Latency-wise, BFQ is much, much better.