Message ID | ef431f1c-e3c6-4940-bb2a-f5131ca96855@kernel.dk (mailing list archive) |
---|---|
State | New, archived |
On Tue, Feb 21, 2017 at 3:15 PM, Jens Axboe <axboe@kernel.dk> wrote:
>
> But under a device managed by blk-mq, that device exposes a number of
> hardware queues. For older style devices, that number is typically 1
> (single queue).

... but why would this ever be different from the normal IO scheduler?

IOW, what makes single-queue mq scheduling so special that

 (a) it needs its own config option

 (b) it is different from just the regular IO scheduler in the first place?

So the whole thing stinks. The fact that it then has an incomprehensible config option seems to be just gravy on top of the crap.

> "none" just means that we don't have a scheduler attached.

.. which makes no sense to me in the first place.

People used to try to convince us that doing IO schedulers was a mistake, because modern disk hardware did a better job than we could do in software.

Those people were full of crap. The regular IO scheduler used to have a "NONE" option too. Maybe it even still has one, but only insane people actually use it.

Why is the MQ stuff magically so different that NONE would make sense at all?

And equally importantly: why do we _ask_ people about these issues? Is this some kind of sick "cover your ass" thing, where you can say "well, I asked about it", when inevitably the choice ends up being the wrong one?

We have too damn many Kconfig options as-is, I'm trying to push back on them. These two options seem fundamentally broken and stupid.

The "we have no good idea, so let's add a Kconfig option" seems like a broken excuse for these things existing.

So why ask this question in the first place?

Is there any possible reason why "NONE" is a good option at all? And if it is the _only_ option (because no other better choice exists), it damn well shouldn't be a kconfig option!

                 Linus
On 02/21/2017 04:23 PM, Linus Torvalds wrote:
> On Tue, Feb 21, 2017 at 3:15 PM, Jens Axboe <axboe@kernel.dk> wrote:
>>
>> But under a device managed by blk-mq, that device exposes a number of
>> hardware queues. For older style devices, that number is typically 1
>> (single queue).
>
> ... but why would this ever be different from the normal IO scheduler?

Because we have a different set of schedulers for blk-mq than for the legacy path. mq-deadline is a basic port that will work fine with rotational storage, but it's not going to be a good choice for NVMe because of scalability issues. We'll have BFQ on the blk-mq side, catering to the needs of those folks that currently rely on the richer feature set that CFQ supports.

We've continually been working towards getting rid of the legacy IO path and its set of schedulers. So if it's any consolation, those options will go away in the future.

> IOW, what makes single-queue mq scheduling so special that
>
> (a) it needs its own config option
>
> (b) it is different from just the regular IO scheduler in the first place?
>
> So the whole thing stinks. The fact that it then has an
> incomprehensible config option seems to be just gravy on top of the
> crap.

What do you mean by "the regular IO scheduler"? These are different schedulers. As explained above, single-queue mq devices generally DO want mq-deadline. For multi-queue mq devices, we don't have a good choice right now, so we retain the current behavior (that we've had since blk-mq was introduced in 3.13) of NOT doing any IO scheduling for them. If you do want scheduling for them, set the option, or configure udev to make the right choice for you.

I agree the wording isn't great, and we can improve that. But I do think that the current choices make sense.

>> "none" just means that we don't have a scheduler attached.
>
> .. which makes no sense to me in the first place.
>
> People used to try to convince us that doing IO schedulers was a
> mistake, because modern disk hardware did a better job than we could
> do in software.
>
> Those people were full of crap. The regular IO scheduler used to have
> a "NONE" option too. Maybe it even still has one, but only insane
> people actually use it.
>
> Why is the MQ stuff magically so different that NONE would make sense at all?

I was never one of those people, and I've always been a strong advocate for imposing scheduling to keep devices in check. The regular IO scheduler pool includes "noop", which is probably the one you are thinking of. That one is a bit different than the new "none" option for blk-mq, in that it does do insertion sorts and it does do merges. "none" does some merging, but only where it happens to make sense. There's no insertion sorting.

> And equally importantly: why do we _ask_ people about these issues? Is this
> some kind of sick "cover your ass" thing, where you can say "well, I
> asked about it", when inevitably the choice ends up being the wrong
> one?
>
> We have too damn many Kconfig options as-is, I'm trying to push back
> on them. These two options seem fundamentally broken and stupid.
>
> The "we have no good idea, so let's add a Kconfig option" seems like a
> broken excuse for these things existing.
>
> So why ask this question in the first place?
>
> Is there any possible reason why "NONE" is a good option at all? And
> if it is the _only_ option (because no other better choice exists), it
> damn well shouldn't be a kconfig option!

I'm all for NOT asking questions, and not providing tunables. That's generally how I write code. See the blk-wbt stuff, for instance, which basically has just one tunable that's set sanely by default, and we figure out the rest.

I don't want to regress performance of blk-mq devices by attaching mq-deadline to them. When we do have a sane scheduler choice, we'll make that the default. And yes, maybe we can remove the Kconfig option at that point. For single queue devices, we could kill the option. But we're expecting bfq-mq for 4.12, and we'll want to have the option at that point, unless you want to rely solely on runtime setting of the scheduler through udev or by the sysadmin.
On Wed, Feb 22, 2017 at 10:14 AM, Jens Axboe <axboe@kernel.dk> wrote:
>
> What do you mean by "the regular IO scheduler"? These are different
> schedulers.

Not to the user they aren't.

If the user already answered once about the IO schedulers, we damn well shouldn't ask again about another small implementation detail.

How hard is this to understand? You're asking users stupid things.

It's not just about the wording. It's a fundamental issue. These questions are about internal implementation details. They make no sense to a user. They don't even make sense to a kernel developer, for chrissake!

Don't make the kconfig mess worse. This "we can't make good defaults in the kernel, so ask users about random things that they cannot possibly answer" model is not an acceptable model.

If the new schedulers aren't better than NOOP, they shouldn't exist. And if you want people to be able to test, they should be dynamic.

And dammit, IF YOU DON'T EVEN KNOW, WHY THE HELL ARE YOU ASKING THE POOR USER?

It's really that simple.

                 Linus
On 02/22/2017 11:26 AM, Linus Torvalds wrote:
> On Wed, Feb 22, 2017 at 10:14 AM, Jens Axboe <axboe@kernel.dk> wrote:
>>
>> What do you mean by "the regular IO scheduler"? These are different
>> schedulers.
>
> Not to the user they aren't.
>
> If the user already answered once about the IO schedulers, we damn
> well shouldn't ask again about another small implementation detail.
>
> How hard is this to understand? You're asking users stupid things.

The fact is that we have two different sets, until we can yank the old ones. So I can't just ask one question, since the sets aren't identical.

This IS confusing to the user, and it's an artifact of the situation that we have where we are phasing out the old IO path and switching to blk-mq. I don't want the user to know about blk-mq, I just want it to be what everything runs on. But until that happens, and it is happening, we are going to be stuck with that situation.

We have this exposed in other places, too. Like for dm, and for SCSI. Not a perfect situation, but something that WILL go away eventually.

> It's not just about the wording. It's a fundamental issue. These
> questions are about internal implementation details. They make no
> sense to a user. They don't even make sense to a kernel developer, for
> chrissake!
>
> Don't make the kconfig mess worse. This "we can't make good defaults
> in the kernel, so ask users about random things that they cannot
> possibly answer" model is not an acceptable model.

There are good defaults! mq single-queue should default to mq-deadline, and mq multi-queue should default to "none" for now. If you feel that strongly about it (and I'm guessing you do, judging by the typing speed and generally annoyed demeanor), then by all means, let's kill the config entries and I'll just hardwire the defaults.

The config entries were implemented similarly to the old schedulers, and each scheduler is selectable individually. I'd greatly prefer just improving the wording so it makes more sense.

> If the new schedulers aren't better than NOOP, they shouldn't exist.
> And if you want people to be able to test, they should be dynamic.

They are dynamic! You can build them as modules, you can switch at runtime. Just like we have always been able to. I can't make it more dynamic than that. We're reusing the same internal infrastructure for that, AND the user visible ABI for checking what is available and setting a new one.

> And dammit, IF YOU DON'T EVEN KNOW, WHY THE HELL ARE YOU ASKING THE POOR USER?

BECAUSE IT'S POLICY! Fact of the matter is, if I just default to what we had before, it'd all be running with none. In a few years time, if I'm lucky, someone will have shipped udev rules setting this appropriately. If I ask the question, we'll get testing NOW. People will run with the default set.
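For reference, the runtime ABI Jens refers to above is the per-device sysfs "scheduler" attribute. A minimal sketch of checking and switching it from a shell (device name "sda" assumed; the schedulers listed depend on what your kernel has built in or loaded as modules):

```
# Show available schedulers; the active one appears in brackets
$ cat /sys/block/sda/queue/scheduler
[mq-deadline] none

# Switch to "none" at runtime (needs root)
# echo none > /sys/block/sda/queue/scheduler
```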
On Wed, Feb 22, 2017 at 10:26 AM, Linus Torvalds <torvalds@linux-foundation.org> wrote:
>
> And dammit, IF YOU DON'T EVEN KNOW, WHY THE HELL ARE YOU ASKING THE POOR USER?

Basically, I'm pushing back on config options that I can't personally even sanely answer.

If it's a config option about "do I have a particular piece of hardware", it makes sense. But these new ones were just complete garbage.

The whole "default IO scheduler" thing is a disease. We should stop making up these shit schedulers and then say "we don't know which one works best for you".

All it does is encourage developers to make shortcuts and create crap that isn't generically useful, and then blame the user and say "well, you should have picked a different scheduler" when they say "this does not work well for me".

We have had too many of those kinds of broken choices. And when the new Kconfig options get so confusing and so esoteric that I go "Hmm, I have no idea if my hardware does a single queue or not", I put my foot down.

When the IO scheduler questions were about a generic IO scheduler for everything, I can kind of understand them. I think it was still a mistake (for the reasons outlined above), but at least it was a comprehensible question to ask.

But when it gets to "what should I do about a single-queue version of a MQ scheduler", the question is no longer even remotely sensible. The question should simply NOT EXIST. There is no possible valid reason to ask that kind of crap.

                 Linus
On 02/22/2017 11:42 AM, Linus Torvalds wrote:
> On Wed, Feb 22, 2017 at 10:26 AM, Linus Torvalds
> <torvalds@linux-foundation.org> wrote:
>>
>> And dammit, IF YOU DON'T EVEN KNOW, WHY THE HELL ARE YOU ASKING THE POOR USER?
>
> Basically, I'm pushing back on config options that I can't personally
> even sanely answer.

I got that much, and I don't disagree on that part.

> If it's a config option about "do I have a particular piece of
> hardware", it makes sense. But these new ones were just complete
> garbage.
>
> The whole "default IO scheduler" thing is a disease. We should stop
> making up these shit schedulers and then say "we don't know which one
> works best for you".
>
> All it does is encourage developers to make shortcuts and create crap
> that isn't generically useful, and then blame the user and say "well,
> you should have picked a different scheduler" when they say "this does
> not work well for me".
>
> We have had too many of those kinds of broken choices. And when the
> new Kconfig options get so confusing and so esoteric that I go "Hmm, I
> have no idea if my hardware does a single queue or not", I put my foot
> down.
>
> When the IO scheduler questions were about a generic IO scheduler for
> everything, I can kind of understand them. I think it was still a
> mistake (for the reasons outlined above), but at least it was a
> comprehensible question to ask.
>
> But when it gets to "what should I do about a single-queue version of
> a MQ scheduler", the question is no longer even remotely sensible. The
> question should simply NOT EXIST. There is no possible valid reason to
> ask that kind of crap.

OK, so here's what I'll do:

1) We'll kill the default scheduler choices. sq blk-mq will default to mq-deadline, mq blk-mq will default to "none" (at least for now, until the new scheduler is done).

2) The individual schedulers will be y/m/n selectable, just like any other driver.

I hope that works for everyone.
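In concrete terms, making a scheduler y/m/n selectable "just like any other driver" means a plain Kconfig tristate entry rather than a default-choice block. A rough sketch of what such an entry could look like (the symbol name matches the MQ_IOSCHED_DEADLINE option referenced in the diff below; the help text is assumed for illustration, not the final patch):

```
config MQ_IOSCHED_DEADLINE
	tristate "MQ deadline I/O scheduler"
	default y
	help
	  MQ version of the deadline IO scheduler, intended as the default
	  for single-queue blk-mq devices unless overridden at runtime via
	  sysfs or a udev rule.
```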
On Wed, Feb 22, 2017 at 10:41 AM, Jens Axboe <axboe@kernel.dk> wrote:
>
> The fact is that we have two different sets, until we can yank
> the old ones. So I can't just ask one question, since the sets
> aren't identical.

Bullshit.

I'm saying: rip out the question ENTIRELY. For *both* cases.

If you cannot yourself give a good answer, then there's no f*cking way any user can give a good answer. So asking the question is totally and utterly pointless.

All it means is that different people will try different (in random ways) configurations, and the end result is random crap.

So get rid of those questions. Pick a default, and live with it. And if people complain about performance, fix the performance issue.

It's that simple.

                 Linus
On 02/22/2017 11:45 AM, Linus Torvalds wrote:
> On Wed, Feb 22, 2017 at 10:41 AM, Jens Axboe <axboe@kernel.dk> wrote:
>>
>> The fact is that we have two different sets, until we can yank
>> the old ones. So I can't just ask one question, since the sets
>> aren't identical.
>
> Bullshit.
>
> I'm saying: rip out the question ENTIRELY. For *both* cases.
>
> If you cannot yourself give a good answer, then there's no f*cking way
> any user can give a good answer. So asking the question is totally and
> utterly pointless.
>
> All it means is that different people will try different (in random
> ways) configurations, and the end result is random crap.
>
> So get rid of those questions. Pick a default, and live with it. And
> if people complain about performance, fix the performance issue.
>
> It's that simple.

No, it's not that simple at all. Fact is, some optimizations make sense for some workloads, and some do not. CFQ works great for some cases, and it works poorly for others, even if we try to make heuristics that enable it to work well for all cases. Some optimizations are costly; that's fine on certain types of hardware, or maybe that's a trade-off you want to make. Or we end up with tons of settings for a single driver, which does not reduce the configuration matrix at all.

By that logic, why do we have ANY config options outside of what drivers to build? What should I set HZ at? RCU options? Let's default to ext4, and kill off xfs? Or btrfs? slab/slob/slub/whatever? Yes, that's taking the argument a bit more to the extreme, but it's the same damn thing.

I'm fine with getting rid of the default selections, but we're NOT going to be able to have just one scheduler for everything. We can make sane defaults based on the hardware type.
On Wed, Feb 22, 2017 at 10:52 AM, Jens Axboe <axboe@kernel.dk> wrote:
>>
>> It's that simple.
>
> No, it's not that simple at all. Fact is, some optimizations make sense
> for some workloads, and some do not.

Are you even listening?

I'm saying no user can ever give a sane answer to your question. The question is insane and wrong.

I already said you can have a dynamic configuration (and maybe even an automatic heuristic - like saying that a ramdisk gets NOOP by default, real hardware does not).

But asking a user at kernel config time for a default is insane. If *you* cannot answer it, then the user sure as hell cannot.

Other configuration questions have problems too, but at least the question about "should I support ext4" is something a user (or distro) can sanely answer. So your comparisons are pure bullshit.

                 Linus
On 02/22/2017 11:56 AM, Linus Torvalds wrote:
> On Wed, Feb 22, 2017 at 10:52 AM, Jens Axboe <axboe@kernel.dk> wrote:
>>>
>>> It's that simple.
>>
>> No, it's not that simple at all. Fact is, some optimizations make sense
>> for some workloads, and some do not.
>
> Are you even listening?
>
> I'm saying no user can ever give a sane answer to your question. The
> question is insane and wrong.
>
> I already said you can have a dynamic configuration (and maybe even an
> automatic heuristic - like saying that a ramdisk gets NOOP by default,
> real hardware does not).
>
> But asking a user at kernel config time for a default is insane. If
> *you* cannot answer it, then the user sure as hell cannot.
>
> Other configuration questions have problems too, but at least the
> question about "should I support ext4" is something a user (or distro)
> can sanely answer. So your comparisons are pure bullshit.

As per the previous email, this was my proposed solution:

OK, so here's what I'll do:

1) We'll kill the default scheduler choices. sq blk-mq will default to mq-deadline, mq blk-mq will default to "none" (at least for now, until the new scheduler is done).

2) The individual schedulers will be y/m/n selectable, just like any other driver.

Any further settings on that can be done at runtime, through sysfs.
On Wed, Feb 22, 2017 at 10:58 AM, Jens Axboe <axboe@kernel.dk> wrote:
> On 02/22/2017 11:56 AM, Linus Torvalds wrote:
>
> OK, so here's what I'll do:
>
> 1) We'll kill the default scheduler choices. sq blk-mq will default to
> mq-deadline, mq blk-mq will default to "none" (at least for now, until
> the new scheduler is done).
> 2) The individual schedulers will be y/m/n selectable, just like any
> other driver.

Yes. That makes sense as options. I can (or, perhaps even more importantly, a distro can) answer those kinds of questions.

                 Linus
On 02/22/2017 12:04 PM, Linus Torvalds wrote:
> On Wed, Feb 22, 2017 at 10:58 AM, Jens Axboe <axboe@kernel.dk> wrote:
>> On 02/22/2017 11:56 AM, Linus Torvalds wrote:
>>
>> OK, so here's what I'll do:
>>
>> 1) We'll kill the default scheduler choices. sq blk-mq will default to
>> mq-deadline, mq blk-mq will default to "none" (at least for now, until
>> the new scheduler is done).
>> 2) The individual schedulers will be y/m/n selectable, just like any
>> other driver.
>
> Yes. That makes sense as options. I can (or, perhaps even more
> importantly, a distro can) answer those kinds of questions.

Someone misspelled pacman:

  parman (PARMAN) [N/m/y] (NEW) ?

  There is no help available for this option.

Or I think it's pacman, because I have no idea what else it could be. I'm going to say N.
On 2017.02.22 at 11:44 -0700, Jens Axboe wrote:
> On 02/22/2017 11:42 AM, Linus Torvalds wrote:
> > On Wed, Feb 22, 2017 at 10:26 AM, Linus Torvalds
> > <torvalds@linux-foundation.org> wrote:
> >>
> >> And dammit, IF YOU DON'T EVEN KNOW, WHY THE HELL ARE YOU ASKING THE POOR USER?
> >
> > Basically, I'm pushing back on config options that I can't personally
> > even sanely answer.
>
> I got that much, and I don't disagree on that part.
>
> > If it's a config option about "do I have a particular piece of
> > hardware", it makes sense. But these new ones were just complete
> > garbage.
> >
> > The whole "default IO scheduler" thing is a disease. We should stop
> > making up these shit schedulers and then say "we don't know which one
> > works best for you".
> >
> > All it does is encourage developers to make shortcuts and create crap
> > that isn't generically useful, and then blame the user and say "well,
> > you should have picked a different scheduler" when they say "this does
> > not work well for me".
> >
> > We have had too many of those kinds of broken choices. And when the
> > new Kconfig options get so confusing and so esoteric that I go "Hmm, I
> > have no idea if my hardware does a single queue or not", I put my foot
> > down.
> >
> > When the IO scheduler questions were about a generic IO scheduler for
> > everything, I can kind of understand them. I think it was still a
> > mistake (for the reasons outlined above), but at least it was a
> > comprehensible question to ask.
> >
> > But when it gets to "what should I do about a single-queue version of
> > a MQ scheduler", the question is no longer even remotely sensible. The
> > question should simply NOT EXIST. There is no possible valid reason to
> > ask that kind of crap.
>
> OK, so here's what I'll do:
>
> 1) We'll kill the default scheduler choices. sq blk-mq will default to
> mq-deadline, mq blk-mq will default to "none" (at least for now, until
> the new scheduler is done).

But what about e.g. SATA SSDs? Wouldn't they be better off without any scheduler?

So perhaps setting "none" for queue/rotational==0 and mq-deadline for spinning drives automatically in the sq blk-mq case?
On 02/22/2017 02:50 PM, Markus Trippelsdorf wrote:
> On 2017.02.22 at 11:44 -0700, Jens Axboe wrote:
>> On 02/22/2017 11:42 AM, Linus Torvalds wrote:
>>> On Wed, Feb 22, 2017 at 10:26 AM, Linus Torvalds
>>> <torvalds@linux-foundation.org> wrote:
>>>>
>>>> And dammit, IF YOU DON'T EVEN KNOW, WHY THE HELL ARE YOU ASKING THE POOR USER?
>>>
>>> Basically, I'm pushing back on config options that I can't personally
>>> even sanely answer.
>>
>> I got that much, and I don't disagree on that part.
>>
>>> If it's a config option about "do I have a particular piece of
>>> hardware", it makes sense. But these new ones were just complete
>>> garbage.
>>>
>>> The whole "default IO scheduler" thing is a disease. We should stop
>>> making up these shit schedulers and then say "we don't know which one
>>> works best for you".
>>>
>>> All it does is encourage developers to make shortcuts and create crap
>>> that isn't generically useful, and then blame the user and say "well,
>>> you should have picked a different scheduler" when they say "this does
>>> not work well for me".
>>>
>>> We have had too many of those kinds of broken choices. And when the
>>> new Kconfig options get so confusing and so esoteric that I go "Hmm, I
>>> have no idea if my hardware does a single queue or not", I put my foot
>>> down.
>>>
>>> When the IO scheduler questions were about a generic IO scheduler for
>>> everything, I can kind of understand them. I think it was still a
>>> mistake (for the reasons outlined above), but at least it was a
>>> comprehensible question to ask.
>>>
>>> But when it gets to "what should I do about a single-queue version of
>>> a MQ scheduler", the question is no longer even remotely sensible. The
>>> question should simply NOT EXIST. There is no possible valid reason to
>>> ask that kind of crap.
>>
>> OK, so here's what I'll do:
>>
>> 1) We'll kill the default scheduler choices. sq blk-mq will default to
>> mq-deadline, mq blk-mq will default to "none" (at least for now, until
>> the new scheduler is done).
>
> But what about e.g. SATA SSDs? Wouldn't they be better off without any
> scheduler?

Marginal. If they are single queue, using a basic scheduler like deadline isn't going to be a significant amount of overhead. In some cases they are going to be better off, due to better merging. In the worst case, overhead is slightly higher. Net result is positive, I'd say.

> So perhaps setting "none" for queue/rotational==0 and mq-deadline for
> spinning drives automatically in the sq blk-mq case?

You can do that through a udev rule. The kernel doesn't know if the device is rotational or not when we set up the scheduler. So we'd either have to add code to do that, or simply just do it with a udev rule. I'd prefer the latter.
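For reference, a rule along these lines is the kind of thing Jens is suggesting. This is a hypothetical sketch (file name and matching details assumed; adjust for your distro, and note that the scheduler names only exist on kernels with the corresponding blk-mq schedulers built):

```
# /etc/udev/rules.d/60-io-scheduler.rules (hypothetical example)

# Non-rotational devices (SSDs): no scheduler
ACTION=="add|change", SUBSYSTEM=="block", ATTR{queue/rotational}=="0", ATTR{queue/scheduler}="none"

# Rotational devices (spinning disks): mq-deadline
ACTION=="add|change", SUBSYSTEM=="block", ATTR{queue/rotational}=="1", ATTR{queue/scheduler}="mq-deadline"
```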
On Wed, Feb 22, 2017 at 1:50 PM, Markus Trippelsdorf <markus@trippelsdorf.de> wrote:
>
> But what about e.g. SATA SSDs? Wouldn't they be better off without any
> scheduler?
> So perhaps setting "none" for queue/rotational==0 and mq-deadline for
> spinning drives automatically in the sq blk-mq case?

Jens already said that the merging advantage can outweigh the costs, but he didn't actually talk much about it.

The scheduler advantage can outweigh the costs of running a scheduler by an absolutely _huge_ amount. An SSD isn't zero-cost, and each command tends to have some fixed overhead on the controller, and pretty much all SSD's heavily prefer fewer large requests over lots of tiny ones.

There are also fairness/latency issues that tend to very heavily favor having an actual scheduler, ie reads want to be scheduled before writes on an SSD (within reason) in order to make latency better.

Ten years ago, there were lots of people who argued that you don't want to do scheduling for SSD's, because SSD's were so fast that you only added overhead. Nobody really believes that fairytale any more.

So you might have particular loads that look better with noop, but they will be rare and far between. Try it, by all means, and if it works for you, set it in your udev rules.

The main place where a noop scheduler currently might make sense is likely for a ramdisk, but quite frankly, since the main real usecase for a ram-disk tends to be to make it easy to profile and find the bottlenecks for performance analysis (for emulating future "infinitely fast" media), even that isn't true - using noop there defeats the whole purpose.

                 Linus
diff --git a/block/Kconfig.iosched b/block/Kconfig.iosched
index 0715ce93daef..f6144c5d7c70 100644
--- a/block/Kconfig.iosched
+++ b/block/Kconfig.iosched
@@ -75,7 +75,7 @@ config MQ_IOSCHED_NONE
 
 choice
 	prompt "Default single-queue blk-mq I/O scheduler"
-	default DEFAULT_SQ_NONE
+	default DEFAULT_SQ_DEADLINE if MQ_IOSCHED_DEADLINE=y
 	help
 	  Select the I/O scheduler which will be used by default for blk-mq
 	  managed block devices with a single queue.