Message ID | 20211006164110.10817-1-jack@suse.cz (mailing list archive)
---|---
Series | bfq: Limit number of allocated scheduler tags per cgroup
> Il giorno 6 ott 2021, alle ore 19:31, Jan Kara <jack@suse.cz> ha scritto:
>
> Hello!
>
> Here is the third revision of my patches to fix how bfq weights apply on
> cgroup throughput and on the throughput of processes with different IO
> priorities. Since v2 I've added some more patches so that IO priorities now
> also result in service differentiation (previously they had no effect on
> service differentiation for some workloads, similarly to cgroup weights).
> The last patch in the series still needs some work, as in its current state
> it causes a notable regression (~20-30%) with the dbench benchmark for large
> numbers of clients. I've verified that the last patch is indeed necessary
> for service differentiation with the workload described in its changelog.
> As we discussed with Paolo, I have also found out that if I remove the
> "waker has enough budget" condition from bfq_select_queue(), dbench
> performance is restored and the service differentiation is still good. But
> we probably need some justification or a cleaner solution than just removing
> the condition, so that is still up for discussion. The first seven patches
> already noticeably improve the situation for lots of workloads, though, so
> IMO they stand on their own and can be merged regardless of how we go about
> the last patch.

Hi Jan,
I have just one more (easy-to-resolve) doubt: you seem to have tested
these patches mostly on the throughput side. Did you run a
startup-latency test as well? I can run some for you, if you prefer
so. Just give me a few days.

When that is cleared, your first seven patches are OK for me. Actually
I think they are a very good and relevant contribution. Patch number
eight probably deserves some more analysis, as you pointed out. As I
already told you in our call, we can look at that budget condition
together and figure out the best tests to use to check whether I/O
control does not get lost too much.

Thanks,
Paolo

> Changes since v2:
> * Rebased on top of current Linus' tree
> * Updated computation of scheduler tag proportions to work correctly even
>   for processes within the same cgroup but with different IO priorities
> * Added comment roughly explaining why we limit tag depth
> * Added patch limiting waker / wakee detection in time to avoid at least
>   the most obvious false positives
> * Added patch to log waker / wakee detections in blktrace for better
>   debugging
> * Added patch to properly account injected IO
>
> Changes since v1:
> * Fixed computation of the appropriate proportion of scheduler tags for a
>   cgroup to work with deeper cgroup hierarchies.
>
> Original cover letter:
>
> I was looking into why cgroup weights do not have any measurable impact on
> writeback throughput from different cgroups. This is actually a regression
> from CFQ, where things work more or less OK and weights have roughly the
> impact they should. The problem can be reproduced e.g. by running the
> following simple fio job in two cgroups with different weights:
>
> [writer]
> directory=/mnt/repro/
> numjobs=1
> rw=write
> size=8g
> time_based
> runtime=30
> ramp_time=10
> blocksize=1m
> direct=0
> ioengine=sync
>
> I can observe that there's no significant difference in the amount of data
> written from the different cgroups despite their weights being in, say, a
> 1:3 ratio.
>
> After some debugging I've understood the dynamics of the system. There are
> two issues:
>
> 1) The number of scheduler tags needs to be significantly larger than the
> number of device tags. Otherwise there are not enough requests waiting in
> BFQ to be dispatched to the device and thus there's nothing to schedule on.
>
> 2) Even with enough scheduler tags, writers from the two cgroups eventually
> start contending on scheduler tag allocation. Tags are served on a first
> come, first served basis, so writers from both cgroups feed requests into
> bfq at approximately the same speed. Since bfq prefers IO from the heavier
> cgroup, that IO is submitted and completed faster, and eventually we end up
> in a situation where there's no IO from the heavier cgroup in bfq and all
> scheduler tags are consumed by requests from the lighter cgroup. At that
> point bfq just dispatches lots of the IO from the lighter cgroup since
> there's no contender for disk throughput. As a result the observed
> throughput for both cgroups is the same.
>
> This series fixes the problem by accounting how many scheduler tags are
> allocated for each cgroup, and if a cgroup has more tags allocated than its
> fair share (based on weights) in its service tree, we heavily limit the
> scheduler tag bitmap depth for it so that it is not able to starve other
> cgroups of scheduler tags.
>
> What do people think about this?
>
> Honza
>
> Previous versions:
> Link: http://lore.kernel.org/r/20210712171146.12231-1-jack@suse.cz # v1
> Link: http://lore.kernel.org/r/20210715132047.20874-1-jack@suse.cz # v2
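[ The weight-based fair-share idea described in the cover letter can be
modeled in a few lines of userspace C. The sketch below is only an
illustration under stated assumptions: the names (fake_cgroup,
fair_tag_depth, tags_allocated) and the 256-tag depth are hypothetical
and do not correspond to the actual patches, which adjust BFQ's
scheduler-tag (sbitmap) depth inside the kernel. ]

/* Simplified userspace model of the per-cgroup scheduler-tag limiting
 * idea from the cover letter. All names here are made up for the
 * example; the real series works on BFQ's internal structures.
 */
#include <stdio.h>

struct fake_cgroup {
	const char *name;
	unsigned int weight;		/* cgroup IO weight */
	unsigned int tags_allocated;	/* scheduler tags currently held */
};

/*
 * A cgroup's fair share of the scheduler tag space: the total depth
 * scaled by the cgroup's fraction of the total weight. A cgroup holding
 * more tags than this share would get its allocation depth sharply
 * limited so it cannot starve the other cgroups of scheduler tags.
 */
static unsigned int fair_tag_depth(const struct fake_cgroup *cg,
				   unsigned int total_weight,
				   unsigned int total_depth)
{
	return total_depth * cg->weight / total_weight;
}

int main(void)
{
	struct fake_cgroup cgs[] = {
		{ "light", 100, 200 },	/* holds far more than its share */
		{ "heavy", 300,  56 },
	};
	unsigned int total_depth = 256;	/* assumed scheduler tag depth */
	unsigned int total_weight = cgs[0].weight + cgs[1].weight;

	for (int i = 0; i < 2; i++) {
		unsigned int share = fair_tag_depth(&cgs[i], total_weight,
						    total_depth);
		printf("%s: fair share %u tags, holds %u -> %s\n",
		       cgs[i].name, share, cgs[i].tags_allocated,
		       cgs[i].tags_allocated > share ?
		       "limit tag allocation" : "no limit needed");
	}
	return 0;
}

[ With the 1:3 weights from the fio reproducer and 256 scheduler tags,
the lighter cgroup's fair share works out to 64 tags, so once it holds
more than that it gets its tag allocation throttled instead of
monopolizing the tag space the way issue 2) above describes. ]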
> Il giorno 7 ott 2021, alle ore 18:33, Paolo Valente <paolo.valente@linaro.org> ha scritto:
>
>> Il giorno 6 ott 2021, alle ore 19:31, Jan Kara <jack@suse.cz> ha scritto:
>>
>> [...]
>
> Hi Jan,
> I have just one more (easy-to-resolve) doubt: you seem to have tested
> these patches mostly on the throughput side. Did you run a
> startup-latency test as well? I can run some for you, if you prefer
> so. Just give me a few days.
>

We are finally testing your patches a little bit right now, for
regressions with our typical benchmarks ...

Paolo

> When that is cleared, your first seven patches are OK for me. Actually
> I think they are a very good and relevant contribution. Patch number
> eight probably deserves some more analysis, as you pointed out. As I
> already told you in our call, we can look at that budget condition
> together and figure out the best tests to use to check whether I/O
> control does not get lost too much.
>
> Thanks,
> Paolo
>
>> [...]
On Mon 25-10-21 09:58:11, Paolo Valente wrote:
> > Il giorno 7 ott 2021, alle ore 18:33, Paolo Valente <paolo.valente@linaro.org> ha scritto:
> >> Il giorno 6 ott 2021, alle ore 19:31, Jan Kara <jack@suse.cz> ha scritto:
> >>
> >> [...]
> >
> > Hi Jan,
> > I have just one more (easy-to-resolve) doubt: you seem to have tested
> > these patches mostly on the throughput side. Did you run a
> > startup-latency test as well? I can run some for you, if you prefer
> > so. Just give me a few days.
> >
>
> We are finally testing your patches a little bit right now, for
> regressions with our typical benchmarks ...

Hum, strange, I didn't get your previous email about the benchmarks. You're
right, I didn't run a startup-latency test AFAIR. Now that you've started
them, there's probably no big point in me queuing them as well. So thanks
for the benchmarking :)

Honza
> Il giorno 25 ott 2021, alle ore 13:14, Jan Kara <jack@suse.cz> ha scritto:
>
> On Mon 25-10-21 09:58:11, Paolo Valente wrote:
>> [...]
>>
>> We are finally testing your patches a little bit right now, for
>> regressions with our typical benchmarks ...
>
> Hum, strange, I didn't get your previous email about the benchmarks. You're
> right, I didn't run a startup-latency test AFAIR. Now that you've started
> them, there's probably no big point in me queuing them as well. So thanks
> for the benchmarking :)
>

Hi Jan,
we have run a lot of benchmarks, on various hardware. No regressions
found! So, thank you very much for this important contribution, and:

Acked-by: Paolo Valente <paolo.valente@linaro.org>

> Honza
> --
> Jan Kara <jack@suse.com>
> SUSE Labs, CR