Message ID | 20240123183332.876854-3-shikemeng@huaweicloud.com (mailing list archive) |
---|---|
State | New |
Headers | show |
Series | Fix and cleanups to page-writeback | expand |
On Wed, Jan 24, 2024 at 02:33:29AM +0800, Kemeng Shi wrote: > The wb_calc_thresh will calculate wb's share in global wb domain. We need > to wb's share in mem_cgroup_wb_domain for mdtc. Call __wb_calc_thresh > instead of wb_calc_thresh to fix this. That function is calculating the wb's portion of wb portion in the whole system so that threshold can be distributed accordingly. So, it has to be compared in the global domain. If you look at the comment on top of struct wb_domain, it says: /* * A wb_domain represents a domain that wb's (bdi_writeback's) belong to * and are measured against each other in. There always is one global * domain, global_wb_domain, that every wb in the system is a member of. * This allows measuring the relative bandwidth of each wb to distribute * dirtyable memory accordingly. */ Also, how is this tested? Was there a case where the existing code misbehaved that's improved by this patch? Or is this just from reading code? Thanks.
on 1/24/2024 4:43 AM, Tejun Heo wrote: > On Wed, Jan 24, 2024 at 02:33:29AM +0800, Kemeng Shi wrote: >> The wb_calc_thresh will calculate wb's share in global wb domain. We need >> to wb's share in mem_cgroup_wb_domain for mdtc. Call __wb_calc_thresh >> instead of wb_calc_thresh to fix this. > > That function is calculating the wb's portion of wb portion in the whole > system so that threshold can be distributed accordingly. So, it has to be > compared in the global domain. If you look at the comment on top of struct > wb_domain, it says: > > /* > * A wb_domain represents a domain that wb's (bdi_writeback's) belong to > * and are measured against each other in. There always is one global > * domain, global_wb_domain, that every wb in the system is a member of. > * This allows measuring the relative bandwidth of each wb to distribute > * dirtyable memory accordingly. > */ > Hi Tejun, thanks for reply. For cgroup wb, it will belongs to a global wb domain and a cgroup domain. I agree the way how we calculate wb's threshold in global domain as you described above. This patch tries to fix calculation of wb's threshold in cgroup domain which now is wb_calc_thresh(mdtc->wb, mdtc->bg_thresh)), means: (wb bandwidth) / (system bandwidth) * (*cgroup domain threshold*) The cgroup domain threshold is (memory of cgroup domain) / (memory of system) * (system threshold). Then the wb's threshold in cgroup will be smaller than expected. Consider following domain hierarchy: global domain (100G) / \ cgroup domain1(50G) cgroup domain2(50G) | | bdi wb1 wb2 Assume wb1 and wb2 has the same bandwidth. We have global domain bg_thresh 10G, cgroup domain bg_thresh 5G. Then we have: wb's thresh in global domain = 10G * (wb bandwidth) / (system bandwidth) = 10G * 1/2 = 5G wb's thresh in cgroup domain = 5G * (wb bandwidth) / (system bandwidth) = 5G * 1/2 = 2.5G At last, wb1 and wb2 will be limited at 2.5G, the system will be limited at 5G which is less than global domain bg_thresh 10G. After the fix, threshold in cgroup domain will be: (wb bandwidth) / (cgroup bandwidth) * (cgroup domain threshold) The wb1 and wb2 will be limited at 5G, the system will be limited at 10G which equals to global domain bg_thresh 10G. As I didn't take a deep look into memory cgroup, please correct me if anything is wrong. Thanks! > Also, how is this tested? Was there a case where the existing code > misbehaved that's improved by this patch? Or is this just from reading code? This is jut from reading code. Would the case showed above convince you a bit. Look forward to your reply, thanks!. > > Thanks. >
Hello, On Wed, Jan 24, 2024 at 10:01:47AM +0800, Kemeng Shi wrote: > Hi Tejun, thanks for reply. For cgroup wb, it will belongs to a global wb > domain and a cgroup domain. I agree the way how we calculate wb's threshold > in global domain as you described above. This patch tries to fix calculation > of wb's threshold in cgroup domain which now is wb_calc_thresh(mdtc->wb, > mdtc->bg_thresh)), means: > (wb bandwidth) / (system bandwidth) * (*cgroup domain threshold*) > The cgroup domain threshold is > (memory of cgroup domain) / (memory of system) * (system threshold). > Then the wb's threshold in cgroup will be smaller than expected. > > Consider following domain hierarchy: > global domain (100G) > / \ > cgroup domain1(50G) cgroup domain2(50G) > | | > bdi wb1 wb2 > Assume wb1 and wb2 has the same bandwidth. > We have global domain bg_thresh 10G, cgroup domain bg_thresh 5G. > Then we have: > wb's thresh in global domain = 10G * (wb bandwidth) / (system bandwidth) > = 10G * 1/2 = 5G > wb's thresh in cgroup domain = 5G * (wb bandwidth) / (system bandwidth) > = 5G * 1/2 = 2.5G > At last, wb1 and wb2 will be limited at 2.5G, the system will be limited > at 5G which is less than global domain bg_thresh 10G. > > After the fix, threshold in cgroup domain will be: > (wb bandwidth) / (cgroup bandwidth) * (cgroup domain threshold) > The wb1 and wb2 will be limited at 5G, the system will be limited at > 10G which equals to global domain bg_thresh 10G. > > As I didn't take a deep look into memory cgroup, please correct me if > anything is wrong. Thanks! > > Also, how is this tested? Was there a case where the existing code > > misbehaved that's improved by this patch? Or is this just from reading code? > > This is jut from reading code. Would the case showed above convince you > a bit. Look forward to your reply, thanks!. So, the explanation makes some sense to me but can you please construct a case to actually demonstrate the problem and fix? I don't think it'd be wise to apply the change without actually observing the code change does what it says it does. Thanks.
on 1/30/2024 5:00 AM, Tejun Heo wrote: > Hello, > > On Wed, Jan 24, 2024 at 10:01:47AM +0800, Kemeng Shi wrote: >> Hi Tejun, thanks for reply. For cgroup wb, it will belongs to a global wb >> domain and a cgroup domain. I agree the way how we calculate wb's threshold >> in global domain as you described above. This patch tries to fix calculation >> of wb's threshold in cgroup domain which now is wb_calc_thresh(mdtc->wb, >> mdtc->bg_thresh)), means: >> (wb bandwidth) / (system bandwidth) * (*cgroup domain threshold*) >> The cgroup domain threshold is >> (memory of cgroup domain) / (memory of system) * (system threshold). >> Then the wb's threshold in cgroup will be smaller than expected. >> >> Consider following domain hierarchy: >> global domain (100G) >> / \ >> cgroup domain1(50G) cgroup domain2(50G) >> | | >> bdi wb1 wb2 >> Assume wb1 and wb2 has the same bandwidth. >> We have global domain bg_thresh 10G, cgroup domain bg_thresh 5G. >> Then we have: >> wb's thresh in global domain = 10G * (wb bandwidth) / (system bandwidth) >> = 10G * 1/2 = 5G >> wb's thresh in cgroup domain = 5G * (wb bandwidth) / (system bandwidth) >> = 5G * 1/2 = 2.5G >> At last, wb1 and wb2 will be limited at 2.5G, the system will be limited >> at 5G which is less than global domain bg_thresh 10G. >> >> After the fix, threshold in cgroup domain will be: >> (wb bandwidth) / (cgroup bandwidth) * (cgroup domain threshold) >> The wb1 and wb2 will be limited at 5G, the system will be limited at >> 10G which equals to global domain bg_thresh 10G. >> >> As I didn't take a deep look into memory cgroup, please correct me if >> anything is wrong. Thanks! >>> Also, how is this tested? Was there a case where the existing code >>> misbehaved that's improved by this patch? Or is this just from reading code? >> >> This is jut from reading code. Would the case showed above convince you >> a bit. Look forward to your reply, thanks!. > > So, the explanation makes some sense to me but can you please construct a > case to actually demonstrate the problem and fix? I don't think it'd be wise > to apply the change without actually observing the code change does what it > says it does. Hi Tejun, sorry for the delay as I found there is a issue that keep triggering writeback even the dirty page is under dirty background threshold. The issue make it difficult to observe the expected improvment from this patch. I try to fix it in [1] and test this patch based on the fix patches. Run test as following: /* make background writeback easier to observe */ echo 300000 > /proc/sys/vm/dirty_expire_centisecs echo 100 > /proc/sys/vm/dirty_writeback_centisecs /* enable memory and io cgroup */ echo "+memory +io" > /sys/fs/cgroup/cgroup.subtree_control /* run fio in group1 with shell */ cd /sys/fs/cgroup mkdir group1 cd group1 echo 10G > memory.high echo 10G > memory.max echo $$ > cgroup.procs mkfs.ext4 -F /dev/vdb mount /dev/vdb /bdi1/ fio -name test -filename=/bdi1/file -size=800M -ioengine=libaio -bs=4K -iodepth=1 -rw=write -direct=0 --time_based -runtime=60 -invalidate=0 /* run another fio in group2 with another shell */ cd /sys/fs/cgroup mkdir group2 cd group2 echo 10G > memory.high echo 10G > memory.max echo $$ > cgroup.procs mkfs.ext4 -F /dev/vdc mount /dev/vdc /bdi2/ fio -name test -filename=/bdi2/file -size=800M -ioengine=libaio -bs=4K -iodepth=1 -rw=write -direct=0 --time_based -runtime=60 -invalidate=0 Before the fix we got (result of three tests): fio1 WRITE: bw=1304MiB/s (1367MB/s), 1304MiB/s-1304MiB/s (1367MB/s-1367MB/s), io=76.4GiB (82.0GB), run=60001-60001msec WRITE: bw=1351MiB/s (1417MB/s), 1351MiB/s-1351MiB/s (1417MB/s-1417MB/s), io=79.2GiB (85.0GB), run=60001-60001msec WRITE: bw=1373MiB/s (1440MB/s), 1373MiB/s-1373MiB/s (1440MB/s-1440MB/s), io=80.5GiB (86.4GB), run=60001-60001msec fio2 WRITE: bw=1134MiB/s (1190MB/s), 1134MiB/s-1134MiB/s (1190MB/s-1190MB/s), io=66.5GiB (71.4GB), run=60001-60001msec WRITE: bw=1414MiB/s (1483MB/s), 1414MiB/s-1414MiB/s (1483MB/s-1483MB/s), io=82.8GiB (88.0GB), run=60001-60001msec WRITE: bw=1469MiB/s (1540MB/s), 1469MiB/s-1469MiB/s (1540MB/s-1540MB/s), io=86.0GiB (92.4GB), run=60001-60001msec After the fix we got (result of three tests): fio1 WRITE: bw=1719MiB/s (1802MB/s), 1719MiB/s-1719MiB/s (1802MB/s-1802MB/s), io=101GiB (108GB), run=60001-60001msec WRITE: bw=1723MiB/s (1806MB/s), 1723MiB/s-1723MiB/s (1806MB/s-1806MB/s), io=101GiB (108GB), run=60001-60001msec WRITE: bw=1691MiB/s (1774MB/s), 1691MiB/s-1691MiB/s (1774MB/s-1774MB/s), io=99.2GiB (106GB), run=60036-60036msec fio2 WRITE: bw=1692MiB/s (1774MB/s), 1692MiB/s-1692MiB/s (1774MB/s-1774MB/s), io=99.1GiB (106GB), run=60001-60001msec WRITE: bw=1681MiB/s (1763MB/s), 1681MiB/s-1681MiB/s (1763MB/s-1763MB/s), io=98.5GiB (106GB), run=60001-60001msec WRITE: bw=1671MiB/s (1752MB/s), 1671MiB/s-1671MiB/s (1752MB/s-1752MB/s), io=97.9GiB (105GB), run=60001-60001msec I also add code to print the pages written in writeback and pages written in writeback reduce a lot and are rare after this fix. [1] https://lore.kernel.org/linux-fsdevel/20240208172024.23625-2-shikemeng@huaweicloud.com/T/#u > > Thanks. >
Hello, Kemeng. On Thu, Feb 08, 2024 at 05:26:10PM +0800, Kemeng Shi wrote: > Hi Tejun, sorry for the delay as I found there is a issue that keep triggering > writeback even the dirty page is under dirty background threshold. The issue > make it difficult to observe the expected improvment from this patch. I try to > fix it in [1] and test this patch based on the fix patches. > Run test as following: Ah, that looks promising and thanks a lot for looking into this. It's great to have someone actually poring over the code and behavior. Understanding the wb and cgroup wb behaviors have always been challenging because the only thing we have is the tracepoints and it's really tedious and difficult to build an overall understanding from the trace outputs. Can I persuade you into writing a drgn monitoring script similar to e.g. tools/workqueues/wq_monitor.py? I think there's a pretty good chance the visibility can be improved substantially. Thanks.
on 2/9/2024 3:32 AM, Tejun Heo wrote: > Hello, Kemeng. > > On Thu, Feb 08, 2024 at 05:26:10PM +0800, Kemeng Shi wrote: >> Hi Tejun, sorry for the delay as I found there is a issue that keep triggering >> writeback even the dirty page is under dirty background threshold. The issue >> make it difficult to observe the expected improvment from this patch. I try to >> fix it in [1] and test this patch based on the fix patches. >> Run test as following: > > Ah, that looks promising and thanks a lot for looking into this. It's great > to have someone actually poring over the code and behavior. Understanding > the wb and cgroup wb behaviors have always been challenging because the only > thing we have is the tracepoints and it's really tedious and difficult to > build an overall understanding from the trace outputs. Can I persuade you > into writing a drgn monitoring script similar to e.g. > tools/workqueues/wq_monitor.py? I think there's a pretty good chance the > visibility can be improved substantially. Hi Tejun, sorry for the late reply as I was on vacation these days. I agree that visibility is poor so I have to add some printks to debug. Actually, I have added per wb stats to improve the visibility as we only have per bdi stats (/sys/kernel/debug/bdi/xxx:x/stats) now. And I plan to submit it in a new series. I'd like to add a script to improve visibility more but I can't ensure the time to do it. I would submit the monitoring script with the per wb stats if the time problem does not bother you. Thanks. > > Thanks. >
Hello, On Sun, Feb 18, 2024 at 10:35:41AM +0800, Kemeng Shi wrote: > Hi Tejun, sorry for the late reply as I was on vacation these days. > I agree that visibility is poor so I have to add some printks to debug. > Actually, I have added per wb stats to improve the visibility as we > only have per bdi stats (/sys/kernel/debug/bdi/xxx:x/stats) now. And I > plan to submit it in a new series. > I'd like to add a script to improve visibility more but I can't ensure > the time to do it. I would submit the monitoring script with the per wb > stats if the time problem does not bother you. It has had poor visiblity for many many years, I don't think we're in any hurry. Thanks.
diff --git a/mm/page-writeback.c b/mm/page-writeback.c index 9268859722c4..f6c7f3b0f495 100644 --- a/mm/page-writeback.c +++ b/mm/page-writeback.c @@ -2118,7 +2118,7 @@ bool wb_over_bg_thresh(struct bdi_writeback *wb) if (mdtc->dirty > mdtc->bg_thresh) return true; - thresh = wb_calc_thresh(mdtc->wb, mdtc->bg_thresh); + thresh = __wb_calc_thresh(mdtc, mdtc->bg_thresh); if (thresh < 2 * wb_stat_error()) reclaimable = wb_stat_sum(wb, WB_RECLAIMABLE); else
The wb_calc_thresh will calculate wb's share in global wb domain. We need to wb's share in mem_cgroup_wb_domain for mdtc. Call __wb_calc_thresh instead of wb_calc_thresh to fix this. Fixes: 74d369443325 ("writeback: Fix performance regression in wb_over_bg_thresh()") Signed-off-by: Kemeng Shi <shikemeng@huaweicloud.com> --- mm/page-writeback.c | 2 +- 1 file changed, 1 insertion(+), 1 deletion(-)