Message ID: 20211222114111.2206248-4-bigeasy@linutronix.de (mailing list archive)
State: New
Series: mm/memcg: Address PREEMPT_RT problems instead of disabling it.
On 12/22/21 06:41, Sebastian Andrzej Siewior wrote:
> Based on my understanding the optimisation with task_obj for in_task()
> makes sense on non-PREEMPTIBLE kernels because preempt_disable()/enable()
> is optimized away. This could then be restricted to !CONFIG_PREEMPTION
> kernels instead of only PREEMPT_RT.
> With CONFIG_PREEMPT_DYNAMIC a non-PREEMPTIBLE kernel can also be
> configured, but these kernels always have preempt_disable()/enable()
> present, so the optimisation probably makes no sense there.
>
> Restrict the optimisation to !CONFIG_PREEMPTION kernels.
>
> Signed-off-by: Sebastian Andrzej Siewior <bigeasy@linutronix.de>

If PREEMPT_DYNAMIC is selected, PREEMPTION will also be set. My
understanding is that some distros are going to use PREEMPT_DYNAMIC, but
default to PREEMPT_VOLUNTARY. So I don't believe it is a good idea to
disable the optimization based on PREEMPTION alone.

Regards,
Longman
On 2021-12-23 16:48:41 [-0500], Waiman Long wrote:
> On 12/22/21 06:41, Sebastian Andrzej Siewior wrote:
> > Based on my understanding the optimisation with task_obj for in_task()
> > makes sense on non-PREEMPTIBLE kernels because preempt_disable()/enable()
> > is optimized away. This could then be restricted to !CONFIG_PREEMPTION
> > kernels instead of only PREEMPT_RT.
> > With CONFIG_PREEMPT_DYNAMIC a non-PREEMPTIBLE kernel can also be
> > configured, but these kernels always have preempt_disable()/enable()
> > present, so the optimisation probably makes no sense there.
> >
> > Restrict the optimisation to !CONFIG_PREEMPTION kernels.
> >
> > Signed-off-by: Sebastian Andrzej Siewior <bigeasy@linutronix.de>
>
> If PREEMPT_DYNAMIC is selected, PREEMPTION will also be set. My
> understanding is that some distros are going to use PREEMPT_DYNAMIC, but
> default to PREEMPT_VOLUNTARY. So I don't believe it is a good idea to
> disable the optimization based on PREEMPTION alone.

So is there a benefit to this even if preempt_disable() is not optimized
away? My understanding was that this depends on preempt_disable() being
optimized away.
Is there something you recommend as a benchmark where I could get some
numbers?

> Regards,
> Longman

Sebastian
On 1/3/22 09:44, Sebastian Andrzej Siewior wrote:
> On 2021-12-23 16:48:41 [-0500], Waiman Long wrote:
>> On 12/22/21 06:41, Sebastian Andrzej Siewior wrote:
>>> Based on my understanding the optimisation with task_obj for in_task()
>>> makes sense on non-PREEMPTIBLE kernels because preempt_disable()/enable()
>>> is optimized away. This could then be restricted to !CONFIG_PREEMPTION
>>> kernels instead of only PREEMPT_RT.
>>> With CONFIG_PREEMPT_DYNAMIC a non-PREEMPTIBLE kernel can also be
>>> configured, but these kernels always have preempt_disable()/enable()
>>> present, so the optimisation probably makes no sense there.
>>>
>>> Restrict the optimisation to !CONFIG_PREEMPTION kernels.
>>>
>>> Signed-off-by: Sebastian Andrzej Siewior <bigeasy@linutronix.de>
>> If PREEMPT_DYNAMIC is selected, PREEMPTION will also be set. My
>> understanding is that some distros are going to use PREEMPT_DYNAMIC, but
>> default to PREEMPT_VOLUNTARY. So I don't believe it is a good idea to
>> disable the optimization based on PREEMPTION alone.
> So is there a benefit to this even if preempt_disable() is not optimized
> away? My understanding was that this depends on preempt_disable() being
> optimized away.
> Is there something you recommend as a benchmark where I could get some
> numbers?

In the case of PREEMPT_DYNAMIC, it depends on the default setting which is
used by most users. I will support disabling the optimization if
defined(CONFIG_PREEMPT_RT) || defined(CONFIG_PREEMPT), just not by
CONFIG_PREEMPTION alone.

As for a microbenchmark, something that makes a lot of calls to malloc() or
related allocations can be used.

Cheers,
Longman
On 2022-01-03 10:04:29 [-0500], Waiman Long wrote:
> On 1/3/22 09:44, Sebastian Andrzej Siewior wrote:
> > Is there something you recommend as a benchmark where I could get some
> > numbers?
>
> In the case of PREEMPT_DYNAMIC, it depends on the default setting which is
> used by most users. I will support disabling the optimization if
> defined(CONFIG_PREEMPT_RT) || defined(CONFIG_PREEMPT), just not by
> CONFIG_PREEMPTION alone.
>
> As for a microbenchmark, something that makes a lot of calls to malloc() or
> related allocations can be used.

Numbers I made:

             Sandy Bridge    Haswell        Skylake         AMD-A8 7100    Zen2           ARM64
PREEMPT      5,123,896,822   5,215,055,226  5,077,611,590   6,012,287,874  6,234,674,489  20,000,000,100
IRQ          7,494,119,638   6,810,367,629  10,620,130,377  4,178,546,086  4,898,076,012  13,538,461,925

For micro-benchmarking I did 1,000,000,000 iterations of
preempt_disable()/enable() [PREEMPT] and local_irq_save()/restore()
[IRQ].
On a Sandy Bridge the PREEMPT loop took 5,123,896,822ns while the IRQ
loop took 7,494,119,638ns. The absolute numbers are not important; it is
worth noting that preemption off/on is less expensive than IRQ off/on,
except on AMD and ARM64 where IRQ off/on was less expensive. The whole
loop was performed with interrupts disabled, so I don't expect much
interference - but then I don't know how much the µArch optimized away on
local_irq_restore() given that interrupts were already disabled.
I don't have any recent Intel HW (I think), so I don't know whether this
is an Intel-only thing (IRQ off/on cheaper than preemption off/on), but I
guess that recent µArchs would behave similarly to AMD.

Moving on: for the test I ran 100,000,000 iterations of
	kfree(kmalloc(128, GFP_ATOMIC | __GFP_ACCOUNT));

The BH suffix means that in_task() reported false during the allocation,
otherwise it reported true.
SD is the standard deviation.
SERVER means PREEMPT_NONE while PREEMPT means CONFIG_PREEMPT.
OPT means the optimisation (in_task() + task_obj) is active, NO-OPT
means no optimisation (irq_obj is always used).
The numbers are the time in ns needed for 100,000,000 iterations (alloc +
free). I ran 5 tests and used the median value here. If the standard
deviation exceeded 10^9 then I repeated the test. The values remained in
the same range since usually only one value was off and the others
remained in the same range.

Sandy Bridge
               SERVER OPT     SERVER NO-OPT  PREEMPT OPT     PREEMPT NO-OPT
ALLOC/FREE     8,519,295,176  9,051,200,652  10,627,431,395  11,198,189,843
SD             5,309,768      29,253,976     129,102,317     40,681,909
ALLOC/FREE BH  9,996,704,330  8,927,026,031  11,680,149,900  11,139,356,465
SD             38,237,534     72,913,120     23,626,932      116,413,331

The optimisation is visible in the SERVER-OPT case (~1.5s difference in
the runtime, or 14.7ns per iteration). There is hardly any difference
between BH and !BH in the SERVER-NO-OPT case.
For the SERVER case, the optimisation improves the runtime by ~0.5s in
the !BH case.
For the PREEMPT case it also looks like a ~0.5s improvement in the !BH
case, while in the BH case it looks the other way around.

               DYN-SRV-OPT     DYN-SRV-NO-OPT  DYN-FULL-OPT    DYN-FULL-NO-OPT
ALLOC/FREE     11,069,180,584  10,773,407,543  10,963,581,285  10,826,207,969
SD             23,195,912      112,763,104     13,145,589      33,543,625
ALLOC/FREE BH  11,443,342,069  10,720,094,700  11,064,914,727  10,955,883,521
SD             81,150,074      171,299,554     58,603,778      84,131,143

DYN is CONFIG_PREEMPT_DYNAMIC enabled with CONFIG_PREEMPT_NONE as the
default. I don't see any difference vs CONFIG_PREEMPT except the default
preemption state (so I didn't test that). The preemption counter is
always forced in, so preempt_enable()/disable() is not optimized away.
SRV is the default state (PREEMPT_NONE) and FULL is the overridden
(preempt=full) state.

Based on that, I don't see any added value from the optimisation once
PREEMPT_DYNAMIC is enabled.
----
Zen2:
               SERVER OPT     SERVER NO-OPT  PREEMPT OPT     PREEMPT NO-OPT
ALLOC/FREE     8,126,735,313  8,751,307,383  9,822,927,142   10,045,105,425
SD             100,806,471    87,234,047     55,170,179      25,832,386
ALLOC/FREE BH  9,197,455,885  8,394,337,053  10,671,227,095  9,904,954,934
SD             155,223,919    57,800,997     47,529,496      105,260,566

On Zen2, IRQ off/on was less expensive than preempt off/on, so it looks
like I mixed up the numbers for PREEMPT OPT and NO-OPT, but I re-ran it
twice and nothing significant changed… However, the difference on PREEMPT
for the !BH case is not as significant as on Sandy Bridge (~200ms here vs
~500ms there).

               DYN-SRV-OPT     DYN-SRV-NO-OPT  DYN-FULL-OPT    DYN-FULL-NO-OPT
ALLOC/FREE     9,680,498,929   10,180,973,847  9,644,453,405   10,224,416,854
SD             73,944,156      61,850,527      13,277,203      107,145,212
ALLOC/FREE BH  10,680,074,634  9,956,695,323   10,704,442,515  9,942,155,910
SD             75,535,172      34,524,493      54,625,678      87,163,920

For double testing and checking, the full git tree is available at [0]
and the script to parse the results is at [1].

[0] git://git.kernel.org/pub/scm/linux/kernel/git/bigeasy/staging memcg
[1] https://breakpoint.cc/parse-memcg.py

> Cheers,
> Longman

Sebastian
On 1/5/22 15:22, Sebastian Andrzej Siewior wrote:
> On 2022-01-03 10:04:29 [-0500], Waiman Long wrote:
>> On 1/3/22 09:44, Sebastian Andrzej Siewior wrote:
>>> Is there something you recommend as a benchmark where I could get some
>>> numbers?
>> In the case of PREEMPT_DYNAMIC, it depends on the default setting which is
>> used by most users. I will support disabling the optimization if
>> defined(CONFIG_PREEMPT_RT) || defined(CONFIG_PREEMPT), just not by
>> CONFIG_PREEMPTION alone.
>>
>> As for a microbenchmark, something that makes a lot of calls to malloc() or
>> related allocations can be used.
> Numbers I made:
>
>              Sandy Bridge    Haswell        Skylake         AMD-A8 7100    Zen2           ARM64
> PREEMPT      5,123,896,822   5,215,055,226  5,077,611,590   6,012,287,874  6,234,674,489  20,000,000,100
> IRQ          7,494,119,638   6,810,367,629  10,620,130,377  4,178,546,086  4,898,076,012  13,538,461,925

Thanks for the extensive testing. I usually perform my performance tests
on Intel hardware. I didn't realize that Zen2 and arm64 perform better
with irq on/off.

>
> For micro-benchmarking I did 1,000,000,000 iterations of
> preempt_disable()/enable() [PREEMPT] and local_irq_save()/restore()
> [IRQ].
> On a Sandy Bridge the PREEMPT loop took 5,123,896,822ns while the IRQ
> loop took 7,494,119,638ns. The absolute numbers are not important; it is
> worth noting that preemption off/on is less expensive than IRQ off/on,
> except on AMD and ARM64 where IRQ off/on was less expensive. The whole
> loop was performed with interrupts disabled, so I don't expect much
> interference - but then I don't know how much the µArch optimized away on
> local_irq_restore() given that interrupts were already disabled.
> I don't have any recent Intel HW (I think), so I don't know whether this
> is an Intel-only thing (IRQ off/on cheaper than preemption off/on), but I
> guess that recent µArchs would behave similarly to AMD.
>
> Moving on: for the test I ran 100,000,000 iterations of
> 	kfree(kmalloc(128, GFP_ATOMIC | __GFP_ACCOUNT));
>
> The BH suffix means that in_task() reported false during the allocation,
> otherwise it reported true.
> SD is the standard deviation.
> SERVER means PREEMPT_NONE while PREEMPT means CONFIG_PREEMPT.
> OPT means the optimisation (in_task() + task_obj) is active, NO-OPT
> means no optimisation (irq_obj is always used).
> The numbers are the time in ns needed for 100,000,000 iterations (alloc +
> free). I ran 5 tests and used the median value here. If the standard
> deviation exceeded 10^9 then I repeated the test. The values remained in
> the same range since usually only one value was off and the others
> remained in the same range.
>
> Sandy Bridge
>                SERVER OPT     SERVER NO-OPT  PREEMPT OPT     PREEMPT NO-OPT
> ALLOC/FREE     8,519,295,176  9,051,200,652  10,627,431,395  11,198,189,843
> SD             5,309,768      29,253,976     129,102,317     40,681,909
> ALLOC/FREE BH  9,996,704,330  8,927,026,031  11,680,149,900  11,139,356,465
> SD             38,237,534     72,913,120     23,626,932      116,413,331

My own testing, tracking the number of times in_task() is true or false,
indicated that most kmalloc() calls are done by tasks. Only a few percent
of the time is in_task() false. That is the reason why I optimized the
case where in_task() is true.

> The optimisation is visible in the SERVER-OPT case (~1.5s difference in
> the runtime, or 14.7ns per iteration). There is hardly any difference
> between BH and !BH in the SERVER-NO-OPT case.
> For the SERVER case, the optimisation improves the runtime by ~0.5s in
> the !BH case.
> For the PREEMPT case it also looks like a ~0.5s improvement in the !BH
> case, while in the BH case it looks the other way around.
>
>                DYN-SRV-OPT     DYN-SRV-NO-OPT  DYN-FULL-OPT    DYN-FULL-NO-OPT
> ALLOC/FREE     11,069,180,584  10,773,407,543  10,963,581,285  10,826,207,969
> SD             23,195,912      112,763,104     13,145,589      33,543,625
> ALLOC/FREE BH  11,443,342,069  10,720,094,700  11,064,914,727  10,955,883,521
> SD             81,150,074      171,299,554     58,603,778      84,131,143
>
> DYN is CONFIG_PREEMPT_DYNAMIC enabled with CONFIG_PREEMPT_NONE as the
> default. I don't see any difference vs CONFIG_PREEMPT except the default
> preemption state (so I didn't test that). The preemption counter is
> always forced in, so preempt_enable()/disable() is not optimized away.
> SRV is the default state (PREEMPT_NONE) and FULL is the overridden
> (preempt=full) state.
>
> Based on that, I don't see any added value from the optimisation once
> PREEMPT_DYNAMIC is enabled.

The PREEMPT_DYNAMIC result is a bit surprising to me. Given the data
points, I am not going to object to this patch then. I will try to look
further into why this is the case when I have time.

Cheers,
Longman
On 2022-01-05 22:28:10 [-0500], Waiman Long wrote:
> Thanks for the extensive testing. I usually perform my performance tests
> on Intel hardware. I didn't realize that Zen2 and arm64 perform better
> with irq on/off.

Maybe you have access to a more recent µArch from Intel which could
behave differently.

> My own testing, tracking the number of times in_task() is true or false,
> indicated that most kmalloc() calls are done by tasks. Only a few percent
> of the time is in_task() false. That is the reason why I optimized the
> case where in_task() is true.

Right. This relies on the fact that changing preemption is cheaper, which
is not always true. The ultimate benefit is of course when the preemption
changes can be removed/optimized away.

> > Based on that, I don't see any added value from the optimisation once
> > PREEMPT_DYNAMIC is enabled.
>
> The PREEMPT_DYNAMIC result is a bit surprising to me. Given the data
> points, I am not going to object to this patch then. I will try to look
> further into why this is the case when I have time.

Okay, thank you. In the SERVER case we keep the preemption counter, so
this obviously has an impact. I am a little surprised that DYN-FULL and
DYN-NONE differ a little since the code runs with disabled interrupts.
But then this might be the extra jump to preempt_schedule() which is
patched out in the SERVER case.

> Cheers,
> Longman

Sebastian
diff --git a/mm/memcontrol.c b/mm/memcontrol.c
index 1e76f26be2c15..92180f1aa9edc 100644
--- a/mm/memcontrol.c
+++ b/mm/memcontrol.c
@@ -2126,7 +2126,7 @@ struct memcg_stock_pcp {
 	local_lock_t stock_lock;
 	struct mem_cgroup *cached; /* this never be root cgroup */
 	unsigned int nr_pages;
-#ifndef CONFIG_PREEMPT_RT
+#ifndef CONFIG_PREEMPTION
 	/* Protects only task_obj */
 	local_lock_t task_obj_lock;
 	struct obj_stock task_obj;
@@ -2139,7 +2139,7 @@ struct memcg_stock_pcp {
 };
 static DEFINE_PER_CPU(struct memcg_stock_pcp, memcg_stock) = {
 	.stock_lock = INIT_LOCAL_LOCK(stock_lock),
-#ifndef CONFIG_PREEMPT_RT
+#ifndef CONFIG_PREEMPTION
 	.task_obj_lock = INIT_LOCAL_LOCK(task_obj_lock),
 #endif
 };
@@ -2228,7 +2228,7 @@ static void drain_local_stock(struct work_struct *dummy)
 	 * drain_stock races is that we always operate on local CPU stock
 	 * here with IRQ disabled
 	 */
-#ifndef CONFIG_PREEMPT_RT
+#ifndef CONFIG_PREEMPTION
 	local_lock(&memcg_stock.task_obj_lock);
 	old = drain_obj_stock(&this_cpu_ptr(&memcg_stock)->task_obj, NULL);
 	local_unlock(&memcg_stock.task_obj_lock);
@@ -2837,7 +2837,7 @@ static inline struct obj_stock *get_obj_stock(unsigned long *pflags,
 {
 	struct memcg_stock_pcp *stock;

-#ifndef CONFIG_PREEMPT_RT
+#ifndef CONFIG_PREEMPTION
 	if (likely(in_task())) {
 		*pflags = 0UL;
 		*stock_pcp = NULL;
@@ -2855,7 +2855,7 @@ static inline struct obj_stock *get_obj_stock(unsigned long *pflags,
 static inline void put_obj_stock(unsigned long flags,
 				 struct memcg_stock_pcp *stock_pcp)
 {
-#ifndef CONFIG_PREEMPT_RT
+#ifndef CONFIG_PREEMPTION
 	if (likely(!stock_pcp))
 		local_unlock(&memcg_stock.task_obj_lock);
 	else
@@ -3267,7 +3267,7 @@ static bool obj_stock_flush_required(struct memcg_stock_pcp *stock,
 {
 	struct mem_cgroup *memcg;

-#ifndef CONFIG_PREEMPT_RT
+#ifndef CONFIG_PREEMPTION
 	if (in_task() && stock->task_obj.cached_objcg) {
 		memcg = obj_cgroup_memcg(stock->task_obj.cached_objcg);
 		if (memcg && mem_cgroup_is_descendant(memcg, root_memcg))
Based on my understanding the optimisation with task_obj for in_task()
makes sense on non-PREEMPTIBLE kernels because preempt_disable()/enable()
is optimized away. This could then be restricted to !CONFIG_PREEMPTION
kernels instead of only PREEMPT_RT.
With CONFIG_PREEMPT_DYNAMIC a non-PREEMPTIBLE kernel can also be
configured, but these kernels always have preempt_disable()/enable()
present, so the optimisation probably makes no sense there.

Restrict the optimisation to !CONFIG_PREEMPTION kernels.

Signed-off-by: Sebastian Andrzej Siewior <bigeasy@linutronix.de>
---
 mm/memcontrol.c | 12 ++++++------
 1 file changed, 6 insertions(+), 6 deletions(-)