[RFC,3/3] mm/memcg: Allow the task_obj optimization only on non-PREEMPTIBLE kernels.

Message ID 20211222114111.2206248-4-bigeasy@linutronix.de (mailing list archive)
State New
Series mm/memcg: Address PREEMPT_RT problems instead of disabling it.

Commit Message

Sebastian Andrzej Siewior Dec. 22, 2021, 11:41 a.m. UTC
Based on my understanding, the optimisation with task_obj for in_task()
makes sense on non-preemptible kernels because preempt_disable()/enable()
is optimized away. The optimisation could therefore be restricted to
!CONFIG_PREEMPTION kernels instead of only PREEMPT_RT.
With CONFIG_PREEMPT_DYNAMIC a non-preemptible kernel can also be
configured, but these kernels always have preempt_disable()/enable()
present, so the optimisation probably makes no sense there either.
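
For reference, a simplified sketch of the relevant macros (condensed
from include/linux/preempt.h; not the verbatim definitions):

	/* Condensed sketch, not the exact kernel macros. */
	#ifdef CONFIG_PREEMPT_COUNT	/* selected by CONFIG_PREEMPTION,
					 * also by CONFIG_PREEMPT_DYNAMIC */
	#define preempt_disable()  do { preempt_count_inc(); barrier(); } while (0)
	#else
	#define preempt_disable()  barrier()	/* no runtime cost */
	#endif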

Restrict the optimisation to !CONFIG_PREEMPTION kernels.

Signed-off-by: Sebastian Andrzej Siewior <bigeasy@linutronix.de>
---
 mm/memcontrol.c | 12 ++++++------
 1 file changed, 6 insertions(+), 6 deletions(-)

Comments

Waiman Long Dec. 23, 2021, 9:48 p.m. UTC | #1
On 12/22/21 06:41, Sebastian Andrzej Siewior wrote:
> Based on my understanding the optimisation with task_obj for in_task()
> mask sense on non-PREEMPTIBLE kernels because preempt_disable()/enable()
> is optimized away. This could be then restricted to !CONFIG_PREEMPTION kernel
> instead to only PREEMPT_RT.
> With CONFIG_PREEMPT_DYNAMIC a non-PREEMPTIBLE kernel can also be
> configured but these kernels always have preempt_disable()/enable()
> present so it probably makes no sense here for the optimisation.
>
> Restrict the optimisation to !CONFIG_PREEMPTION kernels.
>
> Signed-off-by: Sebastian Andrzej Siewior <bigeasy@linutronix.de>

If PREEMPT_DYNAMIC is selected, PREEMPTION will also be set. My 
understanding is that some distros are going to use PREEMPT_DYNAMIC, but 
default to PREEMPT_VOLUNTARY. So I don't believe it is a good idea to 
disable the optimization based on PREEMPTION alone.

Regards,
Longman
Sebastian Andrzej Siewior Jan. 3, 2022, 2:44 p.m. UTC | #2
On 2021-12-23 16:48:41 [-0500], Waiman Long wrote:
> On 12/22/21 06:41, Sebastian Andrzej Siewior wrote:
> > Based on my understanding the optimisation with task_obj for in_task()
> > mask sense on non-PREEMPTIBLE kernels because preempt_disable()/enable()
> > is optimized away. This could be then restricted to !CONFIG_PREEMPTION kernel
> > instead to only PREEMPT_RT.
> > With CONFIG_PREEMPT_DYNAMIC a non-PREEMPTIBLE kernel can also be
> > configured but these kernels always have preempt_disable()/enable()
> > present so it probably makes no sense here for the optimisation.
> > 
> > Restrict the optimisation to !CONFIG_PREEMPTION kernels.
> > 
> > Signed-off-by: Sebastian Andrzej Siewior <bigeasy@linutronix.de>
> 
> If PREEMPT_DYNAMIC is selected, PREEMPTION will also be set. My
> understanding is that some distros are going to use PREEMPT_DYNAMIC, but
> default to PREEMPT_VOLUNTARY. So I don't believe it is a good idea to
> disable the optimization based on PREEMPTION alone.

So there is a benefit to this even if preempt_disable() is not optimized
away? My understanding was that this depends on preempt_disable() being
optimized away.
Is there something you recommend as a benchmark where I could get some
numbers?

> Regards,
> Longman

Sebastian
Waiman Long Jan. 3, 2022, 3:04 p.m. UTC | #3
On 1/3/22 09:44, Sebastian Andrzej Siewior wrote:
> On 2021-12-23 16:48:41 [-0500], Waiman Long wrote:
>> On 12/22/21 06:41, Sebastian Andrzej Siewior wrote:
>>> Based on my understanding the optimisation with task_obj for in_task()
>>> mask sense on non-PREEMPTIBLE kernels because preempt_disable()/enable()
>>> is optimized away. This could be then restricted to !CONFIG_PREEMPTION kernel
>>> instead to only PREEMPT_RT.
>>> With CONFIG_PREEMPT_DYNAMIC a non-PREEMPTIBLE kernel can also be
>>> configured but these kernels always have preempt_disable()/enable()
>>> present so it probably makes no sense here for the optimisation.
>>>
>>> Restrict the optimisation to !CONFIG_PREEMPTION kernels.
>>>
>>> Signed-off-by: Sebastian Andrzej Siewior <bigeasy@linutronix.de>
>> If PREEMPT_DYNAMIC is selected, PREEMPTION will also be set. My
>> understanding is that some distros are going to use PREEMPT_DYNAMIC, but
>> default to PREEMPT_VOLUNTARY. So I don't believe it is a good idea to
>> disable the optimization based on PREEMPTION alone.
> So there is a benefit to this even if preempt_disable() is not optimized
> away? My understanding was that this depends on preempt_disable() being
> optimized away.
> Is there something you recommend as a benchmark where I could get some
> numbers?

In the case of PREEMPT_DYNAMIC, it depends on the default setting used
by most users. I will support disabling the optimization if
defined(CONFIG_PREEMPT_RT) || defined(CONFIG_PREEMPT), just not by
CONFIG_PREEMPTION alone.
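
I.e., roughly (an illustrative guard, not a concrete patch):

	#if defined(CONFIG_PREEMPT_RT) || defined(CONFIG_PREEMPT)
	/* no task_obj fast path: always use irq_obj */
	#else
	/* keep the in_task() + task_obj optimization */
	#endif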

As for a microbenchmark, something that makes a lot of calls to malloc()
or related allocations can be used.

Cheers,
Longman
Sebastian Andrzej Siewior Jan. 5, 2022, 8:22 p.m. UTC | #4
On 2022-01-03 10:04:29 [-0500], Waiman Long wrote:
> On 1/3/22 09:44, Sebastian Andrzej Siewior wrote:
> > Is there something you recommend as a benchmark where I could get some
> > numbers?
> 
> In the case of PREEMPT_DYNAMIC, it depends on the default setting used by
> most users. I will support disabling the optimization if
> defined(CONFIG_PREEMPT_RT) || defined(CONFIG_PREEMPT), just not by
> CONFIG_PREEMPTION alone.
> 
> As for a microbenchmark, something that makes a lot of calls to malloc() or
> related allocations can be used.

The numbers I measured:

        Sandy Bridge   Haswell        Skylake         AMD-A8 7100    Zen2           ARM64
PREEMPT 5,123,896,822  5,215,055,226   5,077,611,590  6,012,287,874  6,234,674,489  20,000,000,100
IRQ     7,494,119,638  6,810,367,629  10,620,130,377  4,178,546,086  4,898,076,012  13,538,461,925

For micro-benchmarking I did 1,000,000,000 iterations of
preempt_disable()/enable() [PREEMPT] and local_irq_save()/restore()
[IRQ].
On a Sandy Bridge the PREEMPT loop took 5,123,896,822ns while the IRQ
loop took 7,494,119,638ns. The absolute numbers are not important; what
is worth noting is that preemption off/on is less expensive than IRQ
off/on, except on AMD and ARM64 where IRQ off/on was less expensive. The
whole loop was performed with disabled interrupts so I don't expect much
interference, but then I don't know how much the µArch optimized away in
local_irq_restore() given that the interrupts were already disabled.
I don't have any recent Intel HW (I think) so I don't know if this is an
Intel-only thing (preemption off/on cheaper than IRQ off/on), but I
guess that recent µArchs would behave similar to AMD.
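
A minimal sketch of the PREEMPT loop (illustrative; the real test code
differs in detail, and a real run would be chunked to avoid watchdog
splats):

	#include <linux/irqflags.h>
	#include <linux/ktime.h>
	#include <linux/preempt.h>
	#include <linux/printk.h>
	#include <linux/types.h>

	/* Illustrative sketch; the IRQ variant replaces the
	 * preempt_disable()/enable() pair with
	 * local_irq_save()/local_irq_restore().
	 */
	static void bench_preempt_loop(void)
	{
		unsigned long flags;
		u64 i, t0, t1;

		local_irq_save(flags);	/* whole loop runs with IRQs off */
		t0 = ktime_get_ns();
		for (i = 0; i < 1000000000ULL; i++) {
			preempt_disable();
			preempt_enable();
		}
		t1 = ktime_get_ns();
		local_irq_restore(flags);
		pr_info("PREEMPT loop: %llu ns\n", t1 - t0);
	}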

Moving on: for the test I ran 100,000,000 iterations of
     kfree(kmalloc(128, GFP_ATOMIC | __GFP_ACCOUNT));

The BH suffix means that in_task() reported false during the allocation;
otherwise it reported true.
SD is the standard deviation.
SERVER means PREEMPT_NONE while PREEMPT means CONFIG_PREEMPT.
OPT means the optimisation (in_task() + task_obj) is active; NO-OPT
means no optimisation (irq_obj is always used).
The numbers are the time in ns needed for 100,000,000 iterations (alloc +
free). I ran 5 tests and used the median value here. If the standard
deviation exceeded 10^9 then I repeated the test; usually only one value
was off while the others remained in the same range.
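
A sketch of one ALLOC/FREE sample (illustrative, not the exact harness;
it has to run from a task in a non-root memcg so that __GFP_ACCOUNT
actually exercises the objcg path, and from softirq context, e.g. a
tasklet, for the BH variant):

	#include <linux/ktime.h>
	#include <linux/slab.h>
	#include <linux/types.h>

	/* One sample; the tables report the median of 5 such samples. */
	static u64 bench_alloc_free(void)
	{
		u64 i, t0, t1;

		t0 = ktime_get_ns();
		for (i = 0; i < 100000000ULL; i++)
			kfree(kmalloc(128, GFP_ATOMIC | __GFP_ACCOUNT));
		t1 = ktime_get_ns();
		return t1 - t0;	/* ns for 10^8 alloc+free iterations */
	}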

Sandy Bridge
                 SERVER OPT   SERVER NO-OPT    PREEMPT OPT     PREEMPT NO-OPT
ALLOC/FREE    8,519,295,176   9,051,200,652    10,627,431,395  11,198,189,843
SD                5,309,768      29,253,976       129,102,317      40,681,909
ALLOC/FREE BH 9,996,704,330   8,927,026,031    11,680,149,900  11,139,356,465
SD               38,237,534      72,913,120        23,626,932     116,413,331

The optimisation is visible in the SERVER-OPT case: ~1.5s difference in
runtime (or 14.7ns per iteration) between the BH and !BH case. There is
hardly any difference between BH and !BH in the SERVER-NO-OPT case.
For the SERVER case, the optimisation improves the runtime by ~0.5s for
the !BH case.
For the PREEMPT case it also looks like a ~0.5s improvement in the !BH
case, while in the BH case it is the other way around.

                 DYN-SRV-OPT   DYN-SRV-NO-OPT    DYN-FULL-OPT   DYN-FULL-NO-OPT
ALLOC/FREE     11,069,180,584  10,773,407,543  10,963,581,285    10,826,207,969
SD                 23,195,912     112,763,104      13,145,589        33,543,625
ALLOC/FREE BH  11,443,342,069  10,720,094,700  11,064,914,727    10,955,883,521
SD                 81,150,074     171,299,554      58,603,778        84,131,143

DYN is CONFIG_PREEMPT_DYNAMIC enabled with CONFIG_PREEMPT_NONE as the
default. I don't see any difference vs CONFIG_PREEMPT except the default
preemption state (so I didn't test that). The preemption counter is
always forced in, so preempt_enable()/disable() is not optimized away.
SRV is the default value (PREEMPT_NONE) and FULL is the overridden
(preempt=full) state.

Based on that, I don't see any added value by the optimisation once
PREEMPT_DYNAMIC is enabled.

----
Zen2:
                 SERVER OPT   SERVER NO-OPT   PREEMPT OPT      PREEMPT NO-OPT
ALLOC/FREE    8,126,735,313   8,751,307,383    9,822,927,142   10,045,105,425
SD              100,806,471      87,234,047       55,170,179       25,832,386
ALLOC/FREE BH 9,197,455,885   8,394,337,053   10,671,227,095    9,904,954,934
SD              155,223,919      57,800,997       47,529,496      105,260,566

On Zen2, IRQ off/on was less expensive than preempt off/on, so it looks
as if I mixed up the numbers for PREEMPT OPT and NO-OPT, but I re-ran it
twice and nothing significant changed… However, the difference on
PREEMPT for the !BH case is not as significant as on Sandy Bridge
(~200ms here vs ~500ms there).

                 DYN-SRV-OPT   DYN-SRV-NO-OPT    DYN-FULL-OPT  DYN-FULL-NO-OPT
ALLOC/FREE      9,680,498,929  10,180,973,847   9,644,453,405  10,224,416,854
SD                 73,944,156      61,850,527      13,277,203     107,145,212
ALLOC/FREE BH  10,680,074,634   9,956,695,323  10,704,442,515   9,942,155,910
SD                 75,535,172      34,524,493      54,625,678      87,163,920

For double-checking and re-testing, the full git tree is available at [0]
and the script to parse the results is at [1].

[0] git://git.kernel.org/pub/scm/linux/kernel/git/bigeasy/staging memcg
[1] https://breakpoint.cc/parse-memcg.py

> Cheers,
> Longman

Sebastian
Waiman Long Jan. 6, 2022, 3:28 a.m. UTC | #5
On 1/5/22 15:22, Sebastian Andrzej Siewior wrote:
> On 2022-01-03 10:04:29 [-0500], Waiman Long wrote:
>> On 1/3/22 09:44, Sebastian Andrzej Siewior wrote:
>>> Is there something you recommend as a benchmark where I could get some
>>> numbers?
>> In the case of PREEMPT_DYNAMIC, it depends on the default setting used by
>> most users. I will support disabling the optimization if
>> defined(CONFIG_PREEMPT_RT) || defined(CONFIG_PREEMPT), just not by
>> CONFIG_PREEMPTION alone.
>>
>> As for a microbenchmark, something that makes a lot of calls to malloc() or
>> related allocations can be used.
> The numbers I measured:
>
>          Sandy Bridge   Haswell        Skylake         AMD-A8 7100    Zen2           ARM64
> PREEMPT 5,123,896,822  5,215,055,226   5,077,611,590  6,012,287,874  6,234,674,489  20,000,000,100
> IRQ     7,494,119,638  6,810,367,629  10,620,130,377  4,178,546,086  4,898,076,012  13,538,461,925

Thanks for the extensive testing. I usually perform my performance tests
on Intel hardware. I didn't realize that Zen2 and arm64 perform better
with irq on/off.


>
> For micro-benchmarking I did 1,000,000,000 iterations of
> preempt_disable()/enable() [PREEMPT] and local_irq_save()/restore()
> [IRQ].
> On a Sandy Bridge the PREEMPT loop took 5,123,896,822ns while the IRQ
> loop took 7,494,119,638ns. The absolute numbers are not important; what
> is worth noting is that preemption off/on is less expensive than IRQ
> off/on, except on AMD and ARM64 where IRQ off/on was less expensive. The
> whole loop was performed with disabled interrupts so I don't expect much
> interference, but then I don't know how much the µArch optimized away in
> local_irq_restore() given that the interrupts were already disabled.
> I don't have any recent Intel HW (I think) so I don't know if this is an
> Intel-only thing (preemption off/on cheaper than IRQ off/on), but I
> guess that recent µArchs would behave similar to AMD.
>
> Moving on: for the test I ran 100,000,000 iterations of
>       kfree(kmalloc(128, GFP_ATOMIC | __GFP_ACCOUNT));
>
> The BH suffix means that in_task() reported false during the allocation;
> otherwise it reported true.
> SD is the standard deviation.
> SERVER means PREEMPT_NONE while PREEMPT means CONFIG_PREEMPT.
> OPT means the optimisation (in_task() + task_obj) is active; NO-OPT
> means no optimisation (irq_obj is always used).
> The numbers are the time in ns needed for 100,000,000 iterations (alloc +
> free). I ran 5 tests and used the median value here. If the standard
> deviation exceeded 10^9 then I repeated the test; usually only one value
> was off while the others remained in the same range.
>
> Sandy Bridge
>                   SERVER OPT   SERVER NO-OPT    PREEMPT OPT     PREEMPT NO-OPT
> ALLOC/FREE    8,519,295,176   9,051,200,652    10,627,431,395  11,198,189,843
> SD                5,309,768      29,253,976       129,102,317      40,681,909
> ALLOC/FREE BH 9,996,704,330   8,927,026,031    11,680,149,900  11,139,356,465
> SD               38,237,534      72,913,120        23,626,932     116,413,331

My own testing, tracking the number of times in_task() is true or
false, indicated that most kmalloc() calls are done by tasks. Only a few
percent of the time is in_task() false. That is the reason why I
optimized the case where in_task() is true.


>
> The optimisation is visible in the SERVER-OPT case: ~1.5s difference in
> runtime (or 14.7ns per iteration) between the BH and !BH case. There is
> hardly any difference between BH and !BH in the SERVER-NO-OPT case.
> For the SERVER case, the optimisation improves the runtime by ~0.5s for
> the !BH case.
> For the PREEMPT case it also looks like a ~0.5s improvement in the !BH
> case, while in the BH case it is the other way around.
>
>                   DYN-SRV-OPT   DYN-SRV-NO-OPT    DYN-FULL-OPT   DYN-FULL-NO-OPT
> ALLOC/FREE     11,069,180,584  10,773,407,543  10,963,581,285    10,826,207,969
> SD                 23,195,912     112,763,104      13,145,589        33,543,625
> ALLOC/FREE BH  11,443,342,069  10,720,094,700  11,064,914,727    10,955,883,521
> SD                 81,150,074     171,299,554      58,603,778        84,131,143
>
> DYN is CONFIG_PREEMPT_DYNAMIC enabled with CONFIG_PREEMPT_NONE as the
> default. I don't see any difference vs CONFIG_PREEMPT except the default
> preemption state (so I didn't test that). The preemption counter is
> always forced in, so preempt_enable()/disable() is not optimized away.
> SRV is the default value (PREEMPT_NONE) and FULL is the overridden
> (preempt=full) state.
>
> Based on that, I don't see any added value by the optimisation once
> PREEMPT_DYNAMIC is enabled.

The PREEMPT_DYNAMIC result is a bit surprising to me. Given the data 
points, I am not going to object to this patch then. I will try to look 
further into why this is the case when I have time.

Cheers,
Longman
Sebastian Andrzej Siewior Jan. 13, 2022, 3:26 p.m. UTC | #6
On 2022-01-05 22:28:10 [-0500], Waiman Long wrote:
> Thanks for the extensive testing. I usually perform my performance tests on
> Intel hardware. I didn't realize that Zen2 and arm64 perform better with irq
> on/off.

Maybe you have access to a more recent µArch from Intel which could
behave differently.

> My own testing, tracking the number of times in_task() is true or false,
> indicated that most kmalloc() calls are done by tasks. Only a few percent
> of the time is in_task() false. That is the reason why I optimized the
> case where in_task() is true.

Right. This relies on the fact that changing the preemption state is
cheaper, which is not always true. The ultimate benefit comes, of
course, when the preemption changes can be removed/optimized away.
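
Condensed, the two paths look roughly like this (paraphrased from
get_obj_stock() in the patch below; a sketch, not the exact code):

	#ifndef CONFIG_PREEMPTION
		if (likely(in_task())) {
			/* cheap path: local_lock() only disables preemption,
			 * which is free once preempt_disable() compiles away
			 */
			local_lock(&memcg_stock.task_obj_lock);
			return &this_cpu_ptr(&memcg_stock)->task_obj;
		}
	#endif
		/* otherwise: disable interrupts for irq_obj */
		local_lock_irqsave(&memcg_stock.stock_lock, *pflags);
		return &this_cpu_ptr(&memcg_stock)->irq_obj;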

> > Based on that, I don't see any added value by the optimisation once
> > PREEMPT_DYNAMIC is enabled.
> 
> The PREEMPT_DYNAMIC result is a bit surprising to me. Given the data points,
> I am not going to object to this patch then. I will try to look further into
> why this is the case when I have time.

Okay, thank you.
In the SERVER case we keep the preemption counter, so this obviously
has an impact. I am a little surprised that DYN-FULL and DYN-SRV differ
at all since the code runs with interrupts disabled. But then this might
be the extra jump to preempt_schedule(), which is patched out in the
SERVER case.
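
(A simplified sketch of my understanding of preempt_enable() here, not
the exact kernel code:

	#define preempt_enable()				\
	do {							\
		barrier();					\
		if (unlikely(preempt_count_dec_and_test()))	\
			__preempt_schedule();			\
	} while (0)

With PREEMPT_DYNAMIC, __preempt_schedule() goes through a static call;
booting with preempt=none patches that call site out, but the
dec-and-test of the preemption counter above remains.)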

> Cheers,
> Longman

Sebastian

Patch

diff --git a/mm/memcontrol.c b/mm/memcontrol.c
index 1e76f26be2c15..92180f1aa9edc 100644
--- a/mm/memcontrol.c
+++ b/mm/memcontrol.c
@@ -2126,7 +2126,7 @@  struct memcg_stock_pcp {
 	local_lock_t stock_lock;
 	struct mem_cgroup *cached; /* this never be root cgroup */
 	unsigned int nr_pages;
-#ifndef CONFIG_PREEMPT_RT
+#ifndef CONFIG_PREEMPTION
 	/* Protects only task_obj */
 	local_lock_t task_obj_lock;
 	struct obj_stock task_obj;
@@ -2139,7 +2139,7 @@  struct memcg_stock_pcp {
 };
 static DEFINE_PER_CPU(struct memcg_stock_pcp, memcg_stock) = {
 	.stock_lock = INIT_LOCAL_LOCK(stock_lock),
-#ifndef CONFIG_PREEMPT_RT
+#ifndef CONFIG_PREEMPTION
 	.task_obj_lock = INIT_LOCAL_LOCK(task_obj_lock),
 #endif
 };
@@ -2228,7 +2228,7 @@  static void drain_local_stock(struct work_struct *dummy)
 	 * drain_stock races is that we always operate on local CPU stock
 	 * here with IRQ disabled
 	 */
-#ifndef CONFIG_PREEMPT_RT
+#ifndef CONFIG_PREEMPTION
 	local_lock(&memcg_stock.task_obj_lock);
 	old = drain_obj_stock(&this_cpu_ptr(&memcg_stock)->task_obj, NULL);
 	local_unlock(&memcg_stock.task_obj_lock);
@@ -2837,7 +2837,7 @@  static inline struct obj_stock *get_obj_stock(unsigned long *pflags,
 {
 	struct memcg_stock_pcp *stock;
 
-#ifndef CONFIG_PREEMPT_RT
+#ifndef CONFIG_PREEMPTION
 	if (likely(in_task())) {
 		*pflags = 0UL;
 		*stock_pcp = NULL;
@@ -2855,7 +2855,7 @@  static inline struct obj_stock *get_obj_stock(unsigned long *pflags,
 static inline void put_obj_stock(unsigned long flags,
 				 struct memcg_stock_pcp *stock_pcp)
 {
-#ifndef CONFIG_PREEMPT_RT
+#ifndef CONFIG_PREEMPTION
 	if (likely(!stock_pcp))
 		local_unlock(&memcg_stock.task_obj_lock);
 	else
@@ -3267,7 +3267,7 @@  static bool obj_stock_flush_required(struct memcg_stock_pcp *stock,
 {
 	struct mem_cgroup *memcg;
 
-#ifndef CONFIG_PREEMPT_RT
+#ifndef CONFIG_PREEMPTION
 	if (in_task() && stock->task_obj.cached_objcg) {
 		memcg = obj_cgroup_memcg(stock->task_obj.cached_objcg);
 		if (memcg && mem_cgroup_is_descendant(memcg, root_memcg))