Message ID: 20210609113903.1421-1-vbabka@suse.cz (mailing list archive)
Series: SLUB: reduce irq disabled scope and make it RT compatible
On Wed, Jun 09, 2021 at 01:38:29PM +0200, Vlastimil Babka wrote:
> Changes since RFC v1 [1]:
> * Addressed feedback from Christoph and Mel, added their acks.
> * Finished RT conversion, including adopting 2 patches from the RT tree.
> * The optional local_lock conversion has to sacrifice lockless fastpaths on RT
> * Added some more cleanup patches to the front.
>
> This series was initially inspired by Mel's pcplist local_lock rewrite, and
> also by interest to better understand SLUB's locking and the new locking
> primitives and their RT variants and implications. It should make SLUB more
> preemption-friendly and fully RT compatible, hopefully without noticeable
> regressions on !RT kernels, as the fast paths are not affected there.
>
> Series is based on 5.13-rc5 and also available as a git branch:
> https://git.kernel.org/pub/scm/linux/kernel/git/vbabka/linux.git/log/?h=slub-local-lock-v2r1
>

This failed to boot for me inside KVM.

[    0.273576] SRBDS: Unknown: Dependent on hypervisor status
[    0.273576] MDS: Mitigation: Clear CPU buffers
[    0.273576] Freeing SMP alternatives memory: 36K
[    0.273576] bad: scheduling from the idle thread!
[    0.273576] CPU: 0 PID: 0 Comm: swapper/0 Not tainted 5.13.0-rc5-unspecified+ #1
[    0.273576] Hardware name: QEMU Standard PC (Q35 + ICH9, 2009), BIOS 0.0.0 02/06/2015
[    0.273576] Call Trace:
[    0.273576]  dump_stack+0x76/0x94
[    0.273576]  dequeue_task_idle+0x28/0x40
[    0.273576]  __do_set_cpus_allowed+0xe0/0x1a0
[    0.273576]  __schedule+0x7f7/0x8f0
[    0.273576]  __cond_resched+0x22/0x40
[    0.273576]  alloc_vmap_area+0x72/0x8b0
[    0.273576]  ? kmem_cache_alloc_node_trace+0x189/0x300
[    0.273576]  ? __get_vm_area_node+0x76/0x150
[    0.273576]  __get_vm_area_node+0xab/0x150
[    0.273576]  __vmalloc_node_range+0x6d/0x2c0
[    0.273576]  ? kernel_clone+0x9b/0x3e0
[    0.273576]  ? kmem_cache_alloc_node+0x18d/0x2f0
[    0.273576]  ? copy_process+0x218/0x1c40
[    0.273576]  copy_process+0x2c1/0x1c40
[    0.273576]  ? kernel_clone+0x9b/0x3e0
[    0.273576]  ? enqueue_task_fair+0xa1/0x590
[    0.273576]  kernel_clone+0x9b/0x3e0
[    0.273576]  ? acpi_hw_register_read+0x146/0x166
[    0.273576]  kernel_thread+0x55/0x70
[    0.273576]  ? kthread_is_per_cpu+0x30/0x30
[    0.273576]  rest_init+0x75/0xc0
[    0.273576]  start_kernel+0x7fb/0x822
[    0.273576]  secondary_startup_64_no_verify+0xc2/0xcb
[    0.273576] BUG: kernel NULL pointer dereference, address: 0000000000000000
[    0.273576] #PF: supervisor instruction fetch in kernel mode
[    0.273576] #PF: error_code(0x0010) - not-present page
[    0.273576] PGD 0 P4D 0
[    0.273576] Oops: 0010 [#1] SMP PTI
[    0.273576] CPU: 0 PID: 0 Comm: swapper/0 Not tainted 5.13.0-rc5-unspecified+ #1
[    0.273576] Hardware name: QEMU Standard PC (Q35 + ICH9, 2009), BIOS 0.0.0 02/06/2015
[    0.273576] RIP: 0010:0x0
[    0.273576] Code: Unable to access opcode bytes at RIP 0xffffffffffffffd6.
[    0.273576] RSP: 0000:ffffffffbb803bb0 EFLAGS: 00010002
[    0.273576] RAX: 0000000000000000 RBX: ffff9fa64ea2ecc0 RCX: ffffffffb9e00101
[    0.273576] RDX: 000000000000000a RSI: ffffffffbb81a940 RDI: ffff9fa64ea2ecc0
[    0.273576] RBP: ffffffffbb803c08 R08: 0000000000000000 R09: ffffffffbbaa699c
[    0.273576] R10: 0000000000000000 R11: ffffffffbb803888 R12: ffffffffbb81a940
[    0.273576] R13: ffff9f9a80245f40 R14: ffffffffbb1f0060 R15: 0000000000003fff
[    0.273576] FS:  0000000000000000(0000) GS:ffff9fa64ea00000(0000) knlGS:0000000000000000
[    0.273576] CS:  0010 DS: 0000 ES: 0000 CR0: 0000000080050033
[    0.273576] CR2: ffffffffffffffd6 CR3: 0000000a60e10001 CR4: 0000000000170ef0
[    0.273576] Call Trace:
[    0.273576]  __schedule+0x7f7/0x8f0
[    0.273576]  __cond_resched+0x22/0x40
[    0.273576]  alloc_vmap_area+0x72/0x8b0
[    0.273576]  ? kmem_cache_alloc_node_trace+0x189/0x300
[    0.273576]  ? __get_vm_area_node+0x76/0x150
[    0.273576]  __get_vm_area_node+0xab/0x150
[    0.273576]  __vmalloc_node_range+0x6d/0x2c0
[    0.273576]  ? kernel_clone+0x9b/0x3e0
[    0.273576]  ? kmem_cache_alloc_node+0x18d/0x2f0
[    0.273576]  ? copy_process+0x218/0x1c40
[    0.273576]  copy_process+0x2c1/0x1c40
[    0.273576]  ? kernel_clone+0x9b/0x3e0
[    0.273576]  ? enqueue_task_fair+0xa1/0x590
[    0.273576]  kernel_clone+0x9b/0x3e0
[    0.273576]  ? acpi_hw_register_read+0x146/0x166
[    0.273576]  kernel_thread+0x55/0x70
[    0.273576]  ? kthread_is_per_cpu+0x30/0x30
[    0.273576]  rest_init+0x75/0xc0
[    0.273576]  start_kernel+0x7fb/0x822
[    0.273576]  secondary_startup_64_no_verify+0xc2/0xcb
[    0.273576] Modules linked in:
[    0.273576] CR2: 0000000000000000
[    0.273576] ---[ end trace 7199d6fbb50b4cf7 ]---
[    0.273576] RIP: 0010:0x0
[    0.273576] Code: Unable to access opcode bytes at RIP 0xffffffffffffffd6.
[    0.273576] RSP: 0000:ffffffffbb803bb0 EFLAGS: 00010002
[    0.273576] RAX: 0000000000000000 RBX: ffff9fa64ea2ecc0 RCX: ffffffffb9e00101
[    0.273576] RDX: 000000000000000a RSI: ffffffffbb81a940 RDI: ffff9fa64ea2ecc0
[    0.273576] RBP: ffffffffbb803c08 R08: 0000000000000000 R09: ffffffffbbaa699c
[    0.273576] R10: 0000000000000000 R11: ffffffffbb803888 R12: ffffffffbb81a940
[    0.273576] R13: ffff9f9a80245f40 R14: ffffffffbb1f0060 R15: 0000000000003fff
[    0.273576] FS:  0000000000000000(0000) GS:ffff9fa64ea00000(0000) knlGS:0000000000000000
[    0.273576] CS:  0010 DS: 0000 ES: 0000 CR0: 0000000080050033
[    0.273576] CR2: ffffffffffffffd6 CR3: 0000000a60e10001 CR4: 0000000000170ef0
[    0.273576] Kernel panic - not syncing: Attempted to kill the idle task!
[    0.273576] Rebooting in 90 seconds..
On 6/9/21 1:38 PM, Vlastimil Babka wrote:
> Changes since RFC v1 [1]:
> * Addressed feedback from Christoph and Mel, added their acks.
> * Finished RT conversion, including adopting 2 patches from the RT tree.
> * The optional local_lock conversion has to sacrifice lockless fastpaths on RT
> * Added some more cleanup patches to the front.
>
> This series was initially inspired by Mel's pcplist local_lock rewrite, and
> also by interest to better understand SLUB's locking and the new locking
> primitives and their RT variants and implications. It should make SLUB more
> preemption-friendly and fully RT compatible, hopefully without noticeable
> regressions on !RT kernels, as the fast paths are not affected there.
>
> Series is based on 5.13-rc5 and also available as a git branch:
> https://git.kernel.org/pub/scm/linux/kernel/git/vbabka/linux.git/log/?h=slub-local-lock-v2r1

Pushed a new revision branch:
https://git.kernel.org/pub/scm/linux/kernel/git/vbabka/linux.git/log/?h=slub-local-lock-v2r2

which fixes a bug in [RFC v2 33/34] "mm, slub: use migrate_disable() on
PREEMPT_RT" that was reported by Mel.
On Mon, Jun 14, 2021 at 10:49:02AM +0100, Mel Gorman wrote:
> On Wed, Jun 09, 2021 at 01:38:29PM +0200, Vlastimil Babka wrote:
> > Changes since RFC v1 [1]:
> > * Addressed feedback from Christoph and Mel, added their acks.
> > * Finished RT conversion, including adopting 2 patches from the RT tree.
> > * The optional local_lock conversion has to sacrifice lockless fastpaths on RT
> > * Added some more cleanup patches to the front.
> >
> > This series was initially inspired by Mel's pcplist local_lock rewrite, and
> > also by interest to better understand SLUB's locking and the new locking
> > primitives and their RT variants and implications. It should make SLUB more
> > preemption-friendly and fully RT compatible, hopefully without noticeable
> > regressions on !RT kernels, as the fast paths are not affected there.
> >
> > Series is based on 5.13-rc5 and also available as a git branch:
> > https://git.kernel.org/pub/scm/linux/kernel/git/vbabka/linux.git/log/?h=slub-local-lock-v2r1
> >
>
> This failed to boot for me inside KVM.
>

Bisection pointed to "mm, slub: use migrate_disable() on PREEMPT_RT".
Off-list, Vlastimil noted that the #ifdef was wrong and the patch below
booted. A wider battery of tests is queued but will take a few days to
complete.

---8<---
mm, slub: use migrate_disable() on PREEMPT_RT -fix

Signed-off-by: Mel Gorman <mgorman@techsingularity.net>
---
 mm/slub.c | 2 +-
 1 file changed, 1 insertion(+), 1 deletion(-)

diff --git a/mm/slub.c b/mm/slub.c
index caa206213e72..f0359b0c8154 100644
--- a/mm/slub.c
+++ b/mm/slub.c
@@ -119,7 +119,7 @@
  * We could simply use migrate_disable()/enable() but as long as it's a
  * function call even on !PREEMPT_RT, use inline preempt_disable() there.
  */
-#ifdef CONFIG_PREEMPT_RT
+#ifndef CONFIG_PREEMPT_RT
 #define slub_get_cpu_ptr(var)	get_cpu_ptr(var)
 #define slub_put_cpu_ptr(var)	put_cpu_ptr(var)
 #else
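For context, the corrected macro pair then reads roughly as follows. The
!PREEMPT_RT branch is taken from the hunk above; the #else branch is
truncated in this archive, so the migrate_disable()-based variant shown
here is a hedged reconstruction for illustration, not a verbatim quote of
the patch:

	#ifndef CONFIG_PREEMPT_RT
	/* On !RT, get_cpu_ptr() disables preemption inline - cheap. */
	#define slub_get_cpu_ptr(var)	get_cpu_ptr(var)
	#define slub_put_cpu_ptr(var)	put_cpu_ptr(var)
	#else
	/*
	 * On PREEMPT_RT, disabling preemption would be too intrusive;
	 * pin the task to its CPU with migrate_disable() instead, which
	 * keeps the section preemptible. (Reconstruction, see above.)
	 */
	#define slub_get_cpu_ptr(var)		\
	({					\
		migrate_disable();		\
		this_cpu_ptr(var);		\
	})
	#define slub_put_cpu_ptr(var)		\
	do {					\
		(void)(var);			\
		migrate_enable();		\
	} while (0)
	#endif

With the inverted #ifdef, each configuration got the variant intended for
the other one, which is what the bisection above caught.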
I replaced my slub changes with slub-local-lock-v2r3.
I haven't seen any complaints from lockdep or so, which is good. Then I
did this with RT enabled (and no debug):

- A "time make -j32" run of allmodconfig on /dev/shm.
  Old:
  | real    20m6,217s
  | user    568m22,553s
  | sys     48m33,126s

  New:
  | real    20m9,049s
  | user    569m32,096s
  | sys     48m47,670s

  These 3 seconds here are probably in the noise range.

- perf_5.10 stat -r 10 hackbench -g200 -s 4096 -l500
  Old:
  |        464.967,20 msec task-clock                #   27,220 CPUs utilized            ( +-  0,16% )
  |         7.683.944      context-switches          #    0,017 M/sec                    ( +-  0,86% )
  |           931.380      cpu-migrations            #    0,002 M/sec                    ( +-  4,94% )
  |           219.569      page-faults               #    0,472 K/sec                    ( +-  0,39% )
  | 1.104.727.599.918      cycles                    #    2,376 GHz                      ( +-  0,18% )
  |   941.428.898.087      stalled-cycles-frontend   #   85,22% frontend cycles idle     ( +-  0,24% )
  |   729.016.546.572      stalled-cycles-backend    #   65,99% backend cycles idle      ( +-  0,32% )
  |   340.133.571.519      instructions              #    0,31  insn per cycle
  |                                                  #    2,77  stalled cycles per insn  ( +-  0,12% )
  |    73.746.821.314      branches                  #  158,607 M/sec                    ( +-  0,13% )
  |       377.838.006      branch-misses             #    0,51% of all branches          ( +-  1,01% )
  |
  |           17,0820 +- 0,0202 seconds time elapsed  ( +-  0,12% )

  New:
  |        422.865,71 msec task-clock                #    4,782 CPUs utilized            ( +-  0,34% )
  |        14.594.238      context-switches          #    0,035 M/sec                    ( +-  0,43% )
  |         3.737.926      cpu-migrations            #    0,009 M/sec                    ( +-  0,46% )
  |           218.474      page-faults               #    0,517 K/sec                    ( +-  0,74% )
  |   940.715.812.020      cycles                    #    2,225 GHz                      ( +-  0,34% )
  |   716.593.827.820      stalled-cycles-frontend   #   76,18% frontend cycles idle     ( +-  0,39% )
  |   550.730.862.839      stalled-cycles-backend    #   58,54% backend cycles idle      ( +-  0,43% )
  |   417.274.588.907      instructions              #    0,44  insn per cycle
  |                                                  #    1,72  stalled cycles per insn  ( +-  0,17% )
  |    92.814.150.290      branches                  #  219,488 M/sec                    ( +-  0,17% )
  |       822.102.170      branch-misses             #    0,89% of all branches          ( +-  0,41% )
  |
  |            88,427 +- 0,618 seconds time elapsed  ( +-  0,70% )

  So this is outside of the noise range.
  I'm not sure where this is coming from. My guess would be higher lock
  contention within the memory allocator.

> The remaining patches to upstream from the RT tree are small ones related to
> KConfig. The patch that restricts PREEMPT_RT to SLUB (not SLAB or SLOB) makes
> sense. The patch that disables CONFIG_SLUB_CPU_PARTIAL with PREEMPT_RT could
> perhaps be re-evaluated as the series also addresses some latency issues with
> percpu partial slabs.

With that series the PARTIAL slab can be indeed enabled. I have (had) a
half done series where I had PARTIAL enabled and noticed a slight
increase in latency so made it "default y on !RT". It wasn't dramatic
but appeared to be outside of noise.

Sebastian
On 7/2/21 8:29 PM, Sebastian Andrzej Siewior wrote:
> I replaced my slub changes with slub-local-lock-v2r3.
> I haven't seen any complaints from lockdep or so, which is good. Then I
> did this with RT enabled (and no debug):

Thanks for testing!

> - A "time make -j32" run of allmodconfig on /dev/shm.
>   Old:
>   | real    20m6,217s
>   New:
>   | real    20m9,049s
>
>   These 3 seconds here are probably in the noise range.
>
> - perf_5.10 stat -r 10 hackbench -g200 -s 4096 -l500
>   Old:
>   |        464.967,20 msec task-clock    #   27,220 CPUs utilized   ( +-  0,16% )
>   [...]
>   |           17,0820 +- 0,0202 seconds time elapsed  ( +-  0,12% )
>
>   New:
>   |        422.865,71 msec task-clock    #    4,782 CPUs utilized   ( +-  0,34% )
>   [...]
>   |            88,427 +- 0,618 seconds time elapsed  ( +-  0,70% )
>
>   So this is outside of the noise range.
>   I'm not sure where this is coming from. My guess would be higher lock
>   contention within the memory allocator.

The series shouldn't significantly change the memory allocator
interaction, though.
Seems there's less cycles, but more time elapsed, thus more sleeping -
is it locks becoming mutexes on RT?

My first guess - the last, local_lock patch. What would happen if you
take that one out? It should still be RT-compatible. If it improves a
lot, maybe that conversion to local_lock is not worth it then.

My second guess - list_lock remains a spinlock with my series, thus an
RT mutex, but the current RT tree converts it to a raw_spinlock. I'd
hope leaving that one as a non-raw spinlock would still be much better
for RT goals, even if hackbench (which is AFAIK very slab intensive)
throughput regresses - hopefully not that much.

>> The remaining patches to upstream from the RT tree are small ones related to
>> KConfig. The patch that restricts PREEMPT_RT to SLUB (not SLAB or SLOB) makes
>> sense. The patch that disables CONFIG_SLUB_CPU_PARTIAL with PREEMPT_RT could
>> perhaps be re-evaluated as the series also addresses some latency issues with
>> percpu partial slabs.
>
> With that series the PARTIAL slab can be indeed enabled. I have (had) a
> half done series where I had PARTIAL enabled and noticed a slight
> increase in latency so made it "default y on !RT". It wasn't dramatic
> but appeared to be outside of noise.
>
> Sebastian
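For readers who have not used the primitive being debated here, below is
a minimal, self-contained sketch of the local_lock pattern (the generic
API from include/linux/local_lock.h; the structure and function names
are illustrative, not taken from the series):

	#include <linux/local_lock.h>
	#include <linux/percpu.h>
	#include <linux/list.h>

	/* Illustrative per-CPU structure guarded by a local_lock. */
	struct my_pcpu_lists {
		local_lock_t lock;
		struct list_head items;	/* initialized elsewhere */
	};

	static DEFINE_PER_CPU(struct my_pcpu_lists, my_lists) = {
		.lock = INIT_LOCAL_LOCK(lock),
	};

	static void add_item(struct list_head *item)
	{
		unsigned long flags;

		/*
		 * On !PREEMPT_RT this boils down to local_irq_save()
		 * (or just preempt_disable() for plain local_lock()),
		 * so fast paths are essentially unchanged. On
		 * PREEMPT_RT it takes a per-CPU spinlock_t instead,
		 * keeping the section preemptible - which is also why
		 * the conversion can cost throughput in slab-heavy
		 * loads like hackbench.
		 */
		local_lock_irqsave(&my_lists.lock, flags);
		list_add(item, this_cpu_ptr(&my_lists.items));
		local_unlock_irqrestore(&my_lists.lock, flags);
	}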
On Fri, 2021-07-02 at 20:29 +0200, Sebastian Andrzej Siewior wrote:
> I replaced my slub changes with slub-local-lock-v2r3.
> I haven't seen any complaints from lockdep or so, which is good. Then I
> did this with RT enabled (and no debug):

Below is some raw hackbench data from my little i4790 desktop box. It
says we'll definitely still want list_lock to be raw.

It also appears to be saying that there's something RT specific to
stare at in addition to the list_lock business, but add a pinch of salt
to that due to the config of the virgin(ish) tip tree being much
lighter than the enterprise(ish) config of the tip-rt tree.

perf stat -r10 hackbench -s4096 -l500
full warmup, record, repeat twice for elapsed

5.13.0.g60ab3ed-tip-rt
          8,898.51 msec task-clock                #    7.525 CPUs utilized            ( +-  0.33% )
           368,922      context-switches          #    0.041 M/sec                    ( +-  5.20% )
            42,281      cpu-migrations            #    0.005 M/sec                    ( +-  5.28% )
            13,180      page-faults               #    0.001 M/sec                    ( +-  0.70% )
    33,343,378,867      cycles                    #    3.747 GHz                      ( +-  0.30% )
    21,656,783,887      instructions              #    0.65  insn per cycle           ( +-  0.67% )
     4,408,569,663      branches                  #  495.428 M/sec                    ( +-  0.73% )
        12,040,125      branch-misses             #    0.27% of all branches          ( +-  2.93% )

           1.18260 +- 0.00473 seconds time elapsed  ( +-  0.40% )
           1.19018 +- 0.00441 seconds time elapsed  ( +-  0.37% ) (repeat)
           1.18260 +- 0.00473 seconds time elapsed  ( +-  0.40% ) (repeat)

5.13.0.g60ab3ed-tip-rt +slub-local-lock-v2r3  list_lock=raw_spinlock_t
          9,642.00 msec task-clock                #    7.521 CPUs utilized            ( +-  0.46% )
           462,091      context-switches          #    0.048 M/sec                    ( +-  4.79% )
            44,411      cpu-migrations            #    0.005 M/sec                    ( +-  4.34% )
            12,980      page-faults               #    0.001 M/sec                    ( +-  0.43% )
    36,098,859,429      cycles                    #    3.744 GHz                      ( +-  0.44% )
    25,462,853,462      instructions              #    0.71  insn per cycle           ( +-  0.50% )
     5,260,898,360      branches                  #  545.623 M/sec                    ( +-  0.52% )
        16,088,686      branch-misses             #    0.31% of all branches          ( +-  2.02% )

           1.28207 +- 0.00568 seconds time elapsed  ( +-  0.44% )
           1.28744 +- 0.00713 seconds time elapsed  ( +-  0.55% ) (repeat)
           1.28085 +- 0.00850 seconds time elapsed  ( +-  0.66% ) (repeat)

5.13.0.g60ab3ed-tip-rt +slub-local-lock-v2r3  list_lock=spinlock_t
         10,004.89 msec task-clock                #    6.029 CPUs utilized            ( +-  1.37% )
           654,311      context-switches          #    0.065 M/sec                    ( +-  5.16% )
           211,070      cpu-migrations            #    0.021 M/sec                    ( +-  1.38% )
            13,262      page-faults               #    0.001 M/sec                    ( +-  0.79% )
    36,585,914,931      cycles                    #    3.657 GHz                      ( +-  1.35% )
    27,682,240,511      instructions              #    0.76  insn per cycle           ( +-  1.06% )
     5,766,064,432      branches                  #  576.325 M/sec                    ( +-  1.11% )
        24,269,069      branch-misses             #    0.42% of all branches          ( +-  2.03% )

            1.6595 +- 0.0116 seconds time elapsed  ( +-  0.70% )
            1.6270 +- 0.0180 seconds time elapsed  ( +-  1.11% ) (repeat)
            1.6213 +- 0.0150 seconds time elapsed  ( +-  0.93% ) (repeat)

virgin(ish) tip
5.13.0.g60ab3ed-tip
          7,320.67 msec task-clock                #    7.792 CPUs utilized            ( +-  0.31% )
           221,215      context-switches          #    0.030 M/sec                    ( +-  3.97% )
            16,234      cpu-migrations            #    0.002 M/sec                    ( +-  4.07% )
            13,233      page-faults               #    0.002 M/sec                    ( +-  0.91% )
    27,592,205,252      cycles                    #    3.769 GHz                      ( +-  0.32% )
     8,309,495,040      instructions              #    0.30  insn per cycle           ( +-  0.37% )
     1,555,210,607      branches                  #  212.441 M/sec                    ( +-  0.42% )
         5,484,209      branch-misses             #    0.35% of all branches          ( +-  2.13% )

           0.93949 +- 0.00423 seconds time elapsed  ( +-  0.45% )
           0.94608 +- 0.00384 seconds time elapsed  ( +-  0.41% ) (repeat)
           0.94422 +- 0.00410 seconds time elapsed  ( +-  0.43% )

5.13.0.g60ab3ed-tip +slub-local-lock-v2r3
          7,343.57 msec task-clock                #    7.776 CPUs utilized            ( +-  0.44% )
           223,044      context-switches          #    0.030 M/sec                    ( +-  3.02% )
            16,057      cpu-migrations            #    0.002 M/sec                    ( +-  4.03% )
            13,164      page-faults               #    0.002 M/sec                    ( +-  0.97% )
    27,684,906,017      cycles                    #    3.770 GHz                      ( +-  0.45% )
     8,323,273,871      instructions              #    0.30  insn per cycle           ( +-  0.28% )
     1,556,106,680      branches                  #  211.901 M/sec                    ( +-  0.31% )
         5,463,468      branch-misses             #    0.35% of all branches          ( +-  1.33% )

           0.94440 +- 0.00352 seconds time elapsed  ( +-  0.37% )
           0.94830 +- 0.00228 seconds time elapsed  ( +-  0.24% ) (repeat)
           0.93813 +- 0.00440 seconds time elapsed  ( +-  0.47% ) (repeat)
On Sat, 2021-07-03 at 09:24 +0200, Mike Galbraith wrote:
>
> It also appears to be saying that there's something RT specific to
> stare at in addition to the list_lock business.

The what is ___slab_alloc() consuming 3.9% CPU in tip-rt-slub whereas
it consumes < 1% in both tip-rt (sans slub patches) and tip-slub.

The why remains to ponder.

        5.13.0.g60ab3ed-tip-rt           5.13.0.g60ab3ed-tip-rt-slub      5.13.0.g60ab3ed-tip-slub
25.18%  copy_user_enhanced_fast_string   copy_user_enhanced_fast_string   copy_user_enhanced_fast_string
 5.08%  unix_stream_read_generic         unix_stream_read_generic         unix_stream_read_generic
 3.39%  rt_spin_lock                     *** ___slab_alloc ***            __skb_datagram_iter
 2.80%  __skb_datagram_iter              rt_spin_lock                     _raw_spin_lock
 2.11%  get_page_from_freelist           __skb_datagram_iter              __alloc_skb
 2.01%  skb_release_data                 rt_spin_unlock                   skb_release_data
 1.94%  rt_spin_unlock                   get_page_from_freelist           __alloc_pages
 1.85%  __alloc_skb                      migrate_enable                   unix_stream_sendmsg
 1.68%  __schedule                       skb_release_data                 _raw_spin_lock_irqsave
 1.67%  unix_stream_sendmsg              __schedule                       free_pcppages_bulk
 1.50%  free_pcppages_bulk               unix_stream_sendmsg              __slab_free
 1.38%  migrate_enable                   free_pcppages_bulk               __fget_light
 1.24%  __fget_light                     __alloc_pages                    vfs_write
 1.16%  __slab_free                      migrate_disable                  __schedule
 1.14%  __alloc_pages                    __fget_light                     get_page_from_freelist
 1.10%  fsnotify                         __slab_free                      new_sync_write
 1.07%  kfree                            fsnotify                         fsnotify

5.13.0.g60ab3ed-tip-rt-slub  ___slab_alloc() consumes 3.90%
  0.40 │       mov    0x28(%r13),%edx
  0.42 │       add    %r15,%rdx
       │     __swab():
       │     #endif
       │
       │     static __always_inline unsigned long __swab(const unsigned long y)
       │     {
       │     #if __BITS_PER_LONG == 64
       │     return __swab64(y);
  0.05 │       mov    %rdx,%rax
  1.14 │       bswap  %rax
       │     freelist_ptr():
       │     return (void *)((unsigned long)ptr ^ s->random ^    <== CONFIG_SLAB_FREELIST_HARDENED
  0.72 │       xor    0xb0(%r13),%rax
 65.41 │       xor    (%rdx),%rax    <== huh? miss = 65% of that 3.9% kernel util?
       │     next_tid():
       │     return tid + TID_STEP;
  0.09 │       addq   $0x200,0x48(%r12)
       │     ___slab_alloc():
       │     * freelist is pointing to the list of objects to be used.
       │     * page is pointing to the page from which the objects are obtained.
       │     * That page must be frozen for per cpu allocations to work.
       │     */
       │     VM_BUG_ON(!c->page->frozen);
       │     c->freelist = get_freepointer(s, freelist);
  0.05 │       mov    %rax,0x40(%r12)
       │     c->tid = next_tid(c->tid);
       │     local_unlock_irqrestore(&s->cpu_slab->lock, flags);

5.13.0.g60ab3ed-tip-rt  ___slab_alloc() consumes < 1%
Percent│     }
       │
       │     /* must check again c->freelist in case of cpu migration or IRQ */
       │     freelist = c->freelist;
  0.02 │ a1:   mov    (%r14),%r13
       │     if (freelist)
       │       test   %r13,%r13
  0.02 │     ↓ je     460
       │     get_freepointer():
       │     return freelist_dereference(s, object + s->offset);
  0.23 │ ad:   mov    0x28(%r12),%edx
  0.18 │       add    %r13,%rdx
       │     __swab():
       │     #endif
       │
       │     static __always_inline unsigned long __swab(const unsigned long y)
       │     {
       │     #if __BITS_PER_LONG == 64
       │     return __swab64(y);
  0.06 │       mov    %rdx,%rax
  1.16 │       bswap  %rax
       │     freelist_ptr():
       │     return (void *)((unsigned long)ptr ^ s->random ^
  0.23 │       xor    0xb0(%r12),%rax
 35.25 │       xor    (%rdx),%rax    <== 35% of < 1% kernel util
       │     next_tid():
       │     return tid + TID_STEP;
  0.28 │       addq   $0x200,0x8(%r14)
       │     ___slab_alloc():
       │     * freelist is pointing to the list of objects to be used.
       │     * page is pointing to the page from which the objects are obtained.
       │     * That page must be frozen for per cpu allocations to work.
       │     */
       │     VM_BUG_ON(!c->page->frozen);
       │     c->freelist = get_freepointer(s, freelist);

5.13.0.g60ab3ed-tip-slub  ___slab_alloc() also consumes < 1%
Percent│     load_freelist:
       │
       │     lockdep_assert_held(this_cpu_ptr(&s->cpu_slab->lock));
  0.28 │ 84:   add    this_cpu_off,%rax
       │     get_freepointer():
       │     return freelist_dereference(s, object + s->offset);
  0.14 │       mov    0x28(%r14),%eax
       │     ___slab_alloc():
       │     * freelist is pointing to the list of objects to be used.
       │     * page is pointing to the page from which the objects are obtained.
       │     * That page must be frozen for per cpu allocations to work.
       │     */
       │     VM_BUG_ON(!c->page->frozen);
       │     c->freelist = get_freepointer(s, freelist);
 34.36 │       mov    0x0(%r13,%rax,1),%rax
       │     next_tid():
       │     return tid + TID_STEP;
  0.10 │       addq   $0x1,0x8(%r12)
       │     ___slab_alloc():
       │     c->freelist = get_freepointer(s, freelist);
  0.04 │       mov    %rax,(%r12)
       │     c->tid = next_tid(c->tid);
       │     local_unlock_irqrestore(&s->cpu_slab->lock, flags);
  0.12 │       mov    (%r14),%rax
  0.03 │       add    this_cpu_off,%rax
       │     arch_local_irq_restore():
       │     return arch_irqs_disabled_flags(flags);
       │     }
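The hot "xor (%rdx),%rax" in the annotations above is the hardened
freelist pointer decode. For reference, here is roughly the mm/slub.c
helper the annotation is showing (a sketch based on the freelist_ptr()
line visible in the perf output; simplified):

	static inline void *freelist_ptr(const struct kmem_cache *s, void *ptr,
					 unsigned long ptr_addr)
	{
	#ifdef CONFIG_SLAB_FREELIST_HARDENED
		/*
		 * The stored next-pointer is XORed with a per-cache
		 * random value and the byte-swapped address of its own
		 * slot. Decoding has to load the stored value from the
		 * (likely cache-cold) object itself - that load is the
		 * "xor (%rdx),%rax" soaking up cycles above.
		 */
		return (void *)((unsigned long)ptr ^ s->random ^
				swab((unsigned long)ptr_addr));
	#else
		return ptr;
	#endif
	}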
On Sat, 2021-07-03 at 17:47 +0200, Mike Galbraith wrote:
> On Sat, 2021-07-03 at 09:24 +0200, Mike Galbraith wrote:
> > It also appears to be saying that there's something RT specific to
> > stare at in addition to the list_lock business.
>
> The what is ___slab_alloc() consuming 3.9% CPU in tip-rt-slub whereas
> it consumes < 1% in both tip-rt (sans slub patches) and tip-slub.
>
> The why remains to ponder.

Ignoring the odd distribution in the profile (a red herring, whackable
via prefetch games), and removing perf/debug options from the picture,
the remaining RT-specific cost is just that of the forced slow path.
For hackbench in my little box, it adds up to right at 3%.

	-Mike
On Fri, 2021-07-02 at 20:29 +0200, Sebastian Andrzej Siewior wrote:
>
> > The remaining patches to upstream from the RT tree are small ones
> > related to KConfig. The patch that restricts PREEMPT_RT to SLUB
> > (not SLAB or SLOB) makes sense. The patch that disables
> > CONFIG_SLUB_CPU_PARTIAL with PREEMPT_RT could perhaps be re-
> > evaluated as the series also addresses some latency issues with
> > percpu partial slabs.
>
> With that series the PARTIAL slab can be indeed enabled. I have (had)
> a half done series where I had PARTIAL enabled and noticed a slight
> increase in latency so made it "default y on !RT". It wasn't dramatic
> but appeared to be outside of noise.

I'm seeing warnings/explosions while exercising the box IFF the PARTIAL
slab thingy is enabled. I aborted -PARTIAL after a little over 4 hours,
whereas the longest survival of 4 +PARTIAL runs was 50 minutes, so I'm
fairly confident that PARTIAL really really is the trigger.

My reproducer is a keep-the-box-busy loop of parallel kbuild along with
alternating ltp::controllers (sans swap-from-hell bits), ltp::zram and
hackbench. The warning may or may not trigger before the box explodes.

[  224.232200] ------------[ cut here ]------------
[  224.232202] WARNING: CPU: 3 PID: 9112 at mm/slub.c:2018 ___slab_alloc.constprop.100+0xb09/0x13a0
[  224.232210] Modules linked in: sr_mod(E) cdrom(E) zram(E) btrfs(E) blake2b_generic(E) xor(E) raid6_pq(E) xfs(E) libcrc32c(E) af_packet(E) ip6table_mangle(E) ip6table_raw(E) iptable_raw(E) bridge(E) stp(E) llc(E) iscsi_ibft(E) iscsi_boot_sysfs(E) nfnetlink(E) ebtable_filter(E) rfkill(E) ebtables(E) ip6table_filter(E) ip6_tables(E) iptable_filter(E) ip_tables(E) x_tables(E) bpfilter(E) joydev(E) usblp(E) intel_rapl_msr(E) intel_rapl_common(E) nls_iso8859_1(E) nls_cp437(E) x86_pkg_temp_thermal(E) intel_powerclamp(E) coretemp(E) kvm_intel(E) kvm(E) snd_hda_codec_realtek(E) snd_hda_codec_generic(E) ledtrig_audio(E) snd_hda_codec_hdmi(E) irqbypass(E) crct10dif_pclmul(E) crc32_pclmul(E) ghash_clmulni_intel(E) snd_hda_intel(E) aesni_intel(E) snd_intel_dspcfg(E) crypto_simd(E) iTCO_wdt(E) at24(E) intel_pmc_bxt(E) regmap_i2c(E) mei_hdcp(E) iTCO_vendor_support(E) cryptd(E) snd_hda_codec(E) snd_hwdep(E) pcspkr(E) i2c_i801(E) snd_hda_core(E) i2c_smbus(E) r8169(E) snd_pcm(E) realtek(E) snd_timer(E)
[  224.232253]  mei_me(E) mdio_devres(E) lpc_ich(E) snd(E) libphy(E) mei(E) soundcore(E) mfd_core(E) fan(E) thermal(E) nfsd(E) intel_smartconnect(E) auth_rpcgss(E) nfs_acl(E) lockd(E) grace(E) sch_fq_codel(E) sunrpc(E) fuse(E) configfs(E) hid_logitech_hidpp(E) hid_logitech_dj(E) uas(E) usb_storage(E) hid_generic(E) usbhid(E) nouveau(E) wmi(E) drm_ttm_helper(E) ttm(E) i2c_algo_bit(E) drm_kms_helper(E) syscopyarea(E) sysfillrect(E) sysimgblt(E) fb_sys_fops(E) cec(E) ahci(E) rc_core(E) xhci_pci(E) ehci_pci(E) libahci(E) xhci_hcd(E) ehci_hcd(E) drm(E) libata(E) usbcore(E) video(E) button(E) sd_mod(E) t10_pi(E) vfat(E) fat(E) virtio_blk(E) virtio_mmio(E) virtio_ring(E) virtio(E) ext4(E) crc32c_intel(E) crc16(E) mbcache(E) jbd2(E) sg(E) dm_multipath(E) dm_mod(E) scsi_dh_rdac(E) scsi_dh_emc(E) scsi_dh_alua(E) scsi_mod(E) msr(E) autofs4(E)
[  224.232295] CPU: 3 PID: 9112 Comm: dd Kdump: loaded Tainted: G            E     5.13.0.g60ab3ed-tip-rt-slub #25
[  224.232297] Hardware name: MEDION MS-7848/MS-7848, BIOS M7848W08.20C 09/23/2013
[  224.232299] RIP: 0010:___slab_alloc.constprop.100+0xb09/0x13a0
[  224.232302] Code: 48 8b 7d 88 4c 89 ee 8b 47 28 4c 01 f8 48 89 c2 48 0f ca 48 33 97 b8 00 00 00 48 33 10 e8 cf e5 ff ff e9 b5 00 00 00 4d 89 ce <0f> 0b e9 2d fa ff ff 8b 53 38 8b 05 8f 47 4a 01 48 8b 75 80 83 c2
[  224.232303] RSP: 0018:ffff8882047e3ba0 EFLAGS: 00010046
[  224.232305] RAX: ffff888100040f80 RBX: 0000000000000000 RCX: 0000000080190019
[  224.232307] RDX: ffffea0005d59a08 RSI: 0000000000000000 RDI: ffffffff8177ab54
[  224.232308] RBP: ffff8882047e3c70 R08: ffffffffffffffff R09: 0000000000000000
[  224.232309] R10: ffff8882047e3c90 R11: 0000000000000000 R12: ffffea0004d60780
[  224.232310] R13: 0000000000000019 R14: 0000000000000000 R15: 0000000080190019
[  224.232311] FS:  00007f88ffa09580(0000) GS:ffff88840ecc0000(0000) knlGS:0000000000000000
[  224.232312] CS:  0010 DS: 0000 ES: 0000 CR0: 0000000080050033
[  224.232313] CR2: 00007f88ffa4a000 CR3: 000000015606a005 CR4: 00000000001706e0
[  224.232315] Call Trace:
[  224.232318]  ? __alloc_file+0x26/0xf0
[  224.232322]  ? prep_new_page+0xab/0x100
[  224.232326]  ? __alloc_file+0x26/0xf0
[  224.232329]  ? __slab_alloc.isra.87.constprop.99+0x2e/0x40
[  224.232331]  __slab_alloc.isra.87.constprop.99+0x2e/0x40
[  224.232333]  ? __alloc_file+0x26/0xf0
[  224.232335]  kmem_cache_alloc+0x4d/0x160
[  224.232338]  __alloc_file+0x26/0xf0
[  224.232340]  alloc_empty_file+0x43/0xe0
[  224.232343]  path_openat+0x35/0xe30
[  224.232345]  ? getname_flags+0x32/0x170
[  224.232347]  ? ___slab_alloc.constprop.100+0xbbb/0x13a0
[  224.232349]  ? filemap_map_pages+0xf0/0x3e0
[  224.232352]  do_filp_open+0xa2/0x100
[  224.232355]  ? __slab_alloc.isra.87.constprop.99+0x36/0x40
[  224.232357]  ? __slab_alloc.isra.87.constprop.99+0x36/0x40
[  224.232359]  ? getname_flags+0x32/0x170
[  224.232361]  ? alloc_fd+0xe2/0x1b0
[  224.232363]  ? do_sys_openat2+0x248/0x310
[  224.232365]  do_sys_openat2+0x248/0x310
[  224.232368]  do_sys_open+0x47/0x60
[  224.232370]  do_syscall_64+0x37/0x80
[  224.232373]  entry_SYSCALL_64_after_hwframe+0x44/0xae
[  224.232376] RIP: 0033:0x7f88ff54ab41
[  224.232378] Code: 41 83 e2 40 48 89 54 24 30 75 3e 89 f0 25 00 00 41 00 3d 00 00 41 00 74 30 89 f2 b8 01 01 00 00 48 89 fe bf 9c ff ff ff 0f 05 <48> 3d 00 f0 ff ff 77 3f 48 8b 4c 24 18 64 48 33 0c 25 28 00 00 00
[  224.232379] RSP: 002b:00007ffc0916a720 EFLAGS: 00000287 ORIG_RAX: 0000000000000101
[  224.232381] RAX: ffffffffffffffda RBX: 0000558d793efcf0 RCX: 00007f88ff54ab41
[  224.232382] RDX: 0000000000080000 RSI: 0000558d793efcb0 RDI: 00000000ffffff9c
[  224.232383] RBP: 00007ffc0916a860 R08: 0000000000000000 R09: 00007ffc0916a873
[  224.232384] R10: 0000000000000000 R11: 0000000000000287 R12: 0000558d793efa88
[  224.232385] R13: 000000000000000b R14: 0000558d793efa60 R15: 000000000000000b
[  224.232387] ---[ end trace 0000000000000002 ]---

...

[ 2010.420941] general protection fault, maybe for address 0xffffea00054ca398: 0000 [#1] PREEMPT_RT SMP NOPTI
[ 2010.420946] CPU: 4 PID: 17541 Comm: dd Kdump: loaded Tainted: G        W   E     5.13.0.g60ab3ed-tip-rt-slub #25
[ 2010.420948] Hardware name: MEDION MS-7848/MS-7848, BIOS M7848W08.20C 09/23/2013
[ 2010.420949] RIP: 0010:___slab_alloc.constprop.100+0x52b/0x13a0
[ 2010.420955] Code: b7 44 24 2a 31 db 66 25 ff 7f 66 89 45 b8 48 8b 45 88 80 4d bb 80 48 8b 4d b8 f6 40 0b 40 0f 84 5b 02 00 00 48 89 f0 4c 89 fa <f0> 49 0f c7 4c 24 20 0f 84 8f 01 00 00 4d 89 ce f3 90 48 8b b5 30
[ 2010.420957] RSP: 0018:ffff888102cf7ba0 EFLAGS: 00010002
[ 2010.420959] RAX: ffff888100047e00 RBX: 0000000000000000 RCX: 0000000080000000
[ 2010.420961] RDX: 0000000000000000 RSI: ffff888100047e00 RDI: ffff888100040f80
[ 2010.420962] RBP: ffff888102cf7c70 R08: ffffffffffffffff R09: 0000000000000000
[ 2010.420963] R10: ffff888102cf7c90 R11: 0000000000000000 R12: ffffea00054ca378
[ 2010.420964] R13: 0000000000000000 R14: 80000000000101f8 R15: 0000000000000000
[ 2010.420966] FS:  00007f9e8edbb580(0000) GS:ffff88840ed00000(0000) knlGS:0000000000000000
[ 2010.420967] CS:  0010 DS: 0000 ES: 0000 CR0: 0000000080050033
[ 2010.420968] CR2: 00007f9e8edfa000 CR3: 0000000102ede004 CR4: 00000000001706e0
[ 2010.420970] Call Trace:
[ 2010.420973]  ? __alloc_file+0x26/0xf0
[ 2010.420977]  ? prep_new_page+0xab/0x100
[ 2010.420982]  ? vm_area_alloc+0x1a/0x60
[ 2010.420984]  ? __alloc_file+0x26/0xf0
[ 2010.420987]  ? __slab_alloc.isra.87.constprop.99+0x2e/0x40
[ 2010.420989]  __slab_alloc.isra.87.constprop.99+0x2e/0x40
[ 2010.420992]  ? __alloc_file+0x26/0xf0
[ 2010.420994]  kmem_cache_alloc+0x4d/0x160
[ 2010.420996]  __alloc_file+0x26/0xf0
[ 2010.420999]  alloc_empty_file+0x43/0xe0
[ 2010.421002]  path_openat+0x35/0xe30
[ 2010.421004]  ? getname_flags+0x32/0x170
[ 2010.421006]  ? ___slab_alloc.constprop.100+0xbbb/0x13a0
[ 2010.421008]  ? filemap_map_pages+0xf0/0x3e0
[ 2010.421012]  do_filp_open+0xa2/0x100
[ 2010.421015]  ? __handle_mm_fault+0x8b8/0xa40
[ 2010.421017]  ? __slab_alloc.isra.87.constprop.99+0x36/0x40
[ 2010.421019]  ? __slab_alloc.isra.87.constprop.99+0x36/0x40
[ 2010.421021]  ? getname_flags+0x32/0x170
[ 2010.421023]  ? alloc_fd+0xe2/0x1b0
[ 2010.421026]  ? do_sys_openat2+0x248/0x310
[ 2010.421028]  do_sys_openat2+0x248/0x310
[ 2010.421030]  do_sys_open+0x47/0x60
[ 2010.421033]  do_syscall_64+0x37/0x80
[ 2010.421035]  entry_SYSCALL_64_after_hwframe+0x44/0xae
[ 2010.421039] RIP: 0033:0x7f9e8e8fcb41
[ 2010.421041] Code: 41 83 e2 40 48 89 54 24 30 75 3e 89 f0 25 00 00 41 00 3d 00 00 41 00 74 30 89 f2 b8 01 01 00 00 48 89 fe bf 9c ff ff ff 0f 05 <48> 3d 00 f0 ff ff 77 3f 48 8b 4c 24 18 64 48 33 0c 25 28 00 00 00
[ 2010.421043] RSP: 002b:00007ffe8964b7b0 EFLAGS: 00000287 ORIG_RAX: 0000000000000101
[ 2010.421044] RAX: ffffffffffffffda RBX: 0000562e1e31b450 RCX: 00007f9e8e8fcb41
[ 2010.421045] RDX: 0000000000080000 RSI: 0000562e1e31b230 RDI: 00000000ffffff9c
[ 2010.421047] RBP: 00007ffe8964b8f0 R08: 0000000000000000 R09: 00007ffe8964b903
[ 2010.421048] R10: 0000000000000000 R11: 0000000000000287 R12: 0000562e1e31b400
[ 2010.421049] R13: 0000000000000009 R14: 0000562e1e31b3e0 R15: 0000000000000009
[ 2010.421051] Modules linked in: zram(E) sr_mod(E) cdrom(E) btrfs(E) blake2b_generic(E) xor(E) raid6_pq(E) xfs(E) libcrc32c(E) af_packet(E) ip6table_mangle(E) ip6table_raw(E) iptable_raw(E) bridge(E) stp(E) llc(E) iscsi_ibft(E) iscsi_boot_sysfs(E) nfnetlink(E) ebtable_filter(E) rfkill(E) ebtables(E) ip6table_filter(E) ip6_tables(E) iptable_filter(E) ip_tables(E) x_tables(E) bpfilter(E) joydev(E) usblp(E) intel_rapl_msr(E) intel_rapl_common(E) nls_iso8859_1(E) nls_cp437(E) x86_pkg_temp_thermal(E) intel_powerclamp(E) coretemp(E) kvm_intel(E) kvm(E) snd_hda_codec_realtek(E) snd_hda_codec_generic(E) ledtrig_audio(E) snd_hda_codec_hdmi(E) irqbypass(E) crct10dif_pclmul(E) crc32_pclmul(E) ghash_clmulni_intel(E) snd_hda_intel(E) aesni_intel(E) snd_intel_dspcfg(E) crypto_simd(E) iTCO_wdt(E) at24(E) intel_pmc_bxt(E) regmap_i2c(E) mei_hdcp(E) iTCO_vendor_support(E) cryptd(E) snd_hda_codec(E) snd_hwdep(E) pcspkr(E) i2c_i801(E) snd_hda_core(E) i2c_smbus(E) r8169(E) snd_pcm(E) realtek(E) snd_timer(E)
[ 2010.421093]  mei_me(E) mdio_devres(E) lpc_ich(E) snd(E) libphy(E) mei(E) soundcore(E) mfd_core(E) fan(E) thermal(E) nfsd(E) intel_smartconnect(E) auth_rpcgss(E) nfs_acl(E) lockd(E) grace(E) sch_fq_codel(E) sunrpc(E) fuse(E) configfs(E) hid_logitech_hidpp(E) hid_logitech_dj(E) uas(E) usb_storage(E) hid_generic(E) usbhid(E) nouveau(E) wmi(E) drm_ttm_helper(E) ttm(E) i2c_algo_bit(E) drm_kms_helper(E) syscopyarea(E) sysfillrect(E) sysimgblt(E) fb_sys_fops(E) cec(E) ahci(E) rc_core(E) xhci_pci(E) ehci_pci(E) libahci(E) xhci_hcd(E) ehci_hcd(E) drm(E) libata(E) usbcore(E) video(E) button(E) sd_mod(E) t10_pi(E) vfat(E) fat(E) virtio_blk(E) virtio_mmio(E) virtio_ring(E) virtio(E) ext4(E) crc32c_intel(E) crc16(E) mbcache(E) jbd2(E) sg(E) dm_multipath(E) dm_mod(E) scsi_dh_rdac(E) scsi_dh_emc(E) scsi_dh_alua(E) scsi_mod(E) msr(E) autofs4(E) [last unloaded: zram]
[ 2010.421138] Dumping ftrace buffer:
[ 2010.421141]    (ftrace buffer empty)
On Mon, 2021-07-05 at 18:00 +0200, Mike Galbraith wrote:
> On Fri, 2021-07-02 at 20:29 +0200, Sebastian Andrzej Siewior wrote:
> >
> > > The remaining patches to upstream from the RT tree are small ones
> > > related to KConfig. The patch that restricts PREEMPT_RT to SLUB
> > > (not SLAB or SLOB) makes sense. The patch that disables
> > > CONFIG_SLUB_CPU_PARTIAL with PREEMPT_RT could perhaps be re-
> > > evaluated as the series also addresses some latency issues with
> > > percpu partial slabs.
> >
> > With that series the PARTIAL slab can be indeed enabled. I have (had)
> > a half done series where I had PARTIAL enabled and noticed a slight
> > increase in latency so made it "default y on !RT". It wasn't
> > dramatic but appeared to be outside of noise.
>
> I'm seeing warnings/explosions while exercising the box IFF the PARTIAL
> slab thingy is enabled. I aborted -PARTIAL after a little over 4 hours,
> whereas the longest survival of 4 +PARTIAL runs was 50 minutes, so
> I'm fairly confident that PARTIAL really really is the trigger.

Resurrecting local exclusion around unfreeze_partials() seems to have
put an end to that. Guess I can chop these trees down now.

---
 mm/slub.c | 13 ++++++++-----
 1 file changed, 8 insertions(+), 5 deletions(-)

--- a/mm/slub.c
+++ b/mm/slub.c
@@ -2497,7 +2497,9 @@ static void put_cpu_partial(struct kmem_
 				 * partial array is full. Move the existing
 				 * set to the per node partial list.
 				 */
+				local_lock(&s->cpu_slab->lock);
 				unfreeze_partials(s);
+				local_unlock(&s->cpu_slab->lock);
 				oldpage = NULL;
 				pobjects = 0;
 				pages = 0;
@@ -2579,7 +2581,9 @@ static void flush_cpu_slab(struct work_s
 	if (c->page)
 		flush_slab(s, c, true);
 
+	local_lock(&s->cpu_slab->lock);
 	unfreeze_partials(s);
+	local_unlock(&s->cpu_slab->lock);
 }
 
 static bool has_cpu_slab(int cpu, struct kmem_cache *s)
@@ -3358,13 +3362,12 @@ static __always_inline void do_slab_free
 	 * we need to take the local_lock. We shouldn't simply defer to
 	 * __slab_free() as that wouldn't use the cpu freelist at all.
 	 */
-	unsigned long flags;
 	void **freelist;
 
-	local_lock_irqsave(&s->cpu_slab->lock, flags);
+	local_lock(&s->cpu_slab->lock);
 	c = this_cpu_ptr(s->cpu_slab);
 	if (unlikely(page != c->page)) {
-		local_unlock_irqrestore(&s->cpu_slab->lock, flags);
+		local_unlock(&s->cpu_slab->lock);
 		goto redo;
 	}
 	tid = c->tid;
@@ -3374,7 +3377,7 @@ static __always_inline void do_slab_free
 	c->freelist = head;
 	c->tid = next_tid(tid);
 
-	local_unlock_irqrestore(&s->cpu_slab->lock, flags);
+	local_unlock(&s->cpu_slab->lock);
 #endif
 	stat(s, FREE_FASTPATH);
 } else
@@ -3601,7 +3604,7 @@ int kmem_cache_alloc_bulk(struct kmem_ca
 		slab_want_init_on_alloc(flags, s));
 	return i;
 error:
-	local_unlock_irq(&s->cpu_slab->lock);
+	slub_put_cpu_ptr(s->cpu_slab);
 	slab_post_alloc_hook(s, objcg, flags, i, p, false);
 	__kmem_cache_free_bulk(s, i, p);
 	return 0;
On 7/3/21 9:24 AM, Mike Galbraith wrote:
> On Fri, 2021-07-02 at 20:29 +0200, Sebastian Andrzej Siewior wrote:
>> I replaced my slub changes with slub-local-lock-v2r3.
>> I haven't seen any complaints from lockdep or so, which is good. Then I
>> did this with RT enabled (and no debug):
>
> Below is some raw hackbench data from my little i4790 desktop box. It
> says we'll definitely still want list_lock to be raw.

Hi Mike, thanks a lot for the testing, sorry for the late reply.

Did you try, instead of raw list_lock, not applying the last, local lock
patch, as I suggested in reply to bigeasy? I think the impact on reducing
the RT-specific overhead would be larger (than raw list_lock), the result
should still be RT compatible, and it would also deal with the bugs you
found there... (which I'll look into).

Thanks,
Vlastimil

> It also appears to be saying that there's something RT specific to
> stare at in addition to the list_lock business, but add a pinch of salt
> to that due to the config of the virgin(ish) tip tree being much
> lighter than the enterprise(ish) config of the tip-rt tree.
>
> perf stat -r10 hackbench -s4096 -l500
> full warmup, record, repeat twice for elapsed
>
> 5.13.0.g60ab3ed-tip-rt
> [...]
> 1.18260 +- 0.00473 seconds time elapsed  ( +-  0.40% )
>
> 5.13.0.g60ab3ed-tip-rt +slub-local-lock-v2r3  list_lock=raw_spinlock_t
> [...]
> 1.28207 +- 0.00568 seconds time elapsed  ( +-  0.44% )
>
> 5.13.0.g60ab3ed-tip-rt +slub-local-lock-v2r3  list_lock=spinlock_t
> [...]
> 1.6595 +- 0.0116 seconds time elapsed  ( +-  0.70% )
>
> virgin(ish) tip
> 5.13.0.g60ab3ed-tip
> [...]
> 0.93949 +- 0.00423 seconds time elapsed  ( +-  0.45% )
>
> 5.13.0.g60ab3ed-tip +slub-local-lock-v2r3
> [...]
> 0.94440 +- 0.00352 seconds time elapsed  ( +-  0.37% )
On Sun, 2021-07-18 at 09:41 +0200, Vlastimil Babka wrote:
> On 7/3/21 9:24 AM, Mike Galbraith wrote:
> > On Fri, 2021-07-02 at 20:29 +0200, Sebastian Andrzej Siewior wrote:
> > > I replaced my slub changes with slub-local-lock-v2r3.
> > > I haven't seen any complaints from lockdep or so, which is good.
> > > Then I did this with RT enabled (and no debug):
> >
> > Below is some raw hackbench data from my little i4790 desktop
> > box. It says we'll definitely still want list_lock to be raw.
>
> Hi Mike, thanks a lot for the testing, sorry for the late reply.
>
> Did you try, instead of raw list_lock, not applying the last, local
> lock patch, as I suggested in reply to bigeasy?

No, but I suppose I can give that a go.

	-Mike
On Sun, 2021-07-18 at 10:29 +0200, Mike Galbraith wrote:
> On Sun, 2021-07-18 at 09:41 +0200, Vlastimil Babka wrote:
> > On 7/3/21 9:24 AM, Mike Galbraith wrote:
> > > On Fri, 2021-07-02 at 20:29 +0200, Sebastian Andrzej Siewior wrote:
> > > > I replaced my slub changes with slub-local-lock-v2r3.
> > > > I haven't seen any complaints from lockdep or so, which is good.
> > > > Then I did this with RT enabled (and no debug):
> > >
> > > Below is some raw hackbench data from my little i4790 desktop
> > > box. It says we'll definitely still want list_lock to be raw.
> >
> > Hi Mike, thanks a lot for the testing, sorry for the late reply.
> >
> > Did you try, instead of raw list_lock, not applying the last, local
> > lock patch, as I suggested in reply to bigeasy?
>
> No, but I suppose I can give that a go.

The kernel has of course moved forward, so measure 'em again, but
starting before replacing 5.12-rt patches with slub-local-lock-v2r3.

box is old i4790 desktop
perf stat -r10 hackbench -s4096 -l500
full warmup, record, repeat twice for elapsed

Config has only SLUB+SLUB_DEBUG, as originally measured.

5.14.0.g79e92006-tip-rt (5.12-rt, 5.13-rt didn't exist when first measured)
          7,984.52 msec task-clock                #    7.565 CPUs utilized            ( +-  0.66% )
           353,566      context-switches          #   44.281 K/sec                    ( +-  2.77% )
            37,685      cpu-migrations            #    4.720 K/sec                    ( +-  6.37% )
            12,939      page-faults               #    1.620 K/sec                    ( +-  0.67% )
    29,901,079,227      cycles                    #    3.745 GHz                      ( +-  0.71% )
    14,550,797,818      instructions              #    0.49  insn per cycle           ( +-  0.47% )
     3,056,685,643      branches                  #  382.826 M/sec                    ( +-  0.51% )
         9,598,083      branch-misses             #    0.31% of all branches          ( +-  2.11% )

           1.05542 +- 0.00409 seconds time elapsed  ( +-  0.39% )
           1.05990 +- 0.00244 seconds time elapsed  ( +-  0.23% ) (repeat)
           1.05367 +- 0.00303 seconds time elapsed  ( +-  0.29% ) (repeat)

5.14.0.g79e92006-tip-rt +slub-local-lock-v2r3 -0034-mm-slub-convert-kmem_cpu_slab-protection-to-local_lock.patch
          6,899.35 msec task-clock                #    5.637 CPUs utilized            ( +-  0.53% )
           420,304      context-switches          #   60.919 K/sec                    ( +-  2.83% )
           187,130      cpu-migrations            #   27.123 K/sec                    ( +-  1.81% )
            13,206      page-faults               #    1.914 K/sec                    ( +-  0.96% )
    25,110,362,933      cycles                    #    3.640 GHz                      ( +-  0.49% )
    15,853,643,635      instructions              #    0.63  insn per cycle           ( +-  0.64% )
     3,366,261,524      branches                  #  487.910 M/sec                    ( +-  0.70% )
        14,839,618      branch-misses             #    0.44% of all branches          ( +-  2.01% )

           1.22390 +- 0.00744 seconds time elapsed  ( +-  0.61% )
           1.21813 +- 0.00907 seconds time elapsed  ( +-  0.74% ) (repeat)
           1.22097 +- 0.00952 seconds time elapsed  ( +-  0.78% ) (repeat)

repeat of above with raw list_lock
          8,072.62 msec task-clock                #    7.605 CPUs utilized            ( +-  0.49% )
           359,514      context-switches          #   44.535 K/sec                    ( +-  4.95% )
            35,285      cpu-migrations            #    4.371 K/sec                    ( +-  5.82% )
            13,503      page-faults               #    1.673 K/sec                    ( +-  0.96% )
    30,247,989,681      cycles                    #    3.747 GHz                      ( +-  0.52% )
    14,580,011,391      instructions              #    0.48  insn per cycle           ( +-  0.81% )
     3,063,743,405      branches                  #  379.523 M/sec                    ( +-  0.85% )
         8,907,160      branch-misses             #    0.29% of all branches          ( +-  3.99% )

           1.06150 +- 0.00427 seconds time elapsed  ( +-  0.40% )
           1.05041 +- 0.00176 seconds time elapsed  ( +-  0.17% ) (repeat)
           1.06086 +- 0.00237 seconds time elapsed  ( +-  0.22% ) (repeat)

5.14.0.g79e92006-rt3-tip-rt +slub-local-lock-v2r3 full set
          7,598.44 msec task-clock                #    5.813 CPUs utilized            ( +-  0.85% )
           488,161      context-switches          #   64.245 K/sec                    ( +-  4.29% )
           196,866      cpu-migrations            #   25.909 K/sec                    ( +-  1.49% )
            13,042      page-faults               #    1.716 K/sec                    ( +-  0.73% )
    27,695,116,746      cycles                    #    3.645 GHz                      ( +-  0.79% )
    18,423,934,168      instructions              #    0.67  insn per cycle           ( +-  0.88% )
     3,969,540,695      branches                  #  522.415 M/sec                    ( +-  0.92% )
        15,493,482      branch-misses             #    0.39% of all branches          ( +-  2.15% )

           1.30709 +- 0.00890 seconds time elapsed  ( +-  0.68% )
            1.3205 +- 0.0134 seconds time elapsed  ( +-  1.02% ) (repeat)
            1.3083 +- 0.0132 seconds time elapsed  ( +-  1.01% ) (repeat)
now that I'm slowly catching up…

On 2021-07-02 22:25:05 [+0200], Vlastimil Babka wrote:
> > - perf_5.10 stat -r 10 hackbench -g200 -s 4096 -l500
> >   Old:
> >   | 464.967,20 msec task-clock    #   27,220 CPUs utilized   ( +- 0,16% )
> >   New:
> >   | 422.865,71 msec task-clock    #    4,782 CPUs utilized   ( +- 0,34% )
>
> The series shouldn't significantly change the memory allocator
> interaction, though.
> Seems there's less cycles, but more time elapsed, thus more sleeping -
> is it locks becoming mutexes on RT?

yes, most likely, since the !RT parts are mostly unchanged.

> My second guess - list_lock remains a spinlock with my series, thus an
> RT mutex, but the current RT tree converts it to a raw_spinlock. I'd
> hope leaving that one as a non-raw spinlock would still be much better
> for RT goals, even if hackbench (which is AFAIK very slab intensive)
> throughput regresses - hopefully not that much.

Yes, the list_lock seems to be the case. I picked your
slub-local-lock-v3r0 and changed the list_lock (+slab_lock()) to use
raw_spinlock_t and disable interrupts, and CPU utilisation went to
~23 CPUs (plus a bunch of warnings which probably made it a little
slower again).

The difference between a sleeping lock (spinlock_t) and a mutex is
that we attempt not to preempt a task that acquired a spinlock_t, even
if it is running for some time and the scheduler would preempt it (like
it would do if the task had a mutex acquired; these are the "lazy
preempt" bits in the RT patch).

By making the list_lock a raw_spinlock_t, a lot of IRQ-flags dancing
needs to be done, as the page allocator must be entered with interrupts
enabled. And then there is the possibility that you may need to free
some memory even while you allocate memory, which requires some extra
steps on RT due to the IRQ-off part. All this vanishes by keeping
list_lock a spinlock_t.

The kernel-build test on /dev/shm remained unchanged, so that is good.
Unless there is a real-world use-case that gets worse, I don't mind
keeping the spinlock_t here. I haven't seen tglx complaining so far.

Sebastian
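A rough sketch of the "IRQ-flags dancing" a raw list_lock would force
(illustrative control flow only; try_get_partial() and
allocate_new_slab() are hypothetical stand-ins, not SLUB functions):

	static void *alloc_under_raw_list_lock(struct kmem_cache *s,
					       struct kmem_cache_node *n,
					       gfp_t gfpflags, int node)
	{
		unsigned long flags;
		void *object;

		/* raw_spinlock_t hard-disables interrupts even on RT ... */
		raw_spin_lock_irqsave(&n->list_lock, flags);
		object = try_get_partial(n);	/* hypothetical helper */
		raw_spin_unlock_irqrestore(&n->list_lock, flags);

		if (!object) {
			/*
			 * ... but the page allocator must be entered
			 * with interrupts enabled on RT, so the lock
			 * must be dropped and the flags restored first.
			 */
			object = allocate_new_slab(s, gfpflags, node); /* hypothetical */
		}
		return object;
	}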
On 7/29/21 3:49 PM, Sebastian Andrzej Siewior wrote:
> now that I'm slowly catching up…
>
> On 2021-07-02 22:25:05 [+0200], Vlastimil Babka wrote:
>> The series shouldn't significantly change the memory allocator
>> interaction, though.
>> Seems there's less cycles, but more time elapsed, thus more sleeping -
>> is it locks becoming mutexes on RT?
>
> yes, most likely, since the !RT parts are mostly unchanged.
>
>> My second guess - list_lock remains a spinlock with my series, thus an
>> RT mutex, but the current RT tree converts it to a raw_spinlock. I'd
>> hope leaving that one as a non-raw spinlock would still be much better
>> for RT goals, even if hackbench (which is AFAIK very slab intensive)
>> throughput regresses - hopefully not that much.
>
> Yes, the list_lock seems to be the case. I picked your
> slub-local-lock-v3r0 and changed the list_lock (+slab_lock()) to use
> raw_spinlock_t and disable interrupts, and CPU utilisation went to
> ~23 CPUs (plus a bunch of warnings which probably made it a little
> slower again).

I forgot to point that out in the cover letter, but with v3 this change
to raw_spinlock_t is AFAICS no longer possible (at least with
CONFIG_SLUB_CPU_PARTIAL) because in put_cpu_partial() we now take the
local_lock, and it can be called from get_partial_node() which takes the
list_lock.

> The difference between a sleeping lock (spinlock_t) and a mutex is
> that we attempt not to preempt a task that acquired a spinlock_t, even
> if it is running for some time and the scheduler would preempt it (like
> it would do if the task had a mutex acquired; these are the "lazy
> preempt" bits in the RT patch).
>
> By making the list_lock a raw_spinlock_t, a lot of IRQ-flags dancing
> needs to be done, as the page allocator must be entered with interrupts
> enabled.

Hm, but SLUB should never call the page allocator from under list_lock
in my series?

> And then there is the possibility that you may need to free
> some memory even while you allocate memory, which requires some extra
> steps on RT due to the IRQ-off part. All this vanishes by keeping
> list_lock a spinlock_t.
> The kernel-build test on /dev/shm remained unchanged, so that is good.
> Unless there is a real-world use-case that gets worse, I don't mind
> keeping the spinlock_t here. I haven't seen tglx complaining so far.

Good. IIRC hackbench is very close to being a slab microbenchmark, so
regressions there are expected, but they should not translate to notable
real world regressions.

> Sebastian
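To make the lock-ordering point concrete, the nesting in v3 looks roughly
like this (call chain reconstructed from the paragraph above, not copied
from the series):

	/*
	 * get_partial_node()
	 *   spin_lock_irqsave(&n->list_lock, flags);   <- list_lock held
	 *     put_cpu_partial()
	 *       local_lock(&s->cpu_slab->lock);        <- local_lock nested inside
	 *
	 * On PREEMPT_RT both locks are sleeping locks, so this nesting is
	 * legal. If list_lock were a raw_spinlock_t, the inner local_lock
	 * (a sleeping lock on RT) would be acquired inside a non-sleeping
	 * raw lock, which is invalid on RT - hence the raw conversion is
	 * ruled out.
	 */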
On 2021-07-29 16:17:26 [+0200], Vlastimil Babka wrote:
> I forgot to point that out in the cover letter, but with v3 this change
> to raw_spinlock_t is AFAICS no longer possible (at least with
> CONFIG_SLUB_CPU_PARTIAL) because in put_cpu_partial() we now take the
> local_lock, and it can be called from get_partial_node() which takes
> the list_lock.

I saw increased latency numbers with CONFIG_SLUB_CPU_PARTIAL before it
got disabled for other reasons, so I'm not too sad if it remains
disabled.

> Hm, but SLUB should never call the page allocator from under list_lock
> in my series?

Oh yes. I ran into CPU_PARTIAL instead. Sorry for the confusion. So
without PARTIAL it says "24,643 CPUs utilized" and no warnings :)

Sebastian