Message ID: 20210408035736.883861-1-guro@fb.com
Series: percpu: partial chunk depopulation
Hello Roman,

I've tried the v3 patch series on a POWER9 and an x86 KVM setup.

My results of the percpu_test are as follows:

Intel KVM 4CPU:4G
Vanilla 5.12-rc6
# ./percpu_test.sh
Percpu: 1952 kB
Percpu: 219648 kB
Percpu: 219648 kB

5.12-rc6 + with patchset applied
# ./percpu_test.sh
Percpu: 2080 kB
Percpu: 219712 kB
Percpu: 72672 kB

I'm able to see an improvement comparable to the one you're seeing too.

However, on POWERPC I'm unable to reproduce these improvements with the
patchset in the same configuration.

POWER9 KVM 4CPU:4G
Vanilla 5.12-rc6
# ./percpu_test.sh
Percpu: 5888 kB
Percpu: 118272 kB
Percpu: 118272 kB

5.12-rc6 + with patchset applied
# ./percpu_test.sh
Percpu: 6144 kB
Percpu: 119040 kB
Percpu: 119040 kB

I'm wondering if there's any architecture-specific code that needs plumbing
here?

I will also look through the code to find the reason why POWER isn't
depopulating pages.

Thank you,
Pratik

On 08/04/21 9:27 am, Roman Gushchin wrote:
> In our production experience the percpu memory allocator is sometimes struggling
> with returning the memory to the system. A typical example is a creation of
> several thousands memory cgroups (each has several chunks of the percpu data
> used for vmstats, vmevents, ref counters etc). Deletion and complete releasing
> of these cgroups doesn't always lead to a shrinkage of the percpu memory,
> so that sometimes there are several GB's of memory wasted.
>
> The underlying problem is the fragmentation: to release an underlying chunk
> all percpu allocations should be released first. The percpu allocator tends
> to top up chunks to improve the utilization. It means new small-ish allocations
> (e.g. percpu ref counters) are placed onto almost filled old-ish chunks,
> effectively pinning them in memory.
>
> This patchset solves this problem by implementing a partial depopulation
> of percpu chunks: chunks with many empty pages are being asynchronously
> depopulated and the pages are returned to the system.
>
> To illustrate the problem the following script can be used:
>
> --
> #!/bin/bash
>
> cd /sys/fs/cgroup
>
> mkdir percpu_test
> echo "+memory" > percpu_test/cgroup.subtree_control
>
> cat /proc/meminfo | grep Percpu
>
> for i in `seq 1 1000`; do
>     mkdir percpu_test/cg_"${i}"
>     for j in `seq 1 10`; do
>         mkdir percpu_test/cg_"${i}"_"${j}"
>     done
> done
>
> cat /proc/meminfo | grep Percpu
>
> for i in `seq 1 1000`; do
>     for j in `seq 1 10`; do
>         rmdir percpu_test/cg_"${i}"_"${j}"
>     done
> done
>
> sleep 10
>
> cat /proc/meminfo | grep Percpu
>
> for i in `seq 1 1000`; do
>     rmdir percpu_test/cg_"${i}"
> done
>
> rmdir percpu_test
> --
>
> It creates 11000 memory cgroups and removes every 10 out of 11.
> It prints the initial size of the percpu memory, the size after
> creating all cgroups and the size after deleting most of them.
>
> Results:
> vanilla:
> ./percpu_test.sh
> Percpu: 7488 kB
> Percpu: 481152 kB
> Percpu: 481152 kB
>
> with this patchset applied:
> ./percpu_test.sh
> Percpu: 7488 kB
> Percpu: 481408 kB
> Percpu: 135552 kB
>
> So the total size of the percpu memory was reduced by more than 3.5 times.
>
> v3:
> - introduced pcpu_check_chunk_hint()
> - fixed a bug related to the hint check
> - minor cosmetic changes
> - s/pretends/fixes (cc Vlastimil)
>
> v2:
> - depopulated chunks are sidelined
> - depopulation happens in the reverse order
> - depopulate list made per-chunk type
> - better results due to better heuristics
>
> v1:
> - depopulation heuristics changed and optimized
> - chunks are put into a separate list, depopulation scan this list
> - chunk->isolated is introduced, chunk->depopulate is dropped
> - rearranged patches a bit
> - fixed a panic discovered by krobot
> - made pcpu_nr_empty_pop_pages per chunk type
> - minor fixes
>
> rfc:
> https://lwn.net/Articles/850508/
>
>
> Roman Gushchin (6):
>   percpu: fix a comment about the chunks ordering
>   percpu: split __pcpu_balance_workfn()
>   percpu: make pcpu_nr_empty_pop_pages per chunk type
>   percpu: generalize pcpu_balance_populated()
>   percpu: factor out pcpu_check_chunk_hint()
>   percpu: implement partial chunk depopulation
>
>  mm/percpu-internal.h |   4 +-
>  mm/percpu-stats.c    |   9 +-
>  mm/percpu.c          | 306 +++++++++++++++++++++++++++++++++++--------
>  3 files changed, 261 insertions(+), 58 deletions(-)
>
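For readers following along: the "reduced by more than 3.5 times" figure in the cover letter is simply the second Percpu reading divided by the third. A small wrapper around the quoted percpu_test.sh (illustrative only, not part of the series) can pull the three readings out and do that arithmetic:

--
#!/bin/bash
# Run the reproducer quoted above and report how much percpu memory is
# returned after the cgroups are removed. Assumes ./percpu_test.sh is the
# script from the cover letter, printing three "Percpu: N kB" lines.
readings=($(./percpu_test.sh | awk '/^Percpu:/ {print $2}'))

echo "initial:       ${readings[0]} kB"
echo "after create:  ${readings[1]} kB"
echo "after cleanup: ${readings[2]} kB"

# e.g. 481408 / 135552 ~= 3.55x with the series applied, ~1x on vanilla
awk -v a="${readings[1]}" -v b="${readings[2]}" \
    'BEGIN { printf "shrink factor: %.2fx\n", a / b }'
--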
Hello, On Fri, Apr 16, 2021 at 06:26:15PM +0530, Pratik Sampat wrote: > Hello Roman, > > I've tried the v3 patch series on a POWER9 and an x86 KVM setup. > > My results of the percpu_test are as follows: > Intel KVM 4CPU:4G > Vanilla 5.12-rc6 > # ./percpu_test.sh > Percpu: 1952 kB > Percpu: 219648 kB > Percpu: 219648 kB > > 5.12-rc6 + with patchset applied > # ./percpu_test.sh > Percpu: 2080 kB > Percpu: 219712 kB > Percpu: 72672 kB > > I'm able to see improvement comparable to that of what you're see too. > > However, on POWERPC I'm unable to reproduce these improvements with the patchset in the same configuration > > POWER9 KVM 4CPU:4G > Vanilla 5.12-rc6 > # ./percpu_test.sh > Percpu: 5888 kB > Percpu: 118272 kB > Percpu: 118272 kB > > 5.12-rc6 + with patchset applied > # ./percpu_test.sh > Percpu: 6144 kB > Percpu: 119040 kB > Percpu: 119040 kB > > I'm wondering if there's any architectural specific code that needs plumbing > here? > There shouldn't be. Can you send me the percpu_stats debug output before and after? > I will also look through the code to find the reason why POWER isn't > depopulating pages. > > Thank you, > Pratik > > On 08/04/21 9:27 am, Roman Gushchin wrote: > > In our production experience the percpu memory allocator is sometimes struggling > > with returning the memory to the system. A typical example is a creation of > > several thousands memory cgroups (each has several chunks of the percpu data > > used for vmstats, vmevents, ref counters etc). Deletion and complete releasing > > of these cgroups doesn't always lead to a shrinkage of the percpu memory, > > so that sometimes there are several GB's of memory wasted. > > > > The underlying problem is the fragmentation: to release an underlying chunk > > all percpu allocations should be released first. The percpu allocator tends > > to top up chunks to improve the utilization. It means new small-ish allocations > > (e.g. percpu ref counters) are placed onto almost filled old-ish chunks, > > effectively pinning them in memory. > > > > This patchset solves this problem by implementing a partial depopulation > > of percpu chunks: chunks with many empty pages are being asynchronously > > depopulated and the pages are returned to the system. > > > > To illustrate the problem the following script can be used: > > > > -- > > #!/bin/bash > > > > cd /sys/fs/cgroup > > > > mkdir percpu_test > > echo "+memory" > percpu_test/cgroup.subtree_control > > > > cat /proc/meminfo | grep Percpu > > > > for i in `seq 1 1000`; do > > mkdir percpu_test/cg_"${i}" > > for j in `seq 1 10`; do > > mkdir percpu_test/cg_"${i}"_"${j}" > > done > > done > > > > cat /proc/meminfo | grep Percpu > > > > for i in `seq 1 1000`; do > > for j in `seq 1 10`; do > > rmdir percpu_test/cg_"${i}"_"${j}" > > done > > done > > > > sleep 10 > > > > cat /proc/meminfo | grep Percpu > > > > for i in `seq 1 1000`; do > > rmdir percpu_test/cg_"${i}" > > done > > > > rmdir percpu_test > > -- > > > > It creates 11000 memory cgroups and removes every 10 out of 11. > > It prints the initial size of the percpu memory, the size after > > creating all cgroups and the size after deleting most of them. > > > > Results: > > vanilla: > > ./percpu_test.sh > > Percpu: 7488 kB > > Percpu: 481152 kB > > Percpu: 481152 kB > > > > with this patchset applied: > > ./percpu_test.sh > > Percpu: 7488 kB > > Percpu: 481408 kB > > Percpu: 135552 kB > > > > So the total size of the percpu memory was reduced by more than 3.5 times. 
> > > > v3: > > - introduced pcpu_check_chunk_hint() > > - fixed a bug related to the hint check > > - minor cosmetic changes > > - s/pretends/fixes (cc Vlastimil) > > > > v2: > > - depopulated chunks are sidelined > > - depopulation happens in the reverse order > > - depopulate list made per-chunk type > > - better results due to better heuristics > > > > v1: > > - depopulation heuristics changed and optimized > > - chunks are put into a separate list, depopulation scan this list > > - chunk->isolated is introduced, chunk->depopulate is dropped > > - rearranged patches a bit > > - fixed a panic discovered by krobot > > - made pcpu_nr_empty_pop_pages per chunk type > > - minor fixes > > > > rfc: > > https://lwn.net/Articles/850508/ > > > > > > Roman Gushchin (6): > > percpu: fix a comment about the chunks ordering > > percpu: split __pcpu_balance_workfn() > > percpu: make pcpu_nr_empty_pop_pages per chunk type > > percpu: generalize pcpu_balance_populated() > > percpu: factor out pcpu_check_chunk_hint() > > percpu: implement partial chunk depopulation > > > > mm/percpu-internal.h | 4 +- > > mm/percpu-stats.c | 9 +- > > mm/percpu.c | 306 +++++++++++++++++++++++++++++++++++-------- > > 3 files changed, 261 insertions(+), 58 deletions(-) > > > Roman, sorry for the delay. I'm looking to apply this today to for-5.14. Thanks, Dennis
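For anyone who wants to collect the same data: the per-chunk statistics Dennis is asking for come from the optional percpu stats interface, not from /proc/meminfo. A minimal capture sequence, assuming a kernel built with CONFIG_PERCPU_STATS=y and debugfs mounted in the usual place:

--
#!/bin/bash
# Snapshot the percpu allocator statistics before and after the
# reproducer. Requires CONFIG_PERCPU_STATS=y; the stats file is exposed
# through debugfs, typically mounted at /sys/kernel/debug.
STATS=/sys/kernel/debug/percpu_stats

mountpoint -q /sys/kernel/debug || mount -t debugfs none /sys/kernel/debug

cat "$STATS" > percpu_stats.before
./percpu_test.sh
cat "$STATS" > percpu_stats.after
--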
Hello Dennis, I apologize for the clutter of logs before, I'm pasting the logs of before and after the percpu test in the case of the patchset being applied on 5.12-rc6 and the vanilla kernel 5.12-rc6. On 16/04/21 7:48 pm, Dennis Zhou wrote: > Hello, > > On Fri, Apr 16, 2021 at 06:26:15PM +0530, Pratik Sampat wrote: >> Hello Roman, >> >> I've tried the v3 patch series on a POWER9 and an x86 KVM setup. >> >> My results of the percpu_test are as follows: >> Intel KVM 4CPU:4G >> Vanilla 5.12-rc6 >> # ./percpu_test.sh >> Percpu: 1952 kB >> Percpu: 219648 kB >> Percpu: 219648 kB >> >> 5.12-rc6 + with patchset applied >> # ./percpu_test.sh >> Percpu: 2080 kB >> Percpu: 219712 kB >> Percpu: 72672 kB >> >> I'm able to see improvement comparable to that of what you're see too. >> >> However, on POWERPC I'm unable to reproduce these improvements with the patchset in the same configuration >> >> POWER9 KVM 4CPU:4G >> Vanilla 5.12-rc6 >> # ./percpu_test.sh >> Percpu: 5888 kB >> Percpu: 118272 kB >> Percpu: 118272 kB >> >> 5.12-rc6 + with patchset applied >> # ./percpu_test.sh >> Percpu: 6144 kB >> Percpu: 119040 kB >> Percpu: 119040 kB >> >> I'm wondering if there's any architectural specific code that needs plumbing >> here? >> > There shouldn't be. Can you send me the percpu_stats debug output before > and after? I'll paste the whole debug stats before and after here. 5.12-rc6 + patchset -----BEFORE----- Percpu Memory Statistics Allocation Info: ---------------------------------------- unit_size : 655360 static_size : 608920 reserved_size : 0 dyn_size : 46440 atom_size : 65536 alloc_size : 655360 Global Stats: ---------------------------------------- nr_alloc : 9040 nr_dealloc : 6994 nr_cur_alloc : 2046 nr_max_alloc : 2208 nr_chunks : 3 nr_max_chunks : 3 min_alloc_size : 4 max_alloc_size : 1072 empty_pop_pages : 12 Per Chunk Stats: ---------------------------------------- Chunk: <- First Chunk nr_alloc : 859 max_alloc_size : 1072 empty_pop_pages : 0 first_bit : 16384 free_bytes : 0 contig_bytes : 0 sum_frag : 0 max_frag : 0 cur_min_alloc : 4 cur_med_alloc : 8 cur_max_alloc : 1072 memcg_aware : 0 Chunk: nr_alloc : 827 max_alloc_size : 992 empty_pop_pages : 8 first_bit : 692 free_bytes : 645012 contig_bytes : 460096 sum_frag : 466420 max_frag : 460096 cur_min_alloc : 4 cur_med_alloc : 8 cur_max_alloc : 152 memcg_aware : 0 Chunk: nr_alloc : 360 max_alloc_size : 1072 empty_pop_pages : 4 first_bit : 29207 free_bytes : 506640 contig_bytes : 506556 sum_frag : 84 max_frag : 32 cur_min_alloc : 4 cur_med_alloc : 156 cur_max_alloc : 1072 memcg_aware : 1 -----AFTER----- Percpu Memory Statistics Allocation Info: ---------------------------------------- unit_size : 655360 static_size : 608920 reserved_size : 0 dyn_size : 46440 atom_size : 65536 alloc_size : 655360 Global Stats: ---------------------------------------- nr_alloc : 97048 nr_dealloc : 95002 nr_cur_alloc : 2046 nr_max_alloc : 90054 nr_chunks : 48 nr_max_chunks : 48 min_alloc_size : 4 max_alloc_size : 1072 empty_pop_pages : 61 Per Chunk Stats: ---------------------------------------- Chunk: <- First Chunk nr_alloc : 859 max_alloc_size : 1072 empty_pop_pages : 0 first_bit : 16384 free_bytes : 0 contig_bytes : 0 sum_frag : 0 max_frag : 0 cur_min_alloc : 4 cur_med_alloc : 8 cur_max_alloc : 1072 memcg_aware : 0 Chunk: nr_alloc : 827 max_alloc_size : 1072 empty_pop_pages : 8 first_bit : 692 free_bytes : 645012 contig_bytes : 460096 sum_frag : 466420 max_frag : 460096 cur_min_alloc : 4 cur_med_alloc : 8 cur_max_alloc : 152 memcg_aware : 0 Chunk: nr_alloc : 0 
max_alloc_size : 0 empty_pop_pages : 1 first_bit : 0 free_bytes : 655360 contig_bytes : 655360 sum_frag : 0 max_frag : 0 cur_min_alloc : 0 cur_med_alloc : 0 cur_max_alloc : 0 memcg_aware : 0 Chunk: nr_alloc : 360 max_alloc_size : 1072 empty_pop_pages : 7 first_bit : 29207 free_bytes : 506640 contig_bytes : 506556 sum_frag : 84 max_frag : 32 cur_min_alloc : 4 cur_med_alloc : 156 cur_max_alloc : 1072 memcg_aware : 1 Chunk: nr_alloc : 0 max_alloc_size : 1072 empty_pop_pages : 0 first_bit : 0 free_bytes : 655360 contig_bytes : 655360 sum_frag : 0 max_frag : 0 cur_min_alloc : 0 cur_med_alloc : 0 cur_max_alloc : 0 memcg_aware : 1 I'm also pasting the logs before and after in a vanilla kernel too There are considerably higher number of chunks in the vanilla kernel, than with the patches though. 5.12-rc6 vanilla -----BEFORE----- Percpu Memory Statistics Allocation Info: ---------------------------------------- unit_size : 655360 static_size : 608920 reserved_size : 0 dyn_size : 46440 atom_size : 65536 alloc_size : 655360 Global Stats: ---------------------------------------- nr_alloc : 9038 nr_dealloc : 6992 nr_cur_alloc : 2046 nr_max_alloc : 2178 nr_chunks : 3 nr_max_chunks : 3 min_alloc_size : 4 max_alloc_size : 1072 empty_pop_pages : 5 Per Chunk Stats: ---------------------------------------- Chunk: <- First Chunk nr_alloc : 1088 max_alloc_size : 1072 empty_pop_pages : 0 first_bit : 16384 free_bytes : 0 contig_bytes : 0 sum_frag : 0 max_frag : 0 cur_min_alloc : 4 cur_med_alloc : 8 cur_max_alloc : 1072 memcg_aware : 0 Chunk: nr_alloc : 598 max_alloc_size : 992 empty_pop_pages : 5 first_bit : 642 free_bytes : 645012 contig_bytes : 504292 sum_frag : 140720 max_frag : 116456 cur_min_alloc : 4 cur_med_alloc : 8 cur_max_alloc : 424 memcg_aware : 0 Chunk: nr_alloc : 360 max_alloc_size : 1072 empty_pop_pages : 0 first_bit : 27909 free_bytes : 506640 contig_bytes : 506556 sum_frag : 84 max_frag : 36 cur_min_alloc : 4 cur_med_alloc : 156 cur_max_alloc : 1072 memcg_aware : 1 -----AFTER----- Percpu Memory Statistics Allocation Info: ---------------------------------------- unit_size : 655360 static_size : 608920 reserved_size : 0 dyn_size : 46440 atom_size : 65536 alloc_size : 655360 Global Stats: ---------------------------------------- nr_alloc : 97046 nr_dealloc : 94237 nr_cur_alloc : 2809 nr_max_alloc : 90054 nr_chunks : 11 nr_max_chunks : 47 min_alloc_size : 4 max_alloc_size : 1072 empty_pop_pages : 29 Per Chunk Stats: ---------------------------------------- Chunk: <- First Chunk nr_alloc : 1088 max_alloc_size : 1072 empty_pop_pages : 0 first_bit : 16384 free_bytes : 0 contig_bytes : 0 sum_frag : 0 max_frag : 0 cur_min_alloc : 4 cur_med_alloc : 8 cur_max_alloc : 1072 memcg_aware : 0 Chunk: nr_alloc : 865 max_alloc_size : 1072 empty_pop_pages : 6 first_bit : 789 free_bytes : 640296 contig_bytes : 290672 sum_frag : 349624 max_frag : 169956 cur_min_alloc : 4 cur_med_alloc : 8 cur_max_alloc : 1072 memcg_aware : 0 Chunk: nr_alloc : 90 max_alloc_size : 1072 empty_pop_pages : 0 first_bit : 536 free_bytes : 595752 contig_bytes : 26164 sum_frag : 575132 max_frag : 26164 cur_min_alloc : 156 cur_med_alloc : 1072 cur_max_alloc : 1072 memcg_aware : 1 Chunk: nr_alloc : 90 max_alloc_size : 1072 empty_pop_pages : 0 first_bit : 0 free_bytes : 597428 contig_bytes : 26164 sum_frag : 596848 max_frag : 26164 cur_min_alloc : 156 cur_med_alloc : 312 cur_max_alloc : 1072 memcg_aware : 1 Chunk: nr_alloc : 92 max_alloc_size : 1072 empty_pop_pages : 0 first_bit : 0 free_bytes : 595284 contig_bytes : 26164 sum_frag : 590360 
max_frag : 26164 cur_min_alloc : 156 cur_med_alloc : 312 cur_max_alloc : 1072 memcg_aware : 1 Chunk: nr_alloc : 92 max_alloc_size : 1072 empty_pop_pages : 0 first_bit : 0 free_bytes : 595284 contig_bytes : 26164 sum_frag : 583768 max_frag : 26164 cur_min_alloc : 156 cur_med_alloc : 312 cur_max_alloc : 1072 memcg_aware : 1 Chunk: nr_alloc : 90 max_alloc_size : 1072 empty_pop_pages : 0 first_bit : 0 free_bytes : 595752 contig_bytes : 26164 sum_frag : 577748 max_frag : 26164 cur_min_alloc : 156 cur_med_alloc : 1072 cur_max_alloc : 1072 memcg_aware : 1 Chunk: nr_alloc : 30 max_alloc_size : 1072 empty_pop_pages : 6 first_bit : 0 free_bytes : 636608 contig_bytes : 397944 sum_frag : 636500 max_frag : 426720 cur_min_alloc : 156 cur_med_alloc : 312 cur_max_alloc : 1072 memcg_aware : 1 Chunk: nr_alloc : 360 max_alloc_size : 1072 empty_pop_pages : 7 first_bit : 27909 free_bytes : 506640 contig_bytes : 506556 sum_frag : 84 max_frag : 36 cur_min_alloc : 4 cur_med_alloc : 156 cur_max_alloc : 1072 memcg_aware : 1 Chunk: nr_alloc : 12 max_alloc_size : 1072 empty_pop_pages : 0 first_bit : 0 free_bytes : 647524 contig_bytes : 563492 sum_frag : 57872 max_frag : 26164 cur_min_alloc : 156 cur_med_alloc : 312 cur_max_alloc : 1072 memcg_aware : 1 Chunk: nr_alloc : 0 max_alloc_size : 1072 empty_pop_pages : 10 first_bit : 0 free_bytes : 655360 contig_bytes : 655360 sum_frag : 0 max_frag : 0 cur_min_alloc : 0 cur_med_alloc : 0 cur_max_alloc : 0 memcg_aware : 1 >> I will also look through the code to find the reason why POWER isn't >> depopulating pages. >> >> Thank you, >> Pratik >> >> On 08/04/21 9:27 am, Roman Gushchin wrote: >>> In our production experience the percpu memory allocator is sometimes struggling >>> with returning the memory to the system. A typical example is a creation of >>> several thousands memory cgroups (each has several chunks of the percpu data >>> used for vmstats, vmevents, ref counters etc). Deletion and complete releasing >>> of these cgroups doesn't always lead to a shrinkage of the percpu memory, >>> so that sometimes there are several GB's of memory wasted. >>> >>> The underlying problem is the fragmentation: to release an underlying chunk >>> all percpu allocations should be released first. The percpu allocator tends >>> to top up chunks to improve the utilization. It means new small-ish allocations >>> (e.g. percpu ref counters) are placed onto almost filled old-ish chunks, >>> effectively pinning them in memory. >>> >>> This patchset solves this problem by implementing a partial depopulation >>> of percpu chunks: chunks with many empty pages are being asynchronously >>> depopulated and the pages are returned to the system. >>> >>> To illustrate the problem the following script can be used: >>> >>> -- >>> #!/bin/bash >>> >>> cd /sys/fs/cgroup >>> >>> mkdir percpu_test >>> echo "+memory" > percpu_test/cgroup.subtree_control >>> >>> cat /proc/meminfo | grep Percpu >>> >>> for i in `seq 1 1000`; do >>> mkdir percpu_test/cg_"${i}" >>> for j in `seq 1 10`; do >>> mkdir percpu_test/cg_"${i}"_"${j}" >>> done >>> done >>> >>> cat /proc/meminfo | grep Percpu >>> >>> for i in `seq 1 1000`; do >>> for j in `seq 1 10`; do >>> rmdir percpu_test/cg_"${i}"_"${j}" >>> done >>> done >>> >>> sleep 10 >>> >>> cat /proc/meminfo | grep Percpu >>> >>> for i in `seq 1 1000`; do >>> rmdir percpu_test/cg_"${i}" >>> done >>> >>> rmdir percpu_test >>> -- >>> >>> It creates 11000 memory cgroups and removes every 10 out of 11. 
>>> It prints the initial size of the percpu memory, the size after >>> creating all cgroups and the size after deleting most of them. >>> >>> Results: >>> vanilla: >>> ./percpu_test.sh >>> Percpu: 7488 kB >>> Percpu: 481152 kB >>> Percpu: 481152 kB >>> >>> with this patchset applied: >>> ./percpu_test.sh >>> Percpu: 7488 kB >>> Percpu: 481408 kB >>> Percpu: 135552 kB >>> >>> So the total size of the percpu memory was reduced by more than 3.5 times. >>> >>> v3: >>> - introduced pcpu_check_chunk_hint() >>> - fixed a bug related to the hint check >>> - minor cosmetic changes >>> - s/pretends/fixes (cc Vlastimil) >>> >>> v2: >>> - depopulated chunks are sidelined >>> - depopulation happens in the reverse order >>> - depopulate list made per-chunk type >>> - better results due to better heuristics >>> >>> v1: >>> - depopulation heuristics changed and optimized >>> - chunks are put into a separate list, depopulation scan this list >>> - chunk->isolated is introduced, chunk->depopulate is dropped >>> - rearranged patches a bit >>> - fixed a panic discovered by krobot >>> - made pcpu_nr_empty_pop_pages per chunk type >>> - minor fixes >>> >>> rfc: >>> https://lwn.net/Articles/850508/ >>> >>> >>> Roman Gushchin (6): >>> percpu: fix a comment about the chunks ordering >>> percpu: split __pcpu_balance_workfn() >>> percpu: make pcpu_nr_empty_pop_pages per chunk type >>> percpu: generalize pcpu_balance_populated() >>> percpu: factor out pcpu_check_chunk_hint() >>> percpu: implement partial chunk depopulation >>> >>> mm/percpu-internal.h | 4 +- >>> mm/percpu-stats.c | 9 +- >>> mm/percpu.c | 306 +++++++++++++++++++++++++++++++++++-------- >>> 3 files changed, 261 insertions(+), 58 deletions(-) >>> > Roman, sorry for the delay. I'm looking to apply this today to for-5.14. > > Thanks, > Dennis Thanks Pratik
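Dumps of this size are easier to compare if only the "Global Stats:" block is pulled out of each file first. A convenience sketch along these lines (assuming the dump format pasted above and the percpu_stats.before/percpu_stats.after file names from the capture step) produces a side-by-side view similar to the one Roman builds by hand later in the thread:

--
#!/bin/bash
# Print only the "Global Stats" section of two percpu_stats dumps next
# to each other, so the before/after comparison fits on one screen.
extract_global() {
    awk '/^Global Stats:/   { grab = 1; next }
         /^Per Chunk Stats:/ { grab = 0 }
         grab && !/^-+$/ && NF' "$1"
}

paste <(extract_global percpu_stats.before) \
      <(extract_global percpu_stats.after) | column -t
--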
On Fri, Apr 16, 2021 at 02:18:10PM +0000, Dennis Zhou wrote: > Hello, > > On Fri, Apr 16, 2021 at 06:26:15PM +0530, Pratik Sampat wrote: > > Hello Roman, > > > > I've tried the v3 patch series on a POWER9 and an x86 KVM setup. > > > > My results of the percpu_test are as follows: > > Intel KVM 4CPU:4G > > Vanilla 5.12-rc6 > > # ./percpu_test.sh > > Percpu: 1952 kB > > Percpu: 219648 kB > > Percpu: 219648 kB > > > > 5.12-rc6 + with patchset applied > > # ./percpu_test.sh > > Percpu: 2080 kB > > Percpu: 219712 kB > > Percpu: 72672 kB > > > > I'm able to see improvement comparable to that of what you're see too. > > > > However, on POWERPC I'm unable to reproduce these improvements with the patchset in the same configuration > > > > POWER9 KVM 4CPU:4G > > Vanilla 5.12-rc6 > > # ./percpu_test.sh > > Percpu: 5888 kB > > Percpu: 118272 kB > > Percpu: 118272 kB > > > > 5.12-rc6 + with patchset applied > > # ./percpu_test.sh > > Percpu: 6144 kB > > Percpu: 119040 kB > > Percpu: 119040 kB > > > > I'm wondering if there's any architectural specific code that needs plumbing > > here? > > > > There shouldn't be. Can you send me the percpu_stats debug output before > and after? Btw, sidelined chunks are not listed in the debug output. It was actually on my to-do list, looks like I need to prioritize it a bit. > > > I will also look through the code to find the reason why POWER isn't > > depopulating pages. > > > > Thank you, > > Pratik > > > > On 08/04/21 9:27 am, Roman Gushchin wrote: > > > In our production experience the percpu memory allocator is sometimes struggling > > > with returning the memory to the system. A typical example is a creation of > > > several thousands memory cgroups (each has several chunks of the percpu data > > > used for vmstats, vmevents, ref counters etc). Deletion and complete releasing > > > of these cgroups doesn't always lead to a shrinkage of the percpu memory, > > > so that sometimes there are several GB's of memory wasted. > > > > > > The underlying problem is the fragmentation: to release an underlying chunk > > > all percpu allocations should be released first. The percpu allocator tends > > > to top up chunks to improve the utilization. It means new small-ish allocations > > > (e.g. percpu ref counters) are placed onto almost filled old-ish chunks, > > > effectively pinning them in memory. > > > > > > This patchset solves this problem by implementing a partial depopulation > > > of percpu chunks: chunks with many empty pages are being asynchronously > > > depopulated and the pages are returned to the system. > > > > > > To illustrate the problem the following script can be used: > > > > > > -- > > > #!/bin/bash > > > > > > cd /sys/fs/cgroup > > > > > > mkdir percpu_test > > > echo "+memory" > percpu_test/cgroup.subtree_control > > > > > > cat /proc/meminfo | grep Percpu > > > > > > for i in `seq 1 1000`; do > > > mkdir percpu_test/cg_"${i}" > > > for j in `seq 1 10`; do > > > mkdir percpu_test/cg_"${i}"_"${j}" > > > done > > > done > > > > > > cat /proc/meminfo | grep Percpu > > > > > > for i in `seq 1 1000`; do > > > for j in `seq 1 10`; do > > > rmdir percpu_test/cg_"${i}"_"${j}" > > > done > > > done > > > > > > sleep 10 > > > > > > cat /proc/meminfo | grep Percpu > > > > > > for i in `seq 1 1000`; do > > > rmdir percpu_test/cg_"${i}" > > > done > > > > > > rmdir percpu_test > > > -- > > > > > > It creates 11000 memory cgroups and removes every 10 out of 11. 
> > > It prints the initial size of the percpu memory, the size after > > > creating all cgroups and the size after deleting most of them. > > > > > > Results: > > > vanilla: > > > ./percpu_test.sh > > > Percpu: 7488 kB > > > Percpu: 481152 kB > > > Percpu: 481152 kB > > > > > > with this patchset applied: > > > ./percpu_test.sh > > > Percpu: 7488 kB > > > Percpu: 481408 kB > > > Percpu: 135552 kB > > > > > > So the total size of the percpu memory was reduced by more than 3.5 times. > > > > > > v3: > > > - introduced pcpu_check_chunk_hint() > > > - fixed a bug related to the hint check > > > - minor cosmetic changes > > > - s/pretends/fixes (cc Vlastimil) > > > > > > v2: > > > - depopulated chunks are sidelined > > > - depopulation happens in the reverse order > > > - depopulate list made per-chunk type > > > - better results due to better heuristics > > > > > > v1: > > > - depopulation heuristics changed and optimized > > > - chunks are put into a separate list, depopulation scan this list > > > - chunk->isolated is introduced, chunk->depopulate is dropped > > > - rearranged patches a bit > > > - fixed a panic discovered by krobot > > > - made pcpu_nr_empty_pop_pages per chunk type > > > - minor fixes > > > > > > rfc: > > > https://lwn.net/Articles/850508/ > > > > > > > > > Roman Gushchin (6): > > > percpu: fix a comment about the chunks ordering > > > percpu: split __pcpu_balance_workfn() > > > percpu: make pcpu_nr_empty_pop_pages per chunk type > > > percpu: generalize pcpu_balance_populated() > > > percpu: factor out pcpu_check_chunk_hint() > > > percpu: implement partial chunk depopulation > > > > > > mm/percpu-internal.h | 4 +- > > > mm/percpu-stats.c | 9 +- > > > mm/percpu.c | 306 +++++++++++++++++++++++++++++++++++-------- > > > 3 files changed, 261 insertions(+), 58 deletions(-) > > > > > > > Roman, sorry for the delay. I'm looking to apply this today to for-5.14. Great, thanks!
On Fri, Apr 16, 2021 at 08:58:33PM +0530, Pratik Sampat wrote: > Hello Dennis, > > I apologize for the clutter of logs before, I'm pasting the logs of before and > after the percpu test in the case of the patchset being applied on 5.12-rc6 and > the vanilla kernel 5.12-rc6. > > On 16/04/21 7:48 pm, Dennis Zhou wrote: > > Hello, > > > > On Fri, Apr 16, 2021 at 06:26:15PM +0530, Pratik Sampat wrote: > > > Hello Roman, > > > > > > I've tried the v3 patch series on a POWER9 and an x86 KVM setup. > > > > > > My results of the percpu_test are as follows: > > > Intel KVM 4CPU:4G > > > Vanilla 5.12-rc6 > > > # ./percpu_test.sh > > > Percpu: 1952 kB > > > Percpu: 219648 kB > > > Percpu: 219648 kB > > > > > > 5.12-rc6 + with patchset applied > > > # ./percpu_test.sh > > > Percpu: 2080 kB > > > Percpu: 219712 kB > > > Percpu: 72672 kB > > > > > > I'm able to see improvement comparable to that of what you're see too. > > > > > > However, on POWERPC I'm unable to reproduce these improvements with the patchset in the same configuration > > > > > > POWER9 KVM 4CPU:4G > > > Vanilla 5.12-rc6 > > > # ./percpu_test.sh > > > Percpu: 5888 kB > > > Percpu: 118272 kB > > > Percpu: 118272 kB > > > > > > 5.12-rc6 + with patchset applied > > > # ./percpu_test.sh > > > Percpu: 6144 kB > > > Percpu: 119040 kB > > > Percpu: 119040 kB > > > > > > I'm wondering if there's any architectural specific code that needs plumbing > > > here? > > > > > There shouldn't be. Can you send me the percpu_stats debug output before > > and after? > > I'll paste the whole debug stats before and after here. > 5.12-rc6 + patchset > -----BEFORE----- > Percpu Memory Statistics > Allocation Info: Hm, this looks highly suspicious. Here is your stats in a more compact form: Vanilla nr_alloc : 9038 nr_alloc : 97046 nr_dealloc : 6992 nr_dealloc : 94237 nr_cur_alloc : 2046 nr_cur_alloc : 2809 nr_max_alloc : 2178 nr_max_alloc : 90054 nr_chunks : 3 nr_chunks : 11 nr_max_chunks : 3 nr_max_chunks : 47 min_alloc_size : 4 min_alloc_size : 4 max_alloc_size : 1072 max_alloc_size : 1072 empty_pop_pages : 5 empty_pop_pages : 29 Patched nr_alloc : 9040 nr_alloc : 97048 nr_dealloc : 6994 nr_dealloc : 95002 nr_cur_alloc : 2046 nr_cur_alloc : 2046 nr_max_alloc : 2208 nr_max_alloc : 90054 nr_chunks : 3 nr_chunks : 48 nr_max_chunks : 3 nr_max_chunks : 48 min_alloc_size : 4 min_alloc_size : 4 max_alloc_size : 1072 max_alloc_size : 1072 empty_pop_pages : 12 empty_pop_pages : 61 So it looks like the number of chunks got bigger, as well as the number of empty_pop_pages? This contradicts to what you wrote, so can you, please, make sure that the data is correct and we're not messing two cases? So it looks like for some reason sidelined (depopulated) chunks are not getting freed completely. But I struggle to explain why the initial empty_pop_pages is bigger with the same amount of chunks. So, can you, please, apply the following patch and provide an updated statistics? -- From d0d2bfdb891afec6bd63790b3492b852db490640 Mon Sep 17 00:00:00 2001 From: Roman Gushchin <guro@fb.com> Date: Fri, 16 Apr 2021 09:54:38 -0700 Subject: [PATCH] percpu: include sidelined and depopulating chunks into debug output Information about sidelined chunks and chunks in the depopulate queue could be extremely valuable for debugging different problems. Dump information about these chunks on pair with regular chunks in percpu slots via percpu stats interface. 
Signed-off-by: Roman Gushchin <guro@fb.com>
---
 mm/percpu-internal.h |  2 ++
 mm/percpu-stats.c    | 10 ++++++++++
 mm/percpu.c          |  4 ++--
 3 files changed, 14 insertions(+), 2 deletions(-)

diff --git a/mm/percpu-internal.h b/mm/percpu-internal.h
index 8e432663c41e..c11f115ced5c 100644
--- a/mm/percpu-internal.h
+++ b/mm/percpu-internal.h
@@ -90,6 +90,8 @@ extern spinlock_t pcpu_lock;
 extern struct list_head *pcpu_chunk_lists;
 extern int pcpu_nr_slots;
 extern int pcpu_nr_empty_pop_pages[];
+extern struct list_head pcpu_depopulate_list[];
+extern struct list_head pcpu_sideline_list[];
 
 extern struct pcpu_chunk *pcpu_first_chunk;
 extern struct pcpu_chunk *pcpu_reserved_chunk;
diff --git a/mm/percpu-stats.c b/mm/percpu-stats.c
index f6026dbcdf6b..af09ed1ea5f8 100644
--- a/mm/percpu-stats.c
+++ b/mm/percpu-stats.c
@@ -228,6 +228,16 @@ static int percpu_stats_show(struct seq_file *m, void *v)
 				}
 			}
 		}
+
+		list_for_each_entry(chunk, &pcpu_sideline_list[type], list) {
+			seq_puts(m, "Chunk (sidelined):\n");
+			chunk_map_stats(m, chunk, buffer);
+		}
+
+		list_for_each_entry(chunk, &pcpu_depopulate_list[type], list) {
+			seq_puts(m, "Chunk (to depopulate):\n");
+			chunk_map_stats(m, chunk, buffer);
+		}
 	}
 
 	spin_unlock_irq(&pcpu_lock);
diff --git a/mm/percpu.c b/mm/percpu.c
index 5bb294e394b3..ded3a7541cb2 100644
--- a/mm/percpu.c
+++ b/mm/percpu.c
@@ -185,13 +185,13 @@ int pcpu_nr_empty_pop_pages[PCPU_NR_CHUNK_TYPES];
  * List of chunks with a lot of free pages. Used to depopulate them
  * asynchronously.
  */
-static struct list_head pcpu_depopulate_list[PCPU_NR_CHUNK_TYPES];
+struct list_head pcpu_depopulate_list[PCPU_NR_CHUNK_TYPES];
 
 /*
  * List of previously depopulated chunks. They are not usually used for new
  * allocations, but can be returned back to service if a need arises.
  */
-static struct list_head pcpu_sideline_list[PCPU_NR_CHUNK_TYPES];
+struct list_head pcpu_sideline_list[PCPU_NR_CHUNK_TYPES];
 
 
 /*
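If it helps anyone else reproducing this, a rough sequence for picking the debug patch up on top of the v3 series and regenerating the dump could look like the following (the patch file name and stats file names are placeholders, not something from the thread):

--
# Apply the stats-visibility patch above on top of the series and rebuild.
# "percpu-debug-output.patch" is a placeholder name for the mail above.
git am percpu-debug-output.patch
make -j"$(nproc)"

# After booting the rebuilt kernel, capture stats around the reproducer:
cat /sys/kernel/debug/percpu_stats > stats.before
./percpu_test.sh
cat /sys/kernel/debug/percpu_stats > stats.after
--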
On 16/04/21 10:43 pm, Roman Gushchin wrote: > On Fri, Apr 16, 2021 at 08:58:33PM +0530, Pratik Sampat wrote: >> Hello Dennis, >> >> I apologize for the clutter of logs before, I'm pasting the logs of before and >> after the percpu test in the case of the patchset being applied on 5.12-rc6 and >> the vanilla kernel 5.12-rc6. >> >> On 16/04/21 7:48 pm, Dennis Zhou wrote: >>> Hello, >>> >>> On Fri, Apr 16, 2021 at 06:26:15PM +0530, Pratik Sampat wrote: >>>> Hello Roman, >>>> >>>> I've tried the v3 patch series on a POWER9 and an x86 KVM setup. >>>> >>>> My results of the percpu_test are as follows: >>>> Intel KVM 4CPU:4G >>>> Vanilla 5.12-rc6 >>>> # ./percpu_test.sh >>>> Percpu: 1952 kB >>>> Percpu: 219648 kB >>>> Percpu: 219648 kB >>>> >>>> 5.12-rc6 + with patchset applied >>>> # ./percpu_test.sh >>>> Percpu: 2080 kB >>>> Percpu: 219712 kB >>>> Percpu: 72672 kB >>>> >>>> I'm able to see improvement comparable to that of what you're see too. >>>> >>>> However, on POWERPC I'm unable to reproduce these improvements with the patchset in the same configuration >>>> >>>> POWER9 KVM 4CPU:4G >>>> Vanilla 5.12-rc6 >>>> # ./percpu_test.sh >>>> Percpu: 5888 kB >>>> Percpu: 118272 kB >>>> Percpu: 118272 kB >>>> >>>> 5.12-rc6 + with patchset applied >>>> # ./percpu_test.sh >>>> Percpu: 6144 kB >>>> Percpu: 119040 kB >>>> Percpu: 119040 kB >>>> >>>> I'm wondering if there's any architectural specific code that needs plumbing >>>> here? >>>> >>> There shouldn't be. Can you send me the percpu_stats debug output before >>> and after? >> I'll paste the whole debug stats before and after here. >> 5.12-rc6 + patchset >> -----BEFORE----- >> Percpu Memory Statistics >> Allocation Info: > > Hm, this looks highly suspicious. Here is your stats in a more compact form: > > Vanilla > > nr_alloc : 9038 nr_alloc : 97046 > nr_dealloc : 6992 nr_dealloc : 94237 > nr_cur_alloc : 2046 nr_cur_alloc : 2809 > nr_max_alloc : 2178 nr_max_alloc : 90054 > nr_chunks : 3 nr_chunks : 11 > nr_max_chunks : 3 nr_max_chunks : 47 > min_alloc_size : 4 min_alloc_size : 4 > max_alloc_size : 1072 max_alloc_size : 1072 > empty_pop_pages : 5 empty_pop_pages : 29 > > > Patched > > nr_alloc : 9040 nr_alloc : 97048 > nr_dealloc : 6994 nr_dealloc : 95002 > nr_cur_alloc : 2046 nr_cur_alloc : 2046 > nr_max_alloc : 2208 nr_max_alloc : 90054 > nr_chunks : 3 nr_chunks : 48 > nr_max_chunks : 3 nr_max_chunks : 48 > min_alloc_size : 4 min_alloc_size : 4 > max_alloc_size : 1072 max_alloc_size : 1072 > empty_pop_pages : 12 empty_pop_pages : 61 > > > So it looks like the number of chunks got bigger, as well as the number of > empty_pop_pages? This contradicts to what you wrote, so can you, please, make > sure that the data is correct and we're not messing two cases? > > So it looks like for some reason sidelined (depopulated) chunks are not getting > freed completely. But I struggle to explain why the initial empty_pop_pages is > bigger with the same amount of chunks. > > So, can you, please, apply the following patch and provide an updated statistics? Unfortunately, I'm not completely well versed in this area, but yes the empty pop pages number doesn't make sense to me either. I re-ran the numbers trying to make sure my experiment setup is sane but results remain the same. 
Vanilla nr_alloc : 9040 nr_alloc : 97048 nr_dealloc : 6994 nr_dealloc : 94404 nr_cur_alloc : 2046 nr_cur_alloc : 2644 nr_max_alloc : 2169 nr_max_alloc : 90054 nr_chunks : 3 nr_chunks : 10 nr_max_chunks : 3 nr_max_chunks : 47 min_alloc_size : 4 min_alloc_size : 4 max_alloc_size : 1072 max_alloc_size : 1072 empty_pop_pages : 4 empty_pop_pages : 32 With the patchset + debug patch the results are as follows: Patched nr_alloc : 9040 nr_alloc : 97048 nr_dealloc : 6994 nr_dealloc : 94349 nr_cur_alloc : 2046 nr_cur_alloc : 2699 nr_max_alloc : 2194 nr_max_alloc : 90054 nr_chunks : 3 nr_chunks : 48 nr_max_chunks : 3 nr_max_chunks : 48 min_alloc_size : 4 min_alloc_size : 4 max_alloc_size : 1072 max_alloc_size : 1072 empty_pop_pages : 12 empty_pop_pages : 54 With the extra tracing I can see 39 entries of "Chunk (sidelined)" after the test was run. I don't see any entries for "Chunk (to depopulate)" I've snipped the results of slidelined chunks because they went on for ~600 lines, if you need the full logs let me know. Thank you, Pratik > -- > > From d0d2bfdb891afec6bd63790b3492b852db490640 Mon Sep 17 00:00:00 2001 > From: Roman Gushchin <guro@fb.com> > Date: Fri, 16 Apr 2021 09:54:38 -0700 > Subject: [PATCH] percpu: include sidelined and depopulating chunks into debug > output > > Information about sidelined chunks and chunks in the depopulate queue > could be extremely valuable for debugging different problems. > > Dump information about these chunks on pair with regular chunks > in percpu slots via percpu stats interface. > > Signed-off-by: Roman Gushchin <guro@fb.com> > --- > mm/percpu-internal.h | 2 ++ > mm/percpu-stats.c | 10 ++++++++++ > mm/percpu.c | 4 ++-- > 3 files changed, 14 insertions(+), 2 deletions(-) > > diff --git a/mm/percpu-internal.h b/mm/percpu-internal.h > index 8e432663c41e..c11f115ced5c 100644 > --- a/mm/percpu-internal.h > +++ b/mm/percpu-internal.h > @@ -90,6 +90,8 @@ extern spinlock_t pcpu_lock; > extern struct list_head *pcpu_chunk_lists; > extern int pcpu_nr_slots; > extern int pcpu_nr_empty_pop_pages[]; > +extern struct list_head pcpu_depopulate_list[]; > +extern struct list_head pcpu_sideline_list[]; > > extern struct pcpu_chunk *pcpu_first_chunk; > extern struct pcpu_chunk *pcpu_reserved_chunk; > diff --git a/mm/percpu-stats.c b/mm/percpu-stats.c > index f6026dbcdf6b..af09ed1ea5f8 100644 > --- a/mm/percpu-stats.c > +++ b/mm/percpu-stats.c > @@ -228,6 +228,16 @@ static int percpu_stats_show(struct seq_file *m, void *v) > } > } > } > + > + list_for_each_entry(chunk, &pcpu_sideline_list[type], list) { > + seq_puts(m, "Chunk (sidelined):\n"); > + chunk_map_stats(m, chunk, buffer); > + } > + > + list_for_each_entry(chunk, &pcpu_depopulate_list[type], list) { > + seq_puts(m, "Chunk (to depopulate):\n"); > + chunk_map_stats(m, chunk, buffer); > + } > } > > spin_unlock_irq(&pcpu_lock); > diff --git a/mm/percpu.c b/mm/percpu.c > index 5bb294e394b3..ded3a7541cb2 100644 > --- a/mm/percpu.c > +++ b/mm/percpu.c > @@ -185,13 +185,13 @@ int pcpu_nr_empty_pop_pages[PCPU_NR_CHUNK_TYPES]; > * List of chunks with a lot of free pages. Used to depopulate them > * asynchronously. > */ > -static struct list_head pcpu_depopulate_list[PCPU_NR_CHUNK_TYPES]; > +struct list_head pcpu_depopulate_list[PCPU_NR_CHUNK_TYPES]; > > /* > * List of previously depopulated chunks. They are not usually used for new > * allocations, but can be returned back to service if a need arises. 
> */ > -static struct list_head pcpu_sideline_list[PCPU_NR_CHUNK_TYPES]; > +struct list_head pcpu_sideline_list[PCPU_NR_CHUNK_TYPES]; > > > /*
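The counts Pratik reports above (39 sidelined chunks, none waiting to be depopulated) can be pulled straight out of a dump produced with the debug patch applied; the labels are the ones added by the seq_puts() calls in that patch:

--
# Count chunk categories in a percpu_stats dump (stats.after is a
# placeholder name for the post-test dump).
grep -c 'Chunk (sidelined)' stats.after
grep -c 'Chunk (to depopulate)' stats.after
--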
On Fri, Apr 16, 2021 at 11:57:03PM +0530, Pratik Sampat wrote: > > > On 16/04/21 10:43 pm, Roman Gushchin wrote: > > On Fri, Apr 16, 2021 at 08:58:33PM +0530, Pratik Sampat wrote: > > > Hello Dennis, > > > > > > I apologize for the clutter of logs before, I'm pasting the logs of before and > > > after the percpu test in the case of the patchset being applied on 5.12-rc6 and > > > the vanilla kernel 5.12-rc6. > > > > > > On 16/04/21 7:48 pm, Dennis Zhou wrote: > > > > Hello, > > > > > > > > On Fri, Apr 16, 2021 at 06:26:15PM +0530, Pratik Sampat wrote: > > > > > Hello Roman, > > > > > > > > > > I've tried the v3 patch series on a POWER9 and an x86 KVM setup. > > > > > > > > > > My results of the percpu_test are as follows: > > > > > Intel KVM 4CPU:4G > > > > > Vanilla 5.12-rc6 > > > > > # ./percpu_test.sh > > > > > Percpu: 1952 kB > > > > > Percpu: 219648 kB > > > > > Percpu: 219648 kB > > > > > > > > > > 5.12-rc6 + with patchset applied > > > > > # ./percpu_test.sh > > > > > Percpu: 2080 kB > > > > > Percpu: 219712 kB > > > > > Percpu: 72672 kB > > > > > > > > > > I'm able to see improvement comparable to that of what you're see too. > > > > > > > > > > However, on POWERPC I'm unable to reproduce these improvements with the patchset in the same configuration > > > > > > > > > > POWER9 KVM 4CPU:4G > > > > > Vanilla 5.12-rc6 > > > > > # ./percpu_test.sh > > > > > Percpu: 5888 kB > > > > > Percpu: 118272 kB > > > > > Percpu: 118272 kB > > > > > > > > > > 5.12-rc6 + with patchset applied > > > > > # ./percpu_test.sh > > > > > Percpu: 6144 kB > > > > > Percpu: 119040 kB > > > > > Percpu: 119040 kB > > > > > > > > > > I'm wondering if there's any architectural specific code that needs plumbing > > > > > here? > > > > > > > > > There shouldn't be. Can you send me the percpu_stats debug output before > > > > and after? > > > I'll paste the whole debug stats before and after here. > > > 5.12-rc6 + patchset > > > -----BEFORE----- > > > Percpu Memory Statistics > > > Allocation Info: > > > > Hm, this looks highly suspicious. Here is your stats in a more compact form: > > > > Vanilla > > > > nr_alloc : 9038 nr_alloc : 97046 > > nr_dealloc : 6992 nr_dealloc : 94237 > > nr_cur_alloc : 2046 nr_cur_alloc : 2809 > > nr_max_alloc : 2178 nr_max_alloc : 90054 > > nr_chunks : 3 nr_chunks : 11 > > nr_max_chunks : 3 nr_max_chunks : 47 > > min_alloc_size : 4 min_alloc_size : 4 > > max_alloc_size : 1072 max_alloc_size : 1072 > > empty_pop_pages : 5 empty_pop_pages : 29 > > > > > > Patched > > > > nr_alloc : 9040 nr_alloc : 97048 > > nr_dealloc : 6994 nr_dealloc : 95002 > > nr_cur_alloc : 2046 nr_cur_alloc : 2046 > > nr_max_alloc : 2208 nr_max_alloc : 90054 > > nr_chunks : 3 nr_chunks : 48 > > nr_max_chunks : 3 nr_max_chunks : 48 > > min_alloc_size : 4 min_alloc_size : 4 > > max_alloc_size : 1072 max_alloc_size : 1072 > > empty_pop_pages : 12 empty_pop_pages : 61 > > > > > > So it looks like the number of chunks got bigger, as well as the number of > > empty_pop_pages? This contradicts to what you wrote, so can you, please, make > > sure that the data is correct and we're not messing two cases? > > > > So it looks like for some reason sidelined (depopulated) chunks are not getting > > freed completely. But I struggle to explain why the initial empty_pop_pages is > > bigger with the same amount of chunks. > > > > So, can you, please, apply the following patch and provide an updated statistics? 
> > Unfortunately, I'm not completely well versed in this area, but yes the empty > pop pages number doesn't make sense to me either. > > I re-ran the numbers trying to make sure my experiment setup is sane but > results remain the same. > > Vanilla > nr_alloc : 9040 nr_alloc : 97048 > nr_dealloc : 6994 nr_dealloc : 94404 > nr_cur_alloc : 2046 nr_cur_alloc : 2644 > nr_max_alloc : 2169 nr_max_alloc : 90054 > nr_chunks : 3 nr_chunks : 10 > nr_max_chunks : 3 nr_max_chunks : 47 > min_alloc_size : 4 min_alloc_size : 4 > max_alloc_size : 1072 max_alloc_size : 1072 > empty_pop_pages : 4 empty_pop_pages : 32 > > With the patchset + debug patch the results are as follows: > Patched > > nr_alloc : 9040 nr_alloc : 97048 > nr_dealloc : 6994 nr_dealloc : 94349 > nr_cur_alloc : 2046 nr_cur_alloc : 2699 > nr_max_alloc : 2194 nr_max_alloc : 90054 > nr_chunks : 3 nr_chunks : 48 > nr_max_chunks : 3 nr_max_chunks : 48 > min_alloc_size : 4 min_alloc_size : 4 > max_alloc_size : 1072 max_alloc_size : 1072 > empty_pop_pages : 12 empty_pop_pages : 54 > > With the extra tracing I can see 39 entries of "Chunk (sidelined)" > after the test was run. I don't see any entries for "Chunk (to depopulate)" > > I've snipped the results of slidelined chunks because they went on for ~600 > lines, if you need the full logs let me know. Yes, please! That's the most interesting part!
On 17/04/21 12:04 am, Roman Gushchin wrote: > On Fri, Apr 16, 2021 at 11:57:03PM +0530, Pratik Sampat wrote: >> >> On 16/04/21 10:43 pm, Roman Gushchin wrote: >>> On Fri, Apr 16, 2021 at 08:58:33PM +0530, Pratik Sampat wrote: >>>> Hello Dennis, >>>> >>>> I apologize for the clutter of logs before, I'm pasting the logs of before and >>>> after the percpu test in the case of the patchset being applied on 5.12-rc6 and >>>> the vanilla kernel 5.12-rc6. >>>> >>>> On 16/04/21 7:48 pm, Dennis Zhou wrote: >>>>> Hello, >>>>> >>>>> On Fri, Apr 16, 2021 at 06:26:15PM +0530, Pratik Sampat wrote: >>>>>> Hello Roman, >>>>>> >>>>>> I've tried the v3 patch series on a POWER9 and an x86 KVM setup. >>>>>> >>>>>> My results of the percpu_test are as follows: >>>>>> Intel KVM 4CPU:4G >>>>>> Vanilla 5.12-rc6 >>>>>> # ./percpu_test.sh >>>>>> Percpu: 1952 kB >>>>>> Percpu: 219648 kB >>>>>> Percpu: 219648 kB >>>>>> >>>>>> 5.12-rc6 + with patchset applied >>>>>> # ./percpu_test.sh >>>>>> Percpu: 2080 kB >>>>>> Percpu: 219712 kB >>>>>> Percpu: 72672 kB >>>>>> >>>>>> I'm able to see improvement comparable to that of what you're see too. >>>>>> >>>>>> However, on POWERPC I'm unable to reproduce these improvements with the patchset in the same configuration >>>>>> >>>>>> POWER9 KVM 4CPU:4G >>>>>> Vanilla 5.12-rc6 >>>>>> # ./percpu_test.sh >>>>>> Percpu: 5888 kB >>>>>> Percpu: 118272 kB >>>>>> Percpu: 118272 kB >>>>>> >>>>>> 5.12-rc6 + with patchset applied >>>>>> # ./percpu_test.sh >>>>>> Percpu: 6144 kB >>>>>> Percpu: 119040 kB >>>>>> Percpu: 119040 kB >>>>>> >>>>>> I'm wondering if there's any architectural specific code that needs plumbing >>>>>> here? >>>>>> >>>>> There shouldn't be. Can you send me the percpu_stats debug output before >>>>> and after? >>>> I'll paste the whole debug stats before and after here. >>>> 5.12-rc6 + patchset >>>> -----BEFORE----- >>>> Percpu Memory Statistics >>>> Allocation Info: >>> Hm, this looks highly suspicious. Here is your stats in a more compact form: >>> >>> Vanilla >>> >>> nr_alloc : 9038 nr_alloc : 97046 >>> nr_dealloc : 6992 nr_dealloc : 94237 >>> nr_cur_alloc : 2046 nr_cur_alloc : 2809 >>> nr_max_alloc : 2178 nr_max_alloc : 90054 >>> nr_chunks : 3 nr_chunks : 11 >>> nr_max_chunks : 3 nr_max_chunks : 47 >>> min_alloc_size : 4 min_alloc_size : 4 >>> max_alloc_size : 1072 max_alloc_size : 1072 >>> empty_pop_pages : 5 empty_pop_pages : 29 >>> >>> >>> Patched >>> >>> nr_alloc : 9040 nr_alloc : 97048 >>> nr_dealloc : 6994 nr_dealloc : 95002 >>> nr_cur_alloc : 2046 nr_cur_alloc : 2046 >>> nr_max_alloc : 2208 nr_max_alloc : 90054 >>> nr_chunks : 3 nr_chunks : 48 >>> nr_max_chunks : 3 nr_max_chunks : 48 >>> min_alloc_size : 4 min_alloc_size : 4 >>> max_alloc_size : 1072 max_alloc_size : 1072 >>> empty_pop_pages : 12 empty_pop_pages : 61 >>> >>> >>> So it looks like the number of chunks got bigger, as well as the number of >>> empty_pop_pages? This contradicts to what you wrote, so can you, please, make >>> sure that the data is correct and we're not messing two cases? >>> >>> So it looks like for some reason sidelined (depopulated) chunks are not getting >>> freed completely. But I struggle to explain why the initial empty_pop_pages is >>> bigger with the same amount of chunks. >>> >>> So, can you, please, apply the following patch and provide an updated statistics? >> Unfortunately, I'm not completely well versed in this area, but yes the empty >> pop pages number doesn't make sense to me either. 
>> >> I re-ran the numbers trying to make sure my experiment setup is sane but >> results remain the same. >> >> Vanilla >> nr_alloc : 9040 nr_alloc : 97048 >> nr_dealloc : 6994 nr_dealloc : 94404 >> nr_cur_alloc : 2046 nr_cur_alloc : 2644 >> nr_max_alloc : 2169 nr_max_alloc : 90054 >> nr_chunks : 3 nr_chunks : 10 >> nr_max_chunks : 3 nr_max_chunks : 47 >> min_alloc_size : 4 min_alloc_size : 4 >> max_alloc_size : 1072 max_alloc_size : 1072 >> empty_pop_pages : 4 empty_pop_pages : 32 >> >> With the patchset + debug patch the results are as follows: >> Patched >> >> nr_alloc : 9040 nr_alloc : 97048 >> nr_dealloc : 6994 nr_dealloc : 94349 >> nr_cur_alloc : 2046 nr_cur_alloc : 2699 >> nr_max_alloc : 2194 nr_max_alloc : 90054 >> nr_chunks : 3 nr_chunks : 48 >> nr_max_chunks : 3 nr_max_chunks : 48 >> min_alloc_size : 4 min_alloc_size : 4 >> max_alloc_size : 1072 max_alloc_size : 1072 >> empty_pop_pages : 12 empty_pop_pages : 54 >> >> With the extra tracing I can see 39 entries of "Chunk (sidelined)" >> after the test was run. I don't see any entries for "Chunk (to depopulate)" >> >> I've snipped the results of slidelined chunks because they went on for ~600 >> lines, if you need the full logs let me know. > Yes, please! That's the most interesting part! Got it. Pasting the full logs of after the percpu experiment was completed Percpu Memory Statistics Allocation Info: ---------------------------------------- unit_size : 655360 static_size : 608920 reserved_size : 0 dyn_size : 46440 atom_size : 65536 alloc_size : 655360 Global Stats: ---------------------------------------- nr_alloc : 97048 nr_dealloc : 94349 nr_cur_alloc : 2699 nr_max_alloc : 90054 nr_chunks : 48 nr_max_chunks : 48 min_alloc_size : 4 max_alloc_size : 1072 empty_pop_pages : 54 Per Chunk Stats: ---------------------------------------- Chunk: <- First Chunk nr_alloc : 1081 max_alloc_size : 1072 empty_pop_pages : 0 first_bit : 16117 free_bytes : 4 contig_bytes : 4 sum_frag : 4 max_frag : 4 cur_min_alloc : 4 cur_med_alloc : 8 cur_max_alloc : 1072 memcg_aware : 0 Chunk: nr_alloc : 826 max_alloc_size : 1072 empty_pop_pages : 6 first_bit : 819 free_bytes : 640660 contig_bytes : 249896 sum_frag : 464700 max_frag : 306216 cur_min_alloc : 4 cur_med_alloc : 8 cur_max_alloc : 1072 memcg_aware : 0 Chunk: nr_alloc : 0 max_alloc_size : 0 empty_pop_pages : 1 first_bit : 0 free_bytes : 655360 contig_bytes : 655360 sum_frag : 0 max_frag : 0 cur_min_alloc : 0 cur_med_alloc : 0 cur_max_alloc : 0 memcg_aware : 0 Chunk: nr_alloc : 90 max_alloc_size : 1072 empty_pop_pages : 0 first_bit : 536 free_bytes : 595752 contig_bytes : 26164 sum_frag : 575132 max_frag : 26164 cur_min_alloc : 156 cur_med_alloc : 1072 cur_max_alloc : 1072 memcg_aware : 1 Chunk: nr_alloc : 90 max_alloc_size : 1072 empty_pop_pages : 0 first_bit : 0 free_bytes : 597428 contig_bytes : 26164 sum_frag : 596848 max_frag : 26164 cur_min_alloc : 156 cur_med_alloc : 312 cur_max_alloc : 1072 memcg_aware : 1 Chunk: nr_alloc : 92 max_alloc_size : 1072 empty_pop_pages : 0 first_bit : 0 free_bytes : 595284 contig_bytes : 26164 sum_frag : 590360 max_frag : 26164 cur_min_alloc : 156 cur_med_alloc : 312 cur_max_alloc : 1072 memcg_aware : 1 Chunk: nr_alloc : 92 max_alloc_size : 1072 empty_pop_pages : 0 first_bit : 0 free_bytes : 595284 contig_bytes : 26164 sum_frag : 583768 max_frag : 26164 cur_min_alloc : 156 cur_med_alloc : 312 cur_max_alloc : 1072 memcg_aware : 1 Chunk: nr_alloc : 360 max_alloc_size : 1072 empty_pop_pages : 7 first_bit : 26595 free_bytes : 506640 contig_bytes : 506540 sum_frag : 
100 max_frag : 36 cur_min_alloc : 4 cur_med_alloc : 156 cur_max_alloc : 1072 memcg_aware : 1 Chunk: nr_alloc : 12 max_alloc_size : 1072 empty_pop_pages : 3 first_bit : 0 free_bytes : 647524 contig_bytes : 563492 sum_frag : 57872 max_frag : 26164 cur_min_alloc : 156 cur_med_alloc : 312 cur_max_alloc : 1072 memcg_aware : 1 Chunk (sidelined): nr_alloc : 52 max_alloc_size : 1072 empty_pop_pages : 0 first_bit : 0 free_bytes : 621404 contig_bytes : 203104 sum_frag : 603400 max_frag : 260656 cur_min_alloc : 156 cur_med_alloc : 312 cur_max_alloc : 1072 memcg_aware : 1 Chunk (sidelined): nr_alloc : 4 max_alloc_size : 1072 empty_pop_pages : 0 first_bit : 0 free_bytes : 652748 contig_bytes : 570600 sum_frag : 570600 max_frag : 570600 cur_min_alloc : 156 cur_med_alloc : 312 cur_max_alloc : 1072 memcg_aware : 1 Chunk (sidelined): nr_alloc : 0 max_alloc_size : 1072 empty_pop_pages : 1 first_bit : 0 free_bytes : 655360 contig_bytes : 655360 sum_frag : 0 max_frag : 0 cur_min_alloc : 0 cur_med_alloc : 0 cur_max_alloc : 0 memcg_aware : 1 Chunk (sidelined): nr_alloc : 0 max_alloc_size : 1072 empty_pop_pages : 1 first_bit : 0 free_bytes : 655360 contig_bytes : 655360 sum_frag : 0 max_frag : 0 cur_min_alloc : 0 cur_med_alloc : 0 cur_max_alloc : 0 memcg_aware : 1 Chunk (sidelined): nr_alloc : 0 max_alloc_size : 1072 empty_pop_pages : 1 first_bit : 0 free_bytes : 655360 contig_bytes : 655360 sum_frag : 0 max_frag : 0 cur_min_alloc : 0 cur_med_alloc : 0 cur_max_alloc : 0 memcg_aware : 1 Chunk (sidelined): nr_alloc : 0 max_alloc_size : 1072 empty_pop_pages : 1 first_bit : 0 free_bytes : 655360 contig_bytes : 655360 sum_frag : 0 max_frag : 0 cur_min_alloc : 0 cur_med_alloc : 0 cur_max_alloc : 0 memcg_aware : 1 Chunk (sidelined): nr_alloc : 0 max_alloc_size : 1072 empty_pop_pages : 1 first_bit : 0 free_bytes : 655360 contig_bytes : 655360 sum_frag : 0 max_frag : 0 cur_min_alloc : 0 cur_med_alloc : 0 cur_max_alloc : 0 memcg_aware : 1 Chunk (sidelined): nr_alloc : 0 max_alloc_size : 1072 empty_pop_pages : 1 first_bit : 0 free_bytes : 655360 contig_bytes : 655360 sum_frag : 0 max_frag : 0 cur_min_alloc : 0 cur_med_alloc : 0 cur_max_alloc : 0 memcg_aware : 1 Chunk (sidelined): nr_alloc : 0 max_alloc_size : 1072 empty_pop_pages : 1 first_bit : 0 free_bytes : 655360 contig_bytes : 655360 sum_frag : 0 max_frag : 0 cur_min_alloc : 0 cur_med_alloc : 0 cur_max_alloc : 0 memcg_aware : 1 Chunk (sidelined): nr_alloc : 0 max_alloc_size : 1072 empty_pop_pages : 1 first_bit : 0 free_bytes : 655360 contig_bytes : 655360 sum_frag : 0 max_frag : 0 cur_min_alloc : 0 cur_med_alloc : 0 cur_max_alloc : 0 memcg_aware : 1 Chunk (sidelined): nr_alloc : 0 max_alloc_size : 1072 empty_pop_pages : 1 first_bit : 0 free_bytes : 655360 contig_bytes : 655360 sum_frag : 0 max_frag : 0 cur_min_alloc : 0 cur_med_alloc : 0 cur_max_alloc : 0 memcg_aware : 1 Chunk (sidelined): nr_alloc : 0 max_alloc_size : 1072 empty_pop_pages : 1 first_bit : 0 free_bytes : 655360 contig_bytes : 655360 sum_frag : 0 max_frag : 0 cur_min_alloc : 0 cur_med_alloc : 0 cur_max_alloc : 0 memcg_aware : 1 Chunk (sidelined): nr_alloc : 0 max_alloc_size : 1072 empty_pop_pages : 1 first_bit : 0 free_bytes : 655360 contig_bytes : 655360 sum_frag : 0 max_frag : 0 cur_min_alloc : 0 cur_med_alloc : 0 cur_max_alloc : 0 memcg_aware : 1 Chunk (sidelined): nr_alloc : 0 max_alloc_size : 1072 empty_pop_pages : 1 first_bit : 0 free_bytes : 655360 contig_bytes : 655360 sum_frag : 0 max_frag : 0 cur_min_alloc : 0 cur_med_alloc : 0 cur_max_alloc : 0 memcg_aware : 1 Chunk (sidelined): nr_alloc : 0 
max_alloc_size : 1072 empty_pop_pages : 1 first_bit : 0 free_bytes : 655360 contig_bytes : 655360 sum_frag : 0 max_frag : 0 cur_min_alloc : 0 cur_med_alloc : 0 cur_max_alloc : 0 memcg_aware : 1 Chunk (sidelined): nr_alloc : 0 max_alloc_size : 1072 empty_pop_pages : 1 first_bit : 0 free_bytes : 655360 contig_bytes : 655360 sum_frag : 0 max_frag : 0 cur_min_alloc : 0 cur_med_alloc : 0 cur_max_alloc : 0 memcg_aware : 1 Chunk (sidelined): nr_alloc : 0 max_alloc_size : 1072 empty_pop_pages : 1 first_bit : 0 free_bytes : 655360 contig_bytes : 655360 sum_frag : 0 max_frag : 0 cur_min_alloc : 0 cur_med_alloc : 0 cur_max_alloc : 0 memcg_aware : 1 Chunk (sidelined): nr_alloc : 0 max_alloc_size : 1072 empty_pop_pages : 1 first_bit : 0 free_bytes : 655360 contig_bytes : 655360 sum_frag : 0 max_frag : 0 cur_min_alloc : 0 cur_med_alloc : 0 cur_max_alloc : 0 memcg_aware : 1 Chunk (sidelined): nr_alloc : 0 max_alloc_size : 1072 empty_pop_pages : 1 first_bit : 0 free_bytes : 655360 contig_bytes : 655360 sum_frag : 0 max_frag : 0 cur_min_alloc : 0 cur_med_alloc : 0 cur_max_alloc : 0 memcg_aware : 1 Chunk (sidelined): nr_alloc : 0 max_alloc_size : 1072 empty_pop_pages : 1 first_bit : 0 free_bytes : 655360 contig_bytes : 655360 sum_frag : 0 max_frag : 0 cur_min_alloc : 0 cur_med_alloc : 0 cur_max_alloc : 0 memcg_aware : 1 Chunk (sidelined): nr_alloc : 0 max_alloc_size : 1072 empty_pop_pages : 1 first_bit : 0 free_bytes : 655360 contig_bytes : 655360 sum_frag : 0 max_frag : 0 cur_min_alloc : 0 cur_med_alloc : 0 cur_max_alloc : 0 memcg_aware : 1 Chunk (sidelined): nr_alloc : 0 max_alloc_size : 1072 empty_pop_pages : 1 first_bit : 0 free_bytes : 655360 contig_bytes : 655360 sum_frag : 0 max_frag : 0 cur_min_alloc : 0 cur_med_alloc : 0 cur_max_alloc : 0 memcg_aware : 1 Chunk (sidelined): nr_alloc : 0 max_alloc_size : 1072 empty_pop_pages : 1 first_bit : 0 free_bytes : 655360 contig_bytes : 655360 sum_frag : 0 max_frag : 0 cur_min_alloc : 0 cur_med_alloc : 0 cur_max_alloc : 0 memcg_aware : 1 Chunk (sidelined): nr_alloc : 0 max_alloc_size : 1072 empty_pop_pages : 1 first_bit : 0 free_bytes : 655360 contig_bytes : 655360 sum_frag : 0 max_frag : 0 cur_min_alloc : 0 cur_med_alloc : 0 cur_max_alloc : 0 memcg_aware : 1 Chunk (sidelined): nr_alloc : 0 max_alloc_size : 1072 empty_pop_pages : 1 first_bit : 0 free_bytes : 655360 contig_bytes : 655360 sum_frag : 0 max_frag : 0 cur_min_alloc : 0 cur_med_alloc : 0 cur_max_alloc : 0 memcg_aware : 1 Chunk (sidelined): nr_alloc : 0 max_alloc_size : 1072 empty_pop_pages : 1 first_bit : 0 free_bytes : 655360 contig_bytes : 655360 sum_frag : 0 max_frag : 0 cur_min_alloc : 0 cur_med_alloc : 0 cur_max_alloc : 0 memcg_aware : 1 Chunk (sidelined): nr_alloc : 0 max_alloc_size : 1072 empty_pop_pages : 1 first_bit : 0 free_bytes : 655360 contig_bytes : 655360 sum_frag : 0 max_frag : 0 cur_min_alloc : 0 cur_med_alloc : 0 cur_max_alloc : 0 memcg_aware : 1 Chunk (sidelined): nr_alloc : 0 max_alloc_size : 1072 empty_pop_pages : 1 first_bit : 0 free_bytes : 655360 contig_bytes : 655360 sum_frag : 0 max_frag : 0 cur_min_alloc : 0 cur_med_alloc : 0 cur_max_alloc : 0 memcg_aware : 1 Chunk (sidelined): nr_alloc : 0 max_alloc_size : 1072 empty_pop_pages : 1 first_bit : 0 free_bytes : 655360 contig_bytes : 655360 sum_frag : 0 max_frag : 0 cur_min_alloc : 0 cur_med_alloc : 0 cur_max_alloc : 0 memcg_aware : 1 Chunk (sidelined): nr_alloc : 0 max_alloc_size : 1072 empty_pop_pages : 1 first_bit : 0 free_bytes : 655360 contig_bytes : 655360 sum_frag : 0 max_frag : 0 cur_min_alloc : 0 cur_med_alloc : 0 
cur_max_alloc : 0 memcg_aware : 1 Chunk (sidelined): nr_alloc : 0 max_alloc_size : 1072 empty_pop_pages : 1 first_bit : 0 free_bytes : 655360 contig_bytes : 655360 sum_frag : 0 max_frag : 0 cur_min_alloc : 0 cur_med_alloc : 0 cur_max_alloc : 0 memcg_aware : 1 Chunk (sidelined): nr_alloc : 0 max_alloc_size : 1072 empty_pop_pages : 1 first_bit : 0 free_bytes : 655360 contig_bytes : 655360 sum_frag : 0 max_frag : 0 cur_min_alloc : 0 cur_med_alloc : 0 cur_max_alloc : 0 memcg_aware : 1 Chunk (sidelined): nr_alloc : 0 max_alloc_size : 1072 empty_pop_pages : 1 first_bit : 0 free_bytes : 655360 contig_bytes : 655360 sum_frag : 0 max_frag : 0 cur_min_alloc : 0 cur_med_alloc : 0 cur_max_alloc : 0 memcg_aware : 1 Chunk (sidelined): nr_alloc : 0 max_alloc_size : 1072 empty_pop_pages : 1 first_bit : 0 free_bytes : 655360 contig_bytes : 655360 sum_frag : 0 max_frag : 0 cur_min_alloc : 0 cur_med_alloc : 0 cur_max_alloc : 0 memcg_aware : 1 Chunk (sidelined): nr_alloc : 0 max_alloc_size : 1072 empty_pop_pages : 1 first_bit : 0 free_bytes : 655360 contig_bytes : 655360 sum_frag : 0 max_frag : 0 cur_min_alloc : 0 cur_med_alloc : 0 cur_max_alloc : 0 memcg_aware : 1 Chunk (sidelined): nr_alloc : 0 max_alloc_size : 1072 empty_pop_pages : 1 first_bit : 0 free_bytes : 655360 contig_bytes : 655360 sum_frag : 0 max_frag : 0 cur_min_alloc : 0 cur_med_alloc : 0 cur_max_alloc : 0 memcg_aware : 1 Chunk (sidelined): nr_alloc : 0 max_alloc_size : 1072 empty_pop_pages : 1 first_bit : 0 free_bytes : 655360 contig_bytes : 655360 sum_frag : 0 max_frag : 0 cur_min_alloc : 0 cur_med_alloc : 0 cur_max_alloc : 0 memcg_aware : 1 Chunk (sidelined): nr_alloc : 0 max_alloc_size : 1072 empty_pop_pages : 1 first_bit : 0 free_bytes : 655360 contig_bytes : 655360 sum_frag : 0 max_frag : 0 cur_min_alloc : 0 cur_med_alloc : 0 cur_max_alloc : 0 memcg_aware : 1 Chunk (sidelined): nr_alloc : 0 max_alloc_size : 1072 empty_pop_pages : 1 first_bit : 0 free_bytes : 655360 contig_bytes : 655360 sum_frag : 0 max_frag : 0 cur_min_alloc : 0 cur_med_alloc : 0 cur_max_alloc : 0 memcg_aware : 1
On Sat, Apr 17, 2021 at 12:11:37AM +0530, Pratik Sampat wrote: > > > On 17/04/21 12:04 am, Roman Gushchin wrote: > > On Fri, Apr 16, 2021 at 11:57:03PM +0530, Pratik Sampat wrote: > > > > > > On 16/04/21 10:43 pm, Roman Gushchin wrote: > > > > On Fri, Apr 16, 2021 at 08:58:33PM +0530, Pratik Sampat wrote: > > > > > Hello Dennis, > > > > > > > > > > I apologize for the clutter of logs before, I'm pasting the logs of before and > > > > > after the percpu test in the case of the patchset being applied on 5.12-rc6 and > > > > > the vanilla kernel 5.12-rc6. > > > > > > > > > > On 16/04/21 7:48 pm, Dennis Zhou wrote: > > > > > > Hello, > > > > > > > > > > > > On Fri, Apr 16, 2021 at 06:26:15PM +0530, Pratik Sampat wrote: > > > > > > > Hello Roman, > > > > > > > > > > > > > > I've tried the v3 patch series on a POWER9 and an x86 KVM setup. > > > > > > > > > > > > > > My results of the percpu_test are as follows: > > > > > > > Intel KVM 4CPU:4G > > > > > > > Vanilla 5.12-rc6 > > > > > > > # ./percpu_test.sh > > > > > > > Percpu: 1952 kB > > > > > > > Percpu: 219648 kB > > > > > > > Percpu: 219648 kB > > > > > > > > > > > > > > 5.12-rc6 + with patchset applied > > > > > > > # ./percpu_test.sh > > > > > > > Percpu: 2080 kB > > > > > > > Percpu: 219712 kB > > > > > > > Percpu: 72672 kB > > > > > > > > > > > > > > I'm able to see improvement comparable to that of what you're see too. > > > > > > > > > > > > > > However, on POWERPC I'm unable to reproduce these improvements with the patchset in the same configuration > > > > > > > > > > > > > > POWER9 KVM 4CPU:4G > > > > > > > Vanilla 5.12-rc6 > > > > > > > # ./percpu_test.sh > > > > > > > Percpu: 5888 kB > > > > > > > Percpu: 118272 kB > > > > > > > Percpu: 118272 kB > > > > > > > > > > > > > > 5.12-rc6 + with patchset applied > > > > > > > # ./percpu_test.sh > > > > > > > Percpu: 6144 kB > > > > > > > Percpu: 119040 kB > > > > > > > Percpu: 119040 kB > > > > > > > > > > > > > > I'm wondering if there's any architectural specific code that needs plumbing > > > > > > > here? > > > > > > > > > > > > > There shouldn't be. Can you send me the percpu_stats debug output before > > > > > > and after? > > > > > I'll paste the whole debug stats before and after here. > > > > > 5.12-rc6 + patchset > > > > > -----BEFORE----- > > > > > Percpu Memory Statistics > > > > > Allocation Info: > > > > Hm, this looks highly suspicious. Here is your stats in a more compact form: > > > > > > > > Vanilla > > > > > > > > nr_alloc : 9038 nr_alloc : 97046 > > > > nr_dealloc : 6992 nr_dealloc : 94237 > > > > nr_cur_alloc : 2046 nr_cur_alloc : 2809 > > > > nr_max_alloc : 2178 nr_max_alloc : 90054 > > > > nr_chunks : 3 nr_chunks : 11 > > > > nr_max_chunks : 3 nr_max_chunks : 47 > > > > min_alloc_size : 4 min_alloc_size : 4 > > > > max_alloc_size : 1072 max_alloc_size : 1072 > > > > empty_pop_pages : 5 empty_pop_pages : 29 > > > > > > > > > > > > Patched > > > > > > > > nr_alloc : 9040 nr_alloc : 97048 > > > > nr_dealloc : 6994 nr_dealloc : 95002 > > > > nr_cur_alloc : 2046 nr_cur_alloc : 2046 > > > > nr_max_alloc : 2208 nr_max_alloc : 90054 > > > > nr_chunks : 3 nr_chunks : 48 > > > > nr_max_chunks : 3 nr_max_chunks : 48 > > > > min_alloc_size : 4 min_alloc_size : 4 > > > > max_alloc_size : 1072 max_alloc_size : 1072 > > > > empty_pop_pages : 12 empty_pop_pages : 61 > > > > > > > > > > > > So it looks like the number of chunks got bigger, as well as the number of > > > > empty_pop_pages? 
This contradicts to what you wrote, so can you, please, make > > > > sure that the data is correct and we're not messing two cases? > > > > > > > > So it looks like for some reason sidelined (depopulated) chunks are not getting > > > > freed completely. But I struggle to explain why the initial empty_pop_pages is > > > > bigger with the same amount of chunks. > > > > > > > > So, can you, please, apply the following patch and provide an updated statistics? > > > Unfortunately, I'm not completely well versed in this area, but yes the empty > > > pop pages number doesn't make sense to me either. > > > > > > I re-ran the numbers trying to make sure my experiment setup is sane but > > > results remain the same. > > > > > > Vanilla > > > nr_alloc : 9040 nr_alloc : 97048 > > > nr_dealloc : 6994 nr_dealloc : 94404 > > > nr_cur_alloc : 2046 nr_cur_alloc : 2644 > > > nr_max_alloc : 2169 nr_max_alloc : 90054 > > > nr_chunks : 3 nr_chunks : 10 > > > nr_max_chunks : 3 nr_max_chunks : 47 > > > min_alloc_size : 4 min_alloc_size : 4 > > > max_alloc_size : 1072 max_alloc_size : 1072 > > > empty_pop_pages : 4 empty_pop_pages : 32 > > > > > > With the patchset + debug patch the results are as follows: > > > Patched > > > > > > nr_alloc : 9040 nr_alloc : 97048 > > > nr_dealloc : 6994 nr_dealloc : 94349 > > > nr_cur_alloc : 2046 nr_cur_alloc : 2699 > > > nr_max_alloc : 2194 nr_max_alloc : 90054 > > > nr_chunks : 3 nr_chunks : 48 > > > nr_max_chunks : 3 nr_max_chunks : 48 > > > min_alloc_size : 4 min_alloc_size : 4 > > > max_alloc_size : 1072 max_alloc_size : 1072 > > > empty_pop_pages : 12 empty_pop_pages : 54 > > > > > > With the extra tracing I can see 39 entries of "Chunk (sidelined)" > > > after the test was run. I don't see any entries for "Chunk (to depopulate)" > > > > > > I've snipped the results of slidelined chunks because they went on for ~600 > > > lines, if you need the full logs let me know. > > Yes, please! That's the most interesting part! > > Got it. Pasting the full logs of after the percpu experiment was completed Thanks! Would you mind to apply the following patch and test again? -- diff --git a/mm/percpu.c b/mm/percpu.c index ded3a7541cb2..532c6a7ebdfd 100644 --- a/mm/percpu.c +++ b/mm/percpu.c @@ -2296,6 +2296,9 @@ void free_percpu(void __percpu *ptr) need_balance = true; break; } + + chunk->depopulated = false; + pcpu_chunk_relocate(chunk, -1); } else if (chunk != pcpu_first_chunk && chunk != pcpu_reserved_chunk && !chunk->isolated && (pcpu_nr_empty_pop_pages[pcpu_chunk_type(chunk)] >
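A note on the two lines added above: judging by the later discussion in this thread, the intent is that when a free leaves a sidelined (depopulated) chunk completely empty, the chunk is moved back onto the regular chunk lists so the balance worker can find it and release it to the system. Below is a minimal, self-contained userspace sketch of that idea; all struct fields and helper names are illustrative assumptions, not the real mm/percpu.c interfaces.

--
/*
 * Toy model only -- not kernel code. It mimics the effect of the debug
 * patch above: once the last allocation in a sidelined chunk is freed,
 * clear the depopulated flag and make the chunk visible to the balance
 * worker again.
 */
#include <stdbool.h>
#include <stdio.h>

struct chunk {
	int nr_alloc;		/* live allocations in this chunk */
	bool depopulated;	/* sidelined by partial depopulation */
	bool on_regular_lists;	/* visible to the balance worker */
};

/* stands in for pcpu_chunk_relocate(chunk, -1) */
static void relocate_to_regular_lists(struct chunk *c)
{
	c->on_regular_lists = true;
}

static void free_one(struct chunk *c)
{
	c->nr_alloc--;
	if (c->nr_alloc == 0 && c->depopulated) {
		/* the two lines added by the debug patch */
		c->depopulated = false;
		relocate_to_regular_lists(c);
	}
}

int main(void)
{
	struct chunk c = { .nr_alloc = 1, .depopulated = true, .on_regular_lists = false };

	free_one(&c);
	/* prints: depopulated=0 on_regular_lists=1 */
	printf("depopulated=%d on_regular_lists=%d\n", c.depopulated, c.on_regular_lists);
	return 0;
}
--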
On 17/04/21 12:39 am, Roman Gushchin wrote: > On Sat, Apr 17, 2021 at 12:11:37AM +0530, Pratik Sampat wrote: >> >> On 17/04/21 12:04 am, Roman Gushchin wrote: >>> On Fri, Apr 16, 2021 at 11:57:03PM +0530, Pratik Sampat wrote: >>>> On 16/04/21 10:43 pm, Roman Gushchin wrote: >>>>> On Fri, Apr 16, 2021 at 08:58:33PM +0530, Pratik Sampat wrote: >>>>>> Hello Dennis, >>>>>> >>>>>> I apologize for the clutter of logs before, I'm pasting the logs of before and >>>>>> after the percpu test in the case of the patchset being applied on 5.12-rc6 and >>>>>> the vanilla kernel 5.12-rc6. >>>>>> >>>>>> On 16/04/21 7:48 pm, Dennis Zhou wrote: >>>>>>> Hello, >>>>>>> >>>>>>> On Fri, Apr 16, 2021 at 06:26:15PM +0530, Pratik Sampat wrote: >>>>>>>> Hello Roman, >>>>>>>> >>>>>>>> I've tried the v3 patch series on a POWER9 and an x86 KVM setup. >>>>>>>> >>>>>>>> My results of the percpu_test are as follows: >>>>>>>> Intel KVM 4CPU:4G >>>>>>>> Vanilla 5.12-rc6 >>>>>>>> # ./percpu_test.sh >>>>>>>> Percpu: 1952 kB >>>>>>>> Percpu: 219648 kB >>>>>>>> Percpu: 219648 kB >>>>>>>> >>>>>>>> 5.12-rc6 + with patchset applied >>>>>>>> # ./percpu_test.sh >>>>>>>> Percpu: 2080 kB >>>>>>>> Percpu: 219712 kB >>>>>>>> Percpu: 72672 kB >>>>>>>> >>>>>>>> I'm able to see improvement comparable to that of what you're see too. >>>>>>>> >>>>>>>> However, on POWERPC I'm unable to reproduce these improvements with the patchset in the same configuration >>>>>>>> >>>>>>>> POWER9 KVM 4CPU:4G >>>>>>>> Vanilla 5.12-rc6 >>>>>>>> # ./percpu_test.sh >>>>>>>> Percpu: 5888 kB >>>>>>>> Percpu: 118272 kB >>>>>>>> Percpu: 118272 kB >>>>>>>> >>>>>>>> 5.12-rc6 + with patchset applied >>>>>>>> # ./percpu_test.sh >>>>>>>> Percpu: 6144 kB >>>>>>>> Percpu: 119040 kB >>>>>>>> Percpu: 119040 kB >>>>>>>> >>>>>>>> I'm wondering if there's any architectural specific code that needs plumbing >>>>>>>> here? >>>>>>>> >>>>>>> There shouldn't be. Can you send me the percpu_stats debug output before >>>>>>> and after? >>>>>> I'll paste the whole debug stats before and after here. >>>>>> 5.12-rc6 + patchset >>>>>> -----BEFORE----- >>>>>> Percpu Memory Statistics >>>>>> Allocation Info: >>>>> Hm, this looks highly suspicious. Here is your stats in a more compact form: >>>>> >>>>> Vanilla >>>>> >>>>> nr_alloc : 9038 nr_alloc : 97046 >>>>> nr_dealloc : 6992 nr_dealloc : 94237 >>>>> nr_cur_alloc : 2046 nr_cur_alloc : 2809 >>>>> nr_max_alloc : 2178 nr_max_alloc : 90054 >>>>> nr_chunks : 3 nr_chunks : 11 >>>>> nr_max_chunks : 3 nr_max_chunks : 47 >>>>> min_alloc_size : 4 min_alloc_size : 4 >>>>> max_alloc_size : 1072 max_alloc_size : 1072 >>>>> empty_pop_pages : 5 empty_pop_pages : 29 >>>>> >>>>> >>>>> Patched >>>>> >>>>> nr_alloc : 9040 nr_alloc : 97048 >>>>> nr_dealloc : 6994 nr_dealloc : 95002 >>>>> nr_cur_alloc : 2046 nr_cur_alloc : 2046 >>>>> nr_max_alloc : 2208 nr_max_alloc : 90054 >>>>> nr_chunks : 3 nr_chunks : 48 >>>>> nr_max_chunks : 3 nr_max_chunks : 48 >>>>> min_alloc_size : 4 min_alloc_size : 4 >>>>> max_alloc_size : 1072 max_alloc_size : 1072 >>>>> empty_pop_pages : 12 empty_pop_pages : 61 >>>>> >>>>> >>>>> So it looks like the number of chunks got bigger, as well as the number of >>>>> empty_pop_pages? This contradicts to what you wrote, so can you, please, make >>>>> sure that the data is correct and we're not messing two cases? >>>>> >>>>> So it looks like for some reason sidelined (depopulated) chunks are not getting >>>>> freed completely. But I struggle to explain why the initial empty_pop_pages is >>>>> bigger with the same amount of chunks. 
>>>>> >>>>> So, can you, please, apply the following patch and provide an updated statistics? >>>> Unfortunately, I'm not completely well versed in this area, but yes the empty >>>> pop pages number doesn't make sense to me either. >>>> >>>> I re-ran the numbers trying to make sure my experiment setup is sane but >>>> results remain the same. >>>> >>>> Vanilla >>>> nr_alloc : 9040 nr_alloc : 97048 >>>> nr_dealloc : 6994 nr_dealloc : 94404 >>>> nr_cur_alloc : 2046 nr_cur_alloc : 2644 >>>> nr_max_alloc : 2169 nr_max_alloc : 90054 >>>> nr_chunks : 3 nr_chunks : 10 >>>> nr_max_chunks : 3 nr_max_chunks : 47 >>>> min_alloc_size : 4 min_alloc_size : 4 >>>> max_alloc_size : 1072 max_alloc_size : 1072 >>>> empty_pop_pages : 4 empty_pop_pages : 32 >>>> >>>> With the patchset + debug patch the results are as follows: >>>> Patched >>>> >>>> nr_alloc : 9040 nr_alloc : 97048 >>>> nr_dealloc : 6994 nr_dealloc : 94349 >>>> nr_cur_alloc : 2046 nr_cur_alloc : 2699 >>>> nr_max_alloc : 2194 nr_max_alloc : 90054 >>>> nr_chunks : 3 nr_chunks : 48 >>>> nr_max_chunks : 3 nr_max_chunks : 48 >>>> min_alloc_size : 4 min_alloc_size : 4 >>>> max_alloc_size : 1072 max_alloc_size : 1072 >>>> empty_pop_pages : 12 empty_pop_pages : 54 >>>> >>>> With the extra tracing I can see 39 entries of "Chunk (sidelined)" >>>> after the test was run. I don't see any entries for "Chunk (to depopulate)" >>>> >>>> I've snipped the results of slidelined chunks because they went on for ~600 >>>> lines, if you need the full logs let me know. >>> Yes, please! That's the most interesting part! >> Got it. Pasting the full logs of after the percpu experiment was completed > Thanks! > > Would you mind to apply the following patch and test again? > > -- > > diff --git a/mm/percpu.c b/mm/percpu.c > index ded3a7541cb2..532c6a7ebdfd 100644 > --- a/mm/percpu.c > +++ b/mm/percpu.c > @@ -2296,6 +2296,9 @@ void free_percpu(void __percpu *ptr) > need_balance = true; > break; > } > + > + chunk->depopulated = false; > + pcpu_chunk_relocate(chunk, -1); > } else if (chunk != pcpu_first_chunk && chunk != pcpu_reserved_chunk && > !chunk->isolated && > (pcpu_nr_empty_pop_pages[pcpu_chunk_type(chunk)] > > Sure thing. I see much lower sideline chunks. 
In one such test run I saw zero occurrences of slidelined chunks Pasting the full logs as an example: BEFORE Percpu Memory Statistics Allocation Info: ---------------------------------------- unit_size : 655360 static_size : 608920 reserved_size : 0 dyn_size : 46440 atom_size : 65536 alloc_size : 655360 Global Stats: ---------------------------------------- nr_alloc : 9038 nr_dealloc : 6992 nr_cur_alloc : 2046 nr_max_alloc : 2200 nr_chunks : 3 nr_max_chunks : 3 min_alloc_size : 4 max_alloc_size : 1072 empty_pop_pages : 12 Per Chunk Stats: ---------------------------------------- Chunk: <- First Chunk nr_alloc : 1092 max_alloc_size : 1072 empty_pop_pages : 0 first_bit : 16247 free_bytes : 4 contig_bytes : 4 sum_frag : 4 max_frag : 4 cur_min_alloc : 4 cur_med_alloc : 8 cur_max_alloc : 1072 memcg_aware : 0 Chunk: nr_alloc : 594 max_alloc_size : 992 empty_pop_pages : 8 first_bit : 456 free_bytes : 645008 contig_bytes : 319984 sum_frag : 325024 max_frag : 318680 cur_min_alloc : 4 cur_med_alloc : 8 cur_max_alloc : 424 memcg_aware : 0 Chunk: nr_alloc : 360 max_alloc_size : 1072 empty_pop_pages : 4 first_bit : 26595 free_bytes : 506640 contig_bytes : 506540 sum_frag : 100 max_frag : 32 cur_min_alloc : 4 cur_med_alloc : 156 cur_max_alloc : 1072 memcg_aware : 1 AFTER Percpu Memory Statistics Allocation Info: ---------------------------------------- unit_size : 655360 static_size : 608920 reserved_size : 0 dyn_size : 46440 atom_size : 65536 alloc_size : 655360 Global Stats: ---------------------------------------- nr_alloc : 97046 nr_dealloc : 94304 nr_cur_alloc : 2742 nr_max_alloc : 90054 nr_chunks : 11 nr_max_chunks : 47 min_alloc_size : 4 max_alloc_size : 1072 empty_pop_pages : 18 Per Chunk Stats: ---------------------------------------- Chunk: <- First Chunk nr_alloc : 1092 max_alloc_size : 1072 empty_pop_pages : 0 first_bit : 16247 free_bytes : 4 contig_bytes : 4 sum_frag : 4 max_frag : 4 cur_min_alloc : 4 cur_med_alloc : 8 cur_max_alloc : 1072 memcg_aware : 0 Chunk: nr_alloc : 838 max_alloc_size : 1072 empty_pop_pages : 7 first_bit : 464 free_bytes : 640476 contig_bytes : 290672 sum_frag : 349804 max_frag : 304344 cur_min_alloc : 4 cur_med_alloc : 8 cur_max_alloc : 1072 memcg_aware : 0 Chunk: nr_alloc : 90 max_alloc_size : 1072 empty_pop_pages : 0 first_bit : 536 free_bytes : 595752 contig_bytes : 26164 sum_frag : 575132 max_frag : 26164 cur_min_alloc : 156 cur_med_alloc : 1072 cur_max_alloc : 1072 memcg_aware : 1 Chunk: nr_alloc : 90 max_alloc_size : 1072 empty_pop_pages : 0 first_bit : 0 free_bytes : 597428 contig_bytes : 26164 sum_frag : 596848 max_frag : 26164 cur_min_alloc : 156 cur_med_alloc : 312 cur_max_alloc : 1072 memcg_aware : 1 Chunk: nr_alloc : 92 max_alloc_size : 1072 empty_pop_pages : 0 first_bit : 0 free_bytes : 595284 contig_bytes : 26164 sum_frag : 590360 max_frag : 26164 cur_min_alloc : 156 cur_med_alloc : 312 cur_max_alloc : 1072 memcg_aware : 1 Chunk: nr_alloc : 92 max_alloc_size : 1072 empty_pop_pages : 0 first_bit : 0 free_bytes : 595284 contig_bytes : 26164 sum_frag : 583768 max_frag : 26164 cur_min_alloc : 156 cur_med_alloc : 312 cur_max_alloc : 1072 memcg_aware : 1 Chunk: nr_alloc : 360 max_alloc_size : 1072 empty_pop_pages : 7 first_bit : 26595 free_bytes : 506640 contig_bytes : 506540 sum_frag : 100 max_frag : 32 cur_min_alloc : 4 cur_med_alloc : 156 cur_max_alloc : 1072 memcg_aware : 1 Chunk: nr_alloc : 12 max_alloc_size : 1072 empty_pop_pages : 3 first_bit : 0 free_bytes : 647524 contig_bytes : 563492 sum_frag : 57872 max_frag : 26164 cur_min_alloc : 156 
cur_med_alloc : 312 cur_max_alloc : 1072 memcg_aware : 1 Chunk: nr_alloc : 0 max_alloc_size : 1072 empty_pop_pages : 1 first_bit : 0 free_bytes : 655360 contig_bytes : 655360 sum_frag : 0 max_frag : 0 cur_min_alloc : 0 cur_med_alloc : 0 cur_max_alloc : 0 memcg_aware : 1 Chunk (sidelined): nr_alloc : 72 max_alloc_size : 1072 empty_pop_pages : 0 first_bit : 0 free_bytes : 608344 contig_bytes : 145552 sum_frag : 590340 max_frag : 145552 cur_min_alloc : 156 cur_med_alloc : 312 cur_max_alloc : 1072 memcg_aware : 1 Chunk (sidelined): nr_alloc : 4 max_alloc_size : 1072 empty_pop_pages : 0 first_bit : 0 free_bytes : 652748 contig_bytes : 426720 sum_frag : 426720 max_frag : 426720 cur_min_alloc : 156 cur_med_alloc : 312 cur_max_alloc : 1072 memcg_aware : 1
On Sat, Apr 17, 2021 at 01:14:03AM +0530, Pratik Sampat wrote: > > > On 17/04/21 12:39 am, Roman Gushchin wrote: > > On Sat, Apr 17, 2021 at 12:11:37AM +0530, Pratik Sampat wrote: > > > > > > On 17/04/21 12:04 am, Roman Gushchin wrote: > > > > On Fri, Apr 16, 2021 at 11:57:03PM +0530, Pratik Sampat wrote: > > > > > On 16/04/21 10:43 pm, Roman Gushchin wrote: > > > > > > On Fri, Apr 16, 2021 at 08:58:33PM +0530, Pratik Sampat wrote: > > > > > > > Hello Dennis, > > > > > > > > > > > > > > I apologize for the clutter of logs before, I'm pasting the logs of before and > > > > > > > after the percpu test in the case of the patchset being applied on 5.12-rc6 and > > > > > > > the vanilla kernel 5.12-rc6. > > > > > > > > > > > > > > On 16/04/21 7:48 pm, Dennis Zhou wrote: > > > > > > > > Hello, > > > > > > > > > > > > > > > > On Fri, Apr 16, 2021 at 06:26:15PM +0530, Pratik Sampat wrote: > > > > > > > > > Hello Roman, > > > > > > > > > > > > > > > > > > I've tried the v3 patch series on a POWER9 and an x86 KVM setup. > > > > > > > > > > > > > > > > > > My results of the percpu_test are as follows: > > > > > > > > > Intel KVM 4CPU:4G > > > > > > > > > Vanilla 5.12-rc6 > > > > > > > > > # ./percpu_test.sh > > > > > > > > > Percpu: 1952 kB > > > > > > > > > Percpu: 219648 kB > > > > > > > > > Percpu: 219648 kB > > > > > > > > > > > > > > > > > > 5.12-rc6 + with patchset applied > > > > > > > > > # ./percpu_test.sh > > > > > > > > > Percpu: 2080 kB > > > > > > > > > Percpu: 219712 kB > > > > > > > > > Percpu: 72672 kB > > > > > > > > > > > > > > > > > > I'm able to see improvement comparable to that of what you're see too. > > > > > > > > > > > > > > > > > > However, on POWERPC I'm unable to reproduce these improvements with the patchset in the same configuration > > > > > > > > > > > > > > > > > > POWER9 KVM 4CPU:4G > > > > > > > > > Vanilla 5.12-rc6 > > > > > > > > > # ./percpu_test.sh > > > > > > > > > Percpu: 5888 kB > > > > > > > > > Percpu: 118272 kB > > > > > > > > > Percpu: 118272 kB > > > > > > > > > > > > > > > > > > 5.12-rc6 + with patchset applied > > > > > > > > > # ./percpu_test.sh > > > > > > > > > Percpu: 6144 kB > > > > > > > > > Percpu: 119040 kB > > > > > > > > > Percpu: 119040 kB > > > > > > > > > > > > > > > > > > I'm wondering if there's any architectural specific code that needs plumbing > > > > > > > > > here? > > > > > > > > > > > > > > > > > There shouldn't be. Can you send me the percpu_stats debug output before > > > > > > > > and after? > > > > > > > I'll paste the whole debug stats before and after here. > > > > > > > 5.12-rc6 + patchset > > > > > > > -----BEFORE----- > > > > > > > Percpu Memory Statistics > > > > > > > Allocation Info: > > > > > > Hm, this looks highly suspicious. 
Here is your stats in a more compact form: > > > > > > > > > > > > Vanilla > > > > > > > > > > > > nr_alloc : 9038 nr_alloc : 97046 > > > > > > nr_dealloc : 6992 nr_dealloc : 94237 > > > > > > nr_cur_alloc : 2046 nr_cur_alloc : 2809 > > > > > > nr_max_alloc : 2178 nr_max_alloc : 90054 > > > > > > nr_chunks : 3 nr_chunks : 11 > > > > > > nr_max_chunks : 3 nr_max_chunks : 47 > > > > > > min_alloc_size : 4 min_alloc_size : 4 > > > > > > max_alloc_size : 1072 max_alloc_size : 1072 > > > > > > empty_pop_pages : 5 empty_pop_pages : 29 > > > > > > > > > > > > > > > > > > Patched > > > > > > > > > > > > nr_alloc : 9040 nr_alloc : 97048 > > > > > > nr_dealloc : 6994 nr_dealloc : 95002 > > > > > > nr_cur_alloc : 2046 nr_cur_alloc : 2046 > > > > > > nr_max_alloc : 2208 nr_max_alloc : 90054 > > > > > > nr_chunks : 3 nr_chunks : 48 > > > > > > nr_max_chunks : 3 nr_max_chunks : 48 > > > > > > min_alloc_size : 4 min_alloc_size : 4 > > > > > > max_alloc_size : 1072 max_alloc_size : 1072 > > > > > > empty_pop_pages : 12 empty_pop_pages : 61 > > > > > > > > > > > > > > > > > > So it looks like the number of chunks got bigger, as well as the number of > > > > > > empty_pop_pages? This contradicts to what you wrote, so can you, please, make > > > > > > sure that the data is correct and we're not messing two cases? > > > > > > > > > > > > So it looks like for some reason sidelined (depopulated) chunks are not getting > > > > > > freed completely. But I struggle to explain why the initial empty_pop_pages is > > > > > > bigger with the same amount of chunks. > > > > > > > > > > > > So, can you, please, apply the following patch and provide an updated statistics? > > > > > Unfortunately, I'm not completely well versed in this area, but yes the empty > > > > > pop pages number doesn't make sense to me either. > > > > > > > > > > I re-ran the numbers trying to make sure my experiment setup is sane but > > > > > results remain the same. > > > > > > > > > > Vanilla > > > > > nr_alloc : 9040 nr_alloc : 97048 > > > > > nr_dealloc : 6994 nr_dealloc : 94404 > > > > > nr_cur_alloc : 2046 nr_cur_alloc : 2644 > > > > > nr_max_alloc : 2169 nr_max_alloc : 90054 > > > > > nr_chunks : 3 nr_chunks : 10 > > > > > nr_max_chunks : 3 nr_max_chunks : 47 > > > > > min_alloc_size : 4 min_alloc_size : 4 > > > > > max_alloc_size : 1072 max_alloc_size : 1072 > > > > > empty_pop_pages : 4 empty_pop_pages : 32 > > > > > > > > > > With the patchset + debug patch the results are as follows: > > > > > Patched > > > > > > > > > > nr_alloc : 9040 nr_alloc : 97048 > > > > > nr_dealloc : 6994 nr_dealloc : 94349 > > > > > nr_cur_alloc : 2046 nr_cur_alloc : 2699 > > > > > nr_max_alloc : 2194 nr_max_alloc : 90054 > > > > > nr_chunks : 3 nr_chunks : 48 > > > > > nr_max_chunks : 3 nr_max_chunks : 48 > > > > > min_alloc_size : 4 min_alloc_size : 4 > > > > > max_alloc_size : 1072 max_alloc_size : 1072 > > > > > empty_pop_pages : 12 empty_pop_pages : 54 > > > > > > > > > > With the extra tracing I can see 39 entries of "Chunk (sidelined)" > > > > > after the test was run. I don't see any entries for "Chunk (to depopulate)" > > > > > > > > > > I've snipped the results of slidelined chunks because they went on for ~600 > > > > > lines, if you need the full logs let me know. > > > > Yes, please! That's the most interesting part! > > > Got it. Pasting the full logs of after the percpu experiment was completed > > Thanks! > > > > Would you mind to apply the following patch and test again? 
> > > > -- > > > > diff --git a/mm/percpu.c b/mm/percpu.c > > index ded3a7541cb2..532c6a7ebdfd 100644 > > --- a/mm/percpu.c > > +++ b/mm/percpu.c > > @@ -2296,6 +2296,9 @@ void free_percpu(void __percpu *ptr) > > need_balance = true; > > break; > > } > > + > > + chunk->depopulated = false; > > + pcpu_chunk_relocate(chunk, -1); > > } else if (chunk != pcpu_first_chunk && chunk != pcpu_reserved_chunk && > > !chunk->isolated && > > (pcpu_nr_empty_pop_pages[pcpu_chunk_type(chunk)] > > > > Sure thing. > > I see much lower sideline chunks. In one such test run I saw zero occurrences > of slidelined chunks > So looking at the stats it now works properly. Do you see any savings in comparison to vanilla? The size of savings can significantly depend on the exact size of cgroup-related objects, how many of them fit into a single chunk, etc. So you might want to play with numbers in the test... Anyway, thank you very much for the report and your work on testing follow-up patches! It helped to reveal a serious bug in the implementation (completely empty sidelined chunks were not released in some cases), which by pure coincidence wasn't triggered on x86. Thanks!
Hello, On Sat, Apr 17, 2021 at 01:14:03AM +0530, Pratik Sampat wrote: > > > On 17/04/21 12:39 am, Roman Gushchin wrote: > > On Sat, Apr 17, 2021 at 12:11:37AM +0530, Pratik Sampat wrote: > > > > > > On 17/04/21 12:04 am, Roman Gushchin wrote: > > > > On Fri, Apr 16, 2021 at 11:57:03PM +0530, Pratik Sampat wrote: > > > > > On 16/04/21 10:43 pm, Roman Gushchin wrote: > > > > > > On Fri, Apr 16, 2021 at 08:58:33PM +0530, Pratik Sampat wrote: > > > > > > > Hello Dennis, > > > > > > > > > > > > > > I apologize for the clutter of logs before, I'm pasting the logs of before and > > > > > > > after the percpu test in the case of the patchset being applied on 5.12-rc6 and > > > > > > > the vanilla kernel 5.12-rc6. > > > > > > > > > > > > > > On 16/04/21 7:48 pm, Dennis Zhou wrote: > > > > > > > > Hello, > > > > > > > > > > > > > > > > On Fri, Apr 16, 2021 at 06:26:15PM +0530, Pratik Sampat wrote: > > > > > > > > > Hello Roman, > > > > > > > > > > > > > > > > > > I've tried the v3 patch series on a POWER9 and an x86 KVM setup. > > > > > > > > > > > > > > > > > > My results of the percpu_test are as follows: > > > > > > > > > Intel KVM 4CPU:4G > > > > > > > > > Vanilla 5.12-rc6 > > > > > > > > > # ./percpu_test.sh > > > > > > > > > Percpu: 1952 kB > > > > > > > > > Percpu: 219648 kB > > > > > > > > > Percpu: 219648 kB > > > > > > > > > > > > > > > > > > 5.12-rc6 + with patchset applied > > > > > > > > > # ./percpu_test.sh > > > > > > > > > Percpu: 2080 kB > > > > > > > > > Percpu: 219712 kB > > > > > > > > > Percpu: 72672 kB > > > > > > > > > > > > > > > > > > I'm able to see improvement comparable to that of what you're see too. > > > > > > > > > > > > > > > > > > However, on POWERPC I'm unable to reproduce these improvements with the patchset in the same configuration > > > > > > > > > > > > > > > > > > POWER9 KVM 4CPU:4G > > > > > > > > > Vanilla 5.12-rc6 > > > > > > > > > # ./percpu_test.sh > > > > > > > > > Percpu: 5888 kB > > > > > > > > > Percpu: 118272 kB > > > > > > > > > Percpu: 118272 kB > > > > > > > > > > > > > > > > > > 5.12-rc6 + with patchset applied > > > > > > > > > # ./percpu_test.sh > > > > > > > > > Percpu: 6144 kB > > > > > > > > > Percpu: 119040 kB > > > > > > > > > Percpu: 119040 kB > > > > > > > > > > > > > > > > > > I'm wondering if there's any architectural specific code that needs plumbing > > > > > > > > > here? > > > > > > > > > > > > > > > > > There shouldn't be. Can you send me the percpu_stats debug output before > > > > > > > > and after? > > > > > > > I'll paste the whole debug stats before and after here. > > > > > > > 5.12-rc6 + patchset > > > > > > > -----BEFORE----- > > > > > > > Percpu Memory Statistics > > > > > > > Allocation Info: > > > > > > Hm, this looks highly suspicious. 
Here is your stats in a more compact form: > > > > > > > > > > > > Vanilla > > > > > > > > > > > > nr_alloc : 9038 nr_alloc : 97046 > > > > > > nr_dealloc : 6992 nr_dealloc : 94237 > > > > > > nr_cur_alloc : 2046 nr_cur_alloc : 2809 > > > > > > nr_max_alloc : 2178 nr_max_alloc : 90054 > > > > > > nr_chunks : 3 nr_chunks : 11 > > > > > > nr_max_chunks : 3 nr_max_chunks : 47 > > > > > > min_alloc_size : 4 min_alloc_size : 4 > > > > > > max_alloc_size : 1072 max_alloc_size : 1072 > > > > > > empty_pop_pages : 5 empty_pop_pages : 29 > > > > > > > > > > > > > > > > > > Patched > > > > > > > > > > > > nr_alloc : 9040 nr_alloc : 97048 > > > > > > nr_dealloc : 6994 nr_dealloc : 95002 > > > > > > nr_cur_alloc : 2046 nr_cur_alloc : 2046 > > > > > > nr_max_alloc : 2208 nr_max_alloc : 90054 > > > > > > nr_chunks : 3 nr_chunks : 48 > > > > > > nr_max_chunks : 3 nr_max_chunks : 48 > > > > > > min_alloc_size : 4 min_alloc_size : 4 > > > > > > max_alloc_size : 1072 max_alloc_size : 1072 > > > > > > empty_pop_pages : 12 empty_pop_pages : 61 > > > > > > > > > > > > > > > > > > So it looks like the number of chunks got bigger, as well as the number of > > > > > > empty_pop_pages? This contradicts to what you wrote, so can you, please, make > > > > > > sure that the data is correct and we're not messing two cases? > > > > > > > > > > > > So it looks like for some reason sidelined (depopulated) chunks are not getting > > > > > > freed completely. But I struggle to explain why the initial empty_pop_pages is > > > > > > bigger with the same amount of chunks. > > > > > > > > > > > > So, can you, please, apply the following patch and provide an updated statistics? > > > > > Unfortunately, I'm not completely well versed in this area, but yes the empty > > > > > pop pages number doesn't make sense to me either. > > > > > > > > > > I re-ran the numbers trying to make sure my experiment setup is sane but > > > > > results remain the same. > > > > > > > > > > Vanilla > > > > > nr_alloc : 9040 nr_alloc : 97048 > > > > > nr_dealloc : 6994 nr_dealloc : 94404 > > > > > nr_cur_alloc : 2046 nr_cur_alloc : 2644 > > > > > nr_max_alloc : 2169 nr_max_alloc : 90054 > > > > > nr_chunks : 3 nr_chunks : 10 > > > > > nr_max_chunks : 3 nr_max_chunks : 47 > > > > > min_alloc_size : 4 min_alloc_size : 4 > > > > > max_alloc_size : 1072 max_alloc_size : 1072 > > > > > empty_pop_pages : 4 empty_pop_pages : 32 > > > > > > > > > > With the patchset + debug patch the results are as follows: > > > > > Patched > > > > > > > > > > nr_alloc : 9040 nr_alloc : 97048 > > > > > nr_dealloc : 6994 nr_dealloc : 94349 > > > > > nr_cur_alloc : 2046 nr_cur_alloc : 2699 > > > > > nr_max_alloc : 2194 nr_max_alloc : 90054 > > > > > nr_chunks : 3 nr_chunks : 48 > > > > > nr_max_chunks : 3 nr_max_chunks : 48 > > > > > min_alloc_size : 4 min_alloc_size : 4 > > > > > max_alloc_size : 1072 max_alloc_size : 1072 > > > > > empty_pop_pages : 12 empty_pop_pages : 54 > > > > > > > > > > With the extra tracing I can see 39 entries of "Chunk (sidelined)" > > > > > after the test was run. I don't see any entries for "Chunk (to depopulate)" > > > > > > > > > > I've snipped the results of slidelined chunks because they went on for ~600 > > > > > lines, if you need the full logs let me know. > > > > Yes, please! That's the most interesting part! > > > Got it. Pasting the full logs of after the percpu experiment was completed > > Thanks! > > > > Would you mind to apply the following patch and test again? 
> > > > -- > > > > diff --git a/mm/percpu.c b/mm/percpu.c > > index ded3a7541cb2..532c6a7ebdfd 100644 > > --- a/mm/percpu.c > > +++ b/mm/percpu.c > > @@ -2296,6 +2296,9 @@ void free_percpu(void __percpu *ptr) > > need_balance = true; > > break; > > } > > + > > + chunk->depopulated = false; > > + pcpu_chunk_relocate(chunk, -1); > > } else if (chunk != pcpu_first_chunk && chunk != pcpu_reserved_chunk && > > !chunk->isolated && > > (pcpu_nr_empty_pop_pages[pcpu_chunk_type(chunk)] > > > > Sure thing. > > I see much lower sideline chunks. In one such test run I saw zero occurrences > of slidelined chunks > > Pasting the full logs as an example: > > BEFORE > Percpu Memory Statistics > Allocation Info: > ---------------------------------------- > unit_size : 655360 > static_size : 608920 > reserved_size : 0 > dyn_size : 46440 > atom_size : 65536 > alloc_size : 655360 > > Global Stats: > ---------------------------------------- > nr_alloc : 9038 > nr_dealloc : 6992 > nr_cur_alloc : 2046 > nr_max_alloc : 2200 > nr_chunks : 3 > nr_max_chunks : 3 > min_alloc_size : 4 > max_alloc_size : 1072 > empty_pop_pages : 12 > > Per Chunk Stats: > ---------------------------------------- > Chunk: <- First Chunk > nr_alloc : 1092 > max_alloc_size : 1072 > empty_pop_pages : 0 > first_bit : 16247 > free_bytes : 4 > contig_bytes : 4 > sum_frag : 4 > max_frag : 4 > cur_min_alloc : 4 > cur_med_alloc : 8 > cur_max_alloc : 1072 > memcg_aware : 0 > > Chunk: > nr_alloc : 594 > max_alloc_size : 992 > empty_pop_pages : 8 > first_bit : 456 > free_bytes : 645008 > contig_bytes : 319984 > sum_frag : 325024 > max_frag : 318680 > cur_min_alloc : 4 > cur_med_alloc : 8 > cur_max_alloc : 424 > memcg_aware : 0 > > Chunk: > nr_alloc : 360 > max_alloc_size : 1072 > empty_pop_pages : 4 > first_bit : 26595 > free_bytes : 506640 > contig_bytes : 506540 > sum_frag : 100 > max_frag : 32 > cur_min_alloc : 4 > cur_med_alloc : 156 > cur_max_alloc : 1072 > memcg_aware : 1 > > > AFTER > Percpu Memory Statistics > Allocation Info: > ---------------------------------------- > unit_size : 655360 > static_size : 608920 > reserved_size : 0 > dyn_size : 46440 > atom_size : 65536 > alloc_size : 655360 > > Global Stats: > ---------------------------------------- > nr_alloc : 97046 > nr_dealloc : 94304 > nr_cur_alloc : 2742 > nr_max_alloc : 90054 > nr_chunks : 11 > nr_max_chunks : 47 > min_alloc_size : 4 > max_alloc_size : 1072 > empty_pop_pages : 18 > > Per Chunk Stats: > ---------------------------------------- > Chunk: <- First Chunk > nr_alloc : 1092 > max_alloc_size : 1072 > empty_pop_pages : 0 > first_bit : 16247 > free_bytes : 4 > contig_bytes : 4 > sum_frag : 4 > max_frag : 4 > cur_min_alloc : 4 > cur_med_alloc : 8 > cur_max_alloc : 1072 > memcg_aware : 0 > > Chunk: > nr_alloc : 838 > max_alloc_size : 1072 > empty_pop_pages : 7 > first_bit : 464 > free_bytes : 640476 > contig_bytes : 290672 > sum_frag : 349804 > max_frag : 304344 > cur_min_alloc : 4 > cur_med_alloc : 8 > cur_max_alloc : 1072 > memcg_aware : 0 > > Chunk: > nr_alloc : 90 > max_alloc_size : 1072 > empty_pop_pages : 0 > first_bit : 536 > free_bytes : 595752 > contig_bytes : 26164 > sum_frag : 575132 > max_frag : 26164 > cur_min_alloc : 156 > cur_med_alloc : 1072 > cur_max_alloc : 1072 > memcg_aware : 1 > > Chunk: > nr_alloc : 90 > max_alloc_size : 1072 > empty_pop_pages : 0 > first_bit : 0 > free_bytes : 597428 > contig_bytes : 26164 > sum_frag : 596848 > max_frag : 26164 > cur_min_alloc : 156 > cur_med_alloc : 312 > cur_max_alloc : 1072 > memcg_aware : 1 > > Chunk: > nr_alloc : 92 
> max_alloc_size : 1072 > empty_pop_pages : 0 > first_bit : 0 > free_bytes : 595284 > contig_bytes : 26164 > sum_frag : 590360 > max_frag : 26164 > cur_min_alloc : 156 > cur_med_alloc : 312 > cur_max_alloc : 1072 > memcg_aware : 1 > > Chunk: > nr_alloc : 92 > max_alloc_size : 1072 > empty_pop_pages : 0 > first_bit : 0 > free_bytes : 595284 > contig_bytes : 26164 > sum_frag : 583768 > max_frag : 26164 > cur_min_alloc : 156 > cur_med_alloc : 312 > cur_max_alloc : 1072 > memcg_aware : 1 > > Chunk: > nr_alloc : 360 > max_alloc_size : 1072 > empty_pop_pages : 7 > first_bit : 26595 > free_bytes : 506640 > contig_bytes : 506540 > sum_frag : 100 > max_frag : 32 > cur_min_alloc : 4 > cur_med_alloc : 156 > cur_max_alloc : 1072 > memcg_aware : 1 > > Chunk: > nr_alloc : 12 > max_alloc_size : 1072 > empty_pop_pages : 3 > first_bit : 0 > free_bytes : 647524 > contig_bytes : 563492 > sum_frag : 57872 > max_frag : 26164 > cur_min_alloc : 156 > cur_med_alloc : 312 > cur_max_alloc : 1072 > memcg_aware : 1 > > Chunk: > nr_alloc : 0 > max_alloc_size : 1072 > empty_pop_pages : 1 > first_bit : 0 > free_bytes : 655360 > contig_bytes : 655360 > sum_frag : 0 > max_frag : 0 > cur_min_alloc : 0 > cur_med_alloc : 0 > cur_max_alloc : 0 > memcg_aware : 1 > > Chunk (sidelined): > nr_alloc : 72 > max_alloc_size : 1072 > empty_pop_pages : 0 > first_bit : 0 > free_bytes : 608344 > contig_bytes : 145552 > sum_frag : 590340 > max_frag : 145552 > cur_min_alloc : 156 > cur_med_alloc : 312 > cur_max_alloc : 1072 > memcg_aware : 1 > > Chunk (sidelined): > nr_alloc : 4 > max_alloc_size : 1072 > empty_pop_pages : 0 > first_bit : 0 > free_bytes : 652748 > contig_bytes : 426720 > sum_frag : 426720 > max_frag : 426720 > cur_min_alloc : 156 > cur_med_alloc : 312 > cur_max_alloc : 1072 > memcg_aware : 1 > > Thank you Pratik for testing this and working with us to resolve this. I greatly appreciate it! Thanks, Dennis
On 17/04/21 1:33 am, Roman Gushchin wrote: > On Sat, Apr 17, 2021 at 01:14:03AM +0530, Pratik Sampat wrote: >> >> On 17/04/21 12:39 am, Roman Gushchin wrote: >>> On Sat, Apr 17, 2021 at 12:11:37AM +0530, Pratik Sampat wrote: >>>> On 17/04/21 12:04 am, Roman Gushchin wrote: >>>>> On Fri, Apr 16, 2021 at 11:57:03PM +0530, Pratik Sampat wrote: >>>>>> On 16/04/21 10:43 pm, Roman Gushchin wrote: >>>>>>> On Fri, Apr 16, 2021 at 08:58:33PM +0530, Pratik Sampat wrote: >>>>>>>> Hello Dennis, >>>>>>>> >>>>>>>> I apologize for the clutter of logs before, I'm pasting the logs of before and >>>>>>>> after the percpu test in the case of the patchset being applied on 5.12-rc6 and >>>>>>>> the vanilla kernel 5.12-rc6. >>>>>>>> >>>>>>>> On 16/04/21 7:48 pm, Dennis Zhou wrote: >>>>>>>>> Hello, >>>>>>>>> >>>>>>>>> On Fri, Apr 16, 2021 at 06:26:15PM +0530, Pratik Sampat wrote: >>>>>>>>>> Hello Roman, >>>>>>>>>> >>>>>>>>>> I've tried the v3 patch series on a POWER9 and an x86 KVM setup. >>>>>>>>>> >>>>>>>>>> My results of the percpu_test are as follows: >>>>>>>>>> Intel KVM 4CPU:4G >>>>>>>>>> Vanilla 5.12-rc6 >>>>>>>>>> # ./percpu_test.sh >>>>>>>>>> Percpu: 1952 kB >>>>>>>>>> Percpu: 219648 kB >>>>>>>>>> Percpu: 219648 kB >>>>>>>>>> >>>>>>>>>> 5.12-rc6 + with patchset applied >>>>>>>>>> # ./percpu_test.sh >>>>>>>>>> Percpu: 2080 kB >>>>>>>>>> Percpu: 219712 kB >>>>>>>>>> Percpu: 72672 kB >>>>>>>>>> >>>>>>>>>> I'm able to see improvement comparable to that of what you're see too. >>>>>>>>>> >>>>>>>>>> However, on POWERPC I'm unable to reproduce these improvements with the patchset in the same configuration >>>>>>>>>> >>>>>>>>>> POWER9 KVM 4CPU:4G >>>>>>>>>> Vanilla 5.12-rc6 >>>>>>>>>> # ./percpu_test.sh >>>>>>>>>> Percpu: 5888 kB >>>>>>>>>> Percpu: 118272 kB >>>>>>>>>> Percpu: 118272 kB >>>>>>>>>> >>>>>>>>>> 5.12-rc6 + with patchset applied >>>>>>>>>> # ./percpu_test.sh >>>>>>>>>> Percpu: 6144 kB >>>>>>>>>> Percpu: 119040 kB >>>>>>>>>> Percpu: 119040 kB >>>>>>>>>> >>>>>>>>>> I'm wondering if there's any architectural specific code that needs plumbing >>>>>>>>>> here? >>>>>>>>>> >>>>>>>>> There shouldn't be. Can you send me the percpu_stats debug output before >>>>>>>>> and after? >>>>>>>> I'll paste the whole debug stats before and after here. >>>>>>>> 5.12-rc6 + patchset >>>>>>>> -----BEFORE----- >>>>>>>> Percpu Memory Statistics >>>>>>>> Allocation Info: >>>>>>> Hm, this looks highly suspicious. Here is your stats in a more compact form: >>>>>>> >>>>>>> Vanilla >>>>>>> >>>>>>> nr_alloc : 9038 nr_alloc : 97046 >>>>>>> nr_dealloc : 6992 nr_dealloc : 94237 >>>>>>> nr_cur_alloc : 2046 nr_cur_alloc : 2809 >>>>>>> nr_max_alloc : 2178 nr_max_alloc : 90054 >>>>>>> nr_chunks : 3 nr_chunks : 11 >>>>>>> nr_max_chunks : 3 nr_max_chunks : 47 >>>>>>> min_alloc_size : 4 min_alloc_size : 4 >>>>>>> max_alloc_size : 1072 max_alloc_size : 1072 >>>>>>> empty_pop_pages : 5 empty_pop_pages : 29 >>>>>>> >>>>>>> >>>>>>> Patched >>>>>>> >>>>>>> nr_alloc : 9040 nr_alloc : 97048 >>>>>>> nr_dealloc : 6994 nr_dealloc : 95002 >>>>>>> nr_cur_alloc : 2046 nr_cur_alloc : 2046 >>>>>>> nr_max_alloc : 2208 nr_max_alloc : 90054 >>>>>>> nr_chunks : 3 nr_chunks : 48 >>>>>>> nr_max_chunks : 3 nr_max_chunks : 48 >>>>>>> min_alloc_size : 4 min_alloc_size : 4 >>>>>>> max_alloc_size : 1072 max_alloc_size : 1072 >>>>>>> empty_pop_pages : 12 empty_pop_pages : 61 >>>>>>> >>>>>>> >>>>>>> So it looks like the number of chunks got bigger, as well as the number of >>>>>>> empty_pop_pages? 
This contradicts to what you wrote, so can you, please, make >>>>>>> sure that the data is correct and we're not messing two cases? >>>>>>> >>>>>>> So it looks like for some reason sidelined (depopulated) chunks are not getting >>>>>>> freed completely. But I struggle to explain why the initial empty_pop_pages is >>>>>>> bigger with the same amount of chunks. >>>>>>> >>>>>>> So, can you, please, apply the following patch and provide an updated statistics? >>>>>> Unfortunately, I'm not completely well versed in this area, but yes the empty >>>>>> pop pages number doesn't make sense to me either. >>>>>> >>>>>> I re-ran the numbers trying to make sure my experiment setup is sane but >>>>>> results remain the same. >>>>>> >>>>>> Vanilla >>>>>> nr_alloc : 9040 nr_alloc : 97048 >>>>>> nr_dealloc : 6994 nr_dealloc : 94404 >>>>>> nr_cur_alloc : 2046 nr_cur_alloc : 2644 >>>>>> nr_max_alloc : 2169 nr_max_alloc : 90054 >>>>>> nr_chunks : 3 nr_chunks : 10 >>>>>> nr_max_chunks : 3 nr_max_chunks : 47 >>>>>> min_alloc_size : 4 min_alloc_size : 4 >>>>>> max_alloc_size : 1072 max_alloc_size : 1072 >>>>>> empty_pop_pages : 4 empty_pop_pages : 32 >>>>>> >>>>>> With the patchset + debug patch the results are as follows: >>>>>> Patched >>>>>> >>>>>> nr_alloc : 9040 nr_alloc : 97048 >>>>>> nr_dealloc : 6994 nr_dealloc : 94349 >>>>>> nr_cur_alloc : 2046 nr_cur_alloc : 2699 >>>>>> nr_max_alloc : 2194 nr_max_alloc : 90054 >>>>>> nr_chunks : 3 nr_chunks : 48 >>>>>> nr_max_chunks : 3 nr_max_chunks : 48 >>>>>> min_alloc_size : 4 min_alloc_size : 4 >>>>>> max_alloc_size : 1072 max_alloc_size : 1072 >>>>>> empty_pop_pages : 12 empty_pop_pages : 54 >>>>>> >>>>>> With the extra tracing I can see 39 entries of "Chunk (sidelined)" >>>>>> after the test was run. I don't see any entries for "Chunk (to depopulate)" >>>>>> >>>>>> I've snipped the results of slidelined chunks because they went on for ~600 >>>>>> lines, if you need the full logs let me know. >>>>> Yes, please! That's the most interesting part! >>>> Got it. Pasting the full logs of after the percpu experiment was completed >>> Thanks! >>> >>> Would you mind to apply the following patch and test again? >>> >>> -- >>> >>> diff --git a/mm/percpu.c b/mm/percpu.c >>> index ded3a7541cb2..532c6a7ebdfd 100644 >>> --- a/mm/percpu.c >>> +++ b/mm/percpu.c >>> @@ -2296,6 +2296,9 @@ void free_percpu(void __percpu *ptr) >>> need_balance = true; >>> break; >>> } >>> + >>> + chunk->depopulated = false; >>> + pcpu_chunk_relocate(chunk, -1); >>> } else if (chunk != pcpu_first_chunk && chunk != pcpu_reserved_chunk && >>> !chunk->isolated && >>> (pcpu_nr_empty_pop_pages[pcpu_chunk_type(chunk)] > >>> >> Sure thing. >> >> I see much lower sideline chunks. In one such test run I saw zero occurrences >> of slidelined chunks >> > So looking at the stats it now works properly. Do you see any savings in > comparison to vanilla? The size of savings can significanlty depend on the exact > size of cgroup-related objects, how many of them fit into a single chunk, etc. > So you might want to play with numbers in the test... > > Anyway, thank you very much for the report and your work on testing follow-up > patches! It helped to reveal a serious bug in the implementation (completely > empty sidelined chunks were not released in some cases), which by pure > coincidence wasn't triggered on x86. > > Thanks! > Unfortunately not, I don't see any savings from the test. 
# ./percpu_test_roman.sh
Percpu: 6144 kB
Percpu: 122880 kB
Percpu: 122880 kB

I had assumed that because POWER has a larger page size, we would also see higher fragmentation, which could possibly lead to a lot more savings. I'll dive deeper into the patches and tweak the setup to see if I can understand this behavior.

Thanks for helping me understand this patchset a little better, and I'm glad we found a bug with sidelined chunks! I'll get back to you if I find something interesting and need help understanding it.

Thank you again,
Pratik
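A rough back-of-envelope on the page-size point: both guests in this thread have 4 vCPUs, and a 64 KB POWER base page size is an assumption consistent with the atom_size of 65536 in the stats above. Each depopulated percpu page returns page_size * nr_cpus bytes, so the depopulation granularity is much coarser on POWER: a chunk has to accumulate a fully empty 64 KB region per unit before anything can be handed back, which may be part of why the savings are harder to reproduce there.

--
/* Back-of-envelope only; the 64 KB POWER page size is an assumption. */
#include <stdio.h>

int main(void)
{
	const long nr_cpus = 4;
	const struct { const char *name; long page_size; } setups[] = {
		{ "x86 KVM",    4096  },
		{ "POWER9 KVM", 65536 },
	};

	for (int i = 0; i < 2; i++)
		printf("%s: %ld kB returned per depopulated percpu page\n",
		       setups[i].name, setups[i].page_size * nr_cpus / 1024);
	/* x86 KVM: 16 kB, POWER9 KVM: 256 kB */
	return 0;
}
--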
On 17/04/21 3:17 am, Dennis Zhou wrote: > Hello, > > On Sat, Apr 17, 2021 at 01:14:03AM +0530, Pratik Sampat wrote: >> >> On 17/04/21 12:39 am, Roman Gushchin wrote: >>> On Sat, Apr 17, 2021 at 12:11:37AM +0530, Pratik Sampat wrote: >>>> On 17/04/21 12:04 am, Roman Gushchin wrote: >>>>> On Fri, Apr 16, 2021 at 11:57:03PM +0530, Pratik Sampat wrote: >>>>>> On 16/04/21 10:43 pm, Roman Gushchin wrote: >>>>>>> On Fri, Apr 16, 2021 at 08:58:33PM +0530, Pratik Sampat wrote: >>>>>>>> Hello Dennis, >>>>>>>> >>>>>>>> I apologize for the clutter of logs before, I'm pasting the logs of before and >>>>>>>> after the percpu test in the case of the patchset being applied on 5.12-rc6 and >>>>>>>> the vanilla kernel 5.12-rc6. >>>>>>>> >>>>>>>> On 16/04/21 7:48 pm, Dennis Zhou wrote: >>>>>>>>> Hello, >>>>>>>>> >>>>>>>>> On Fri, Apr 16, 2021 at 06:26:15PM +0530, Pratik Sampat wrote: >>>>>>>>>> Hello Roman, >>>>>>>>>> >>>>>>>>>> I've tried the v3 patch series on a POWER9 and an x86 KVM setup. >>>>>>>>>> >>>>>>>>>> My results of the percpu_test are as follows: >>>>>>>>>> Intel KVM 4CPU:4G >>>>>>>>>> Vanilla 5.12-rc6 >>>>>>>>>> # ./percpu_test.sh >>>>>>>>>> Percpu: 1952 kB >>>>>>>>>> Percpu: 219648 kB >>>>>>>>>> Percpu: 219648 kB >>>>>>>>>> >>>>>>>>>> 5.12-rc6 + with patchset applied >>>>>>>>>> # ./percpu_test.sh >>>>>>>>>> Percpu: 2080 kB >>>>>>>>>> Percpu: 219712 kB >>>>>>>>>> Percpu: 72672 kB >>>>>>>>>> >>>>>>>>>> I'm able to see improvement comparable to that of what you're see too. >>>>>>>>>> >>>>>>>>>> However, on POWERPC I'm unable to reproduce these improvements with the patchset in the same configuration >>>>>>>>>> >>>>>>>>>> POWER9 KVM 4CPU:4G >>>>>>>>>> Vanilla 5.12-rc6 >>>>>>>>>> # ./percpu_test.sh >>>>>>>>>> Percpu: 5888 kB >>>>>>>>>> Percpu: 118272 kB >>>>>>>>>> Percpu: 118272 kB >>>>>>>>>> >>>>>>>>>> 5.12-rc6 + with patchset applied >>>>>>>>>> # ./percpu_test.sh >>>>>>>>>> Percpu: 6144 kB >>>>>>>>>> Percpu: 119040 kB >>>>>>>>>> Percpu: 119040 kB >>>>>>>>>> >>>>>>>>>> I'm wondering if there's any architectural specific code that needs plumbing >>>>>>>>>> here? >>>>>>>>>> >>>>>>>>> There shouldn't be. Can you send me the percpu_stats debug output before >>>>>>>>> and after? >>>>>>>> I'll paste the whole debug stats before and after here. >>>>>>>> 5.12-rc6 + patchset >>>>>>>> -----BEFORE----- >>>>>>>> Percpu Memory Statistics >>>>>>>> Allocation Info: >>>>>>> Hm, this looks highly suspicious. Here is your stats in a more compact form: >>>>>>> >>>>>>> Vanilla >>>>>>> >>>>>>> nr_alloc : 9038 nr_alloc : 97046 >>>>>>> nr_dealloc : 6992 nr_dealloc : 94237 >>>>>>> nr_cur_alloc : 2046 nr_cur_alloc : 2809 >>>>>>> nr_max_alloc : 2178 nr_max_alloc : 90054 >>>>>>> nr_chunks : 3 nr_chunks : 11 >>>>>>> nr_max_chunks : 3 nr_max_chunks : 47 >>>>>>> min_alloc_size : 4 min_alloc_size : 4 >>>>>>> max_alloc_size : 1072 max_alloc_size : 1072 >>>>>>> empty_pop_pages : 5 empty_pop_pages : 29 >>>>>>> >>>>>>> >>>>>>> Patched >>>>>>> >>>>>>> nr_alloc : 9040 nr_alloc : 97048 >>>>>>> nr_dealloc : 6994 nr_dealloc : 95002 >>>>>>> nr_cur_alloc : 2046 nr_cur_alloc : 2046 >>>>>>> nr_max_alloc : 2208 nr_max_alloc : 90054 >>>>>>> nr_chunks : 3 nr_chunks : 48 >>>>>>> nr_max_chunks : 3 nr_max_chunks : 48 >>>>>>> min_alloc_size : 4 min_alloc_size : 4 >>>>>>> max_alloc_size : 1072 max_alloc_size : 1072 >>>>>>> empty_pop_pages : 12 empty_pop_pages : 61 >>>>>>> >>>>>>> >>>>>>> So it looks like the number of chunks got bigger, as well as the number of >>>>>>> empty_pop_pages? 
This contradicts to what you wrote, so can you, please, make >>>>>>> sure that the data is correct and we're not messing two cases? >>>>>>> >>>>>>> So it looks like for some reason sidelined (depopulated) chunks are not getting >>>>>>> freed completely. But I struggle to explain why the initial empty_pop_pages is >>>>>>> bigger with the same amount of chunks. >>>>>>> >>>>>>> So, can you, please, apply the following patch and provide an updated statistics? >>>>>> Unfortunately, I'm not completely well versed in this area, but yes the empty >>>>>> pop pages number doesn't make sense to me either. >>>>>> >>>>>> I re-ran the numbers trying to make sure my experiment setup is sane but >>>>>> results remain the same. >>>>>> >>>>>> Vanilla >>>>>> nr_alloc : 9040 nr_alloc : 97048 >>>>>> nr_dealloc : 6994 nr_dealloc : 94404 >>>>>> nr_cur_alloc : 2046 nr_cur_alloc : 2644 >>>>>> nr_max_alloc : 2169 nr_max_alloc : 90054 >>>>>> nr_chunks : 3 nr_chunks : 10 >>>>>> nr_max_chunks : 3 nr_max_chunks : 47 >>>>>> min_alloc_size : 4 min_alloc_size : 4 >>>>>> max_alloc_size : 1072 max_alloc_size : 1072 >>>>>> empty_pop_pages : 4 empty_pop_pages : 32 >>>>>> >>>>>> With the patchset + debug patch the results are as follows: >>>>>> Patched >>>>>> >>>>>> nr_alloc : 9040 nr_alloc : 97048 >>>>>> nr_dealloc : 6994 nr_dealloc : 94349 >>>>>> nr_cur_alloc : 2046 nr_cur_alloc : 2699 >>>>>> nr_max_alloc : 2194 nr_max_alloc : 90054 >>>>>> nr_chunks : 3 nr_chunks : 48 >>>>>> nr_max_chunks : 3 nr_max_chunks : 48 >>>>>> min_alloc_size : 4 min_alloc_size : 4 >>>>>> max_alloc_size : 1072 max_alloc_size : 1072 >>>>>> empty_pop_pages : 12 empty_pop_pages : 54 >>>>>> >>>>>> With the extra tracing I can see 39 entries of "Chunk (sidelined)" >>>>>> after the test was run. I don't see any entries for "Chunk (to depopulate)" >>>>>> >>>>>> I've snipped the results of slidelined chunks because they went on for ~600 >>>>>> lines, if you need the full logs let me know. >>>>> Yes, please! That's the most interesting part! >>>> Got it. Pasting the full logs of after the percpu experiment was completed >>> Thanks! >>> >>> Would you mind to apply the following patch and test again? >>> >>> -- >>> >>> diff --git a/mm/percpu.c b/mm/percpu.c >>> index ded3a7541cb2..532c6a7ebdfd 100644 >>> --- a/mm/percpu.c >>> +++ b/mm/percpu.c >>> @@ -2296,6 +2296,9 @@ void free_percpu(void __percpu *ptr) >>> need_balance = true; >>> break; >>> } >>> + >>> + chunk->depopulated = false; >>> + pcpu_chunk_relocate(chunk, -1); >>> } else if (chunk != pcpu_first_chunk && chunk != pcpu_reserved_chunk && >>> !chunk->isolated && >>> (pcpu_nr_empty_pop_pages[pcpu_chunk_type(chunk)] > >>> >> Sure thing. >> >> I see much lower sideline chunks. 
In one such test run I saw zero occurrences >> of slidelined chunks >> >> Pasting the full logs as an example: >> >> BEFORE >> Percpu Memory Statistics >> Allocation Info: >> ---------------------------------------- >> unit_size : 655360 >> static_size : 608920 >> reserved_size : 0 >> dyn_size : 46440 >> atom_size : 65536 >> alloc_size : 655360 >> >> Global Stats: >> ---------------------------------------- >> nr_alloc : 9038 >> nr_dealloc : 6992 >> nr_cur_alloc : 2046 >> nr_max_alloc : 2200 >> nr_chunks : 3 >> nr_max_chunks : 3 >> min_alloc_size : 4 >> max_alloc_size : 1072 >> empty_pop_pages : 12 >> >> Per Chunk Stats: >> ---------------------------------------- >> Chunk: <- First Chunk >> nr_alloc : 1092 >> max_alloc_size : 1072 >> empty_pop_pages : 0 >> first_bit : 16247 >> free_bytes : 4 >> contig_bytes : 4 >> sum_frag : 4 >> max_frag : 4 >> cur_min_alloc : 4 >> cur_med_alloc : 8 >> cur_max_alloc : 1072 >> memcg_aware : 0 >> >> Chunk: >> nr_alloc : 594 >> max_alloc_size : 992 >> empty_pop_pages : 8 >> first_bit : 456 >> free_bytes : 645008 >> contig_bytes : 319984 >> sum_frag : 325024 >> max_frag : 318680 >> cur_min_alloc : 4 >> cur_med_alloc : 8 >> cur_max_alloc : 424 >> memcg_aware : 0 >> >> Chunk: >> nr_alloc : 360 >> max_alloc_size : 1072 >> empty_pop_pages : 4 >> first_bit : 26595 >> free_bytes : 506640 >> contig_bytes : 506540 >> sum_frag : 100 >> max_frag : 32 >> cur_min_alloc : 4 >> cur_med_alloc : 156 >> cur_max_alloc : 1072 >> memcg_aware : 1 >> >> >> AFTER >> Percpu Memory Statistics >> Allocation Info: >> ---------------------------------------- >> unit_size : 655360 >> static_size : 608920 >> reserved_size : 0 >> dyn_size : 46440 >> atom_size : 65536 >> alloc_size : 655360 >> >> Global Stats: >> ---------------------------------------- >> nr_alloc : 97046 >> nr_dealloc : 94304 >> nr_cur_alloc : 2742 >> nr_max_alloc : 90054 >> nr_chunks : 11 >> nr_max_chunks : 47 >> min_alloc_size : 4 >> max_alloc_size : 1072 >> empty_pop_pages : 18 >> >> Per Chunk Stats: >> ---------------------------------------- >> Chunk: <- First Chunk >> nr_alloc : 1092 >> max_alloc_size : 1072 >> empty_pop_pages : 0 >> first_bit : 16247 >> free_bytes : 4 >> contig_bytes : 4 >> sum_frag : 4 >> max_frag : 4 >> cur_min_alloc : 4 >> cur_med_alloc : 8 >> cur_max_alloc : 1072 >> memcg_aware : 0 >> >> Chunk: >> nr_alloc : 838 >> max_alloc_size : 1072 >> empty_pop_pages : 7 >> first_bit : 464 >> free_bytes : 640476 >> contig_bytes : 290672 >> sum_frag : 349804 >> max_frag : 304344 >> cur_min_alloc : 4 >> cur_med_alloc : 8 >> cur_max_alloc : 1072 >> memcg_aware : 0 >> >> Chunk: >> nr_alloc : 90 >> max_alloc_size : 1072 >> empty_pop_pages : 0 >> first_bit : 536 >> free_bytes : 595752 >> contig_bytes : 26164 >> sum_frag : 575132 >> max_frag : 26164 >> cur_min_alloc : 156 >> cur_med_alloc : 1072 >> cur_max_alloc : 1072 >> memcg_aware : 1 >> >> Chunk: >> nr_alloc : 90 >> max_alloc_size : 1072 >> empty_pop_pages : 0 >> first_bit : 0 >> free_bytes : 597428 >> contig_bytes : 26164 >> sum_frag : 596848 >> max_frag : 26164 >> cur_min_alloc : 156 >> cur_med_alloc : 312 >> cur_max_alloc : 1072 >> memcg_aware : 1 >> >> Chunk: >> nr_alloc : 92 >> max_alloc_size : 1072 >> empty_pop_pages : 0 >> first_bit : 0 >> free_bytes : 595284 >> contig_bytes : 26164 >> sum_frag : 590360 >> max_frag : 26164 >> cur_min_alloc : 156 >> cur_med_alloc : 312 >> cur_max_alloc : 1072 >> memcg_aware : 1 >> >> Chunk: >> nr_alloc : 92 >> max_alloc_size : 1072 >> empty_pop_pages : 0 >> first_bit : 0 >> free_bytes : 595284 >> contig_bytes : 26164 
>> sum_frag : 583768 >> max_frag : 26164 >> cur_min_alloc : 156 >> cur_med_alloc : 312 >> cur_max_alloc : 1072 >> memcg_aware : 1 >> >> Chunk: >> nr_alloc : 360 >> max_alloc_size : 1072 >> empty_pop_pages : 7 >> first_bit : 26595 >> free_bytes : 506640 >> contig_bytes : 506540 >> sum_frag : 100 >> max_frag : 32 >> cur_min_alloc : 4 >> cur_med_alloc : 156 >> cur_max_alloc : 1072 >> memcg_aware : 1 >> >> Chunk: >> nr_alloc : 12 >> max_alloc_size : 1072 >> empty_pop_pages : 3 >> first_bit : 0 >> free_bytes : 647524 >> contig_bytes : 563492 >> sum_frag : 57872 >> max_frag : 26164 >> cur_min_alloc : 156 >> cur_med_alloc : 312 >> cur_max_alloc : 1072 >> memcg_aware : 1 >> >> Chunk: >> nr_alloc : 0 >> max_alloc_size : 1072 >> empty_pop_pages : 1 >> first_bit : 0 >> free_bytes : 655360 >> contig_bytes : 655360 >> sum_frag : 0 >> max_frag : 0 >> cur_min_alloc : 0 >> cur_med_alloc : 0 >> cur_max_alloc : 0 >> memcg_aware : 1 >> >> Chunk (sidelined): >> nr_alloc : 72 >> max_alloc_size : 1072 >> empty_pop_pages : 0 >> first_bit : 0 >> free_bytes : 608344 >> contig_bytes : 145552 >> sum_frag : 590340 >> max_frag : 145552 >> cur_min_alloc : 156 >> cur_med_alloc : 312 >> cur_max_alloc : 1072 >> memcg_aware : 1 >> >> Chunk (sidelined): >> nr_alloc : 4 >> max_alloc_size : 1072 >> empty_pop_pages : 0 >> first_bit : 0 >> free_bytes : 652748 >> contig_bytes : 426720 >> sum_frag : 426720 >> max_frag : 426720 >> cur_min_alloc : 156 >> cur_med_alloc : 312 >> cur_max_alloc : 1072 >> memcg_aware : 1 >> >> > > Thank you Pratik for testing this and working with us to resolve this. I > greatly appreciate it! > > Thanks, > Dennis No worries at all, glad I could be of some help! Thank you, Pratik