Message ID | 20201108141113.65450-1-songmuchun@bytedance.com (mailing list archive) |
---|---|
Series | Free some vmemmap pages of hugetlb page |
Thanks for continuing to work on this, Muchun!

On 11/8/20 6:10 AM, Muchun Song wrote:
...
> For tail pages, the value of compound_head is the same. So we can reuse
> first page of tail page structs. We map the virtual addresses of the
> remaining 6 pages of tail page structs to the first tail page struct,
> and then free these 6 pages. Therefore, we need to reserve at least 2
> pages as vmemmap areas.
>
> When a hugetlbpage is freed to the buddy system, we should allocate six
> pages for vmemmap pages and restore the previous mapping relationship.
>
> If we use the 1G hugetlbpage, we can save 4095 pages. This is a very
> substantial gain.

Is that 4095 number accurate? Are we not using two pages of struct pages
as in the 2MB case?

Also, because we are splitting the huge page mappings in the vmemmap,
additional PTE pages will need to be allocated. Therefore, some additional
page table pages may need to be allocated so that we can free the pages
of struct pages. The net savings may be less than what is stated above.

Perhaps this should mention that allocation of additional page table pages
may be required?

...
> Because there are vmemmap page tables reconstruction on the freeing/allocating
> path, it increases some overhead. Here are some overhead analysis.
>
> 1) Allocating 10240 2MB hugetlb pages.
>
>    a) With this patch series applied:
>    # time echo 10240 > /proc/sys/vm/nr_hugepages
>
>    real     0m0.166s
>    user     0m0.000s
>    sys      0m0.166s
>
>    # bpftrace -e 'kprobe:alloc_fresh_huge_page { @start[tid] = nsecs; } kretprobe:alloc_fresh_huge_page /@start[tid]/ { @latency = hist(nsecs - @start[tid]); delete(@start[tid]); }'
>    Attaching 2 probes...
>
>    @latency:
>    [8K, 16K)           8360 |@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@|
>    [16K, 32K)          1868 |@@@@@@@@@@@                                         |
>    [32K, 64K)            10 |                                                    |
>    [64K, 128K)            2 |                                                    |
>
>    b) Without this patch series:
>    # time echo 10240 > /proc/sys/vm/nr_hugepages
>
>    real     0m0.066s
>    user     0m0.000s
>    sys      0m0.066s
>
>    # bpftrace -e 'kprobe:alloc_fresh_huge_page { @start[tid] = nsecs; } kretprobe:alloc_fresh_huge_page /@start[tid]/ { @latency = hist(nsecs - @start[tid]); delete(@start[tid]); }'
>    Attaching 2 probes...
>
>    @latency:
>    [4K, 8K)           10176 |@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@|
>    [8K, 16K)             62 |                                                    |
>    [16K, 32K)             2 |                                                    |
>
>    Summarize: this feature is about ~2x slower than before.
>
> 2) Freeing 10240 2MB hugetlb pages.
>
>    a) With this patch series applied:
>    # time echo 0 > /proc/sys/vm/nr_hugepages
>
>    real     0m0.004s
>    user     0m0.000s
>    sys      0m0.002s
>
>    # bpftrace -e 'kprobe:__free_hugepage { @start[tid] = nsecs; } kretprobe:__free_hugepage /@start[tid]/ { @latency = hist(nsecs - @start[tid]); delete(@start[tid]); }'
>    Attaching 2 probes...
>
>    @latency:
>    [16K, 32K)         10240 |@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@|
>
>    b) Without this patch series:
>    # time echo 0 > /proc/sys/vm/nr_hugepages
>
>    real     0m0.077s
>    user     0m0.001s
>    sys      0m0.075s
>
>    # bpftrace -e 'kprobe:__free_hugepage { @start[tid] = nsecs; } kretprobe:__free_hugepage /@start[tid]/ { @latency = hist(nsecs - @start[tid]); delete(@start[tid]); }'
>    Attaching 2 probes...
>
>    @latency:
>    [4K, 8K)            9950 |@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@|
>    [8K, 16K)            287 |@                                                   |
>    [16K, 32K)             3 |                                                    |
>
>    Summarize: The overhead of __free_hugepage is about ~2-4x slower than before.
>               But according to the allocation test above, I think that here is
>               also ~2x slower than before.
>
>               But why the 'real' time of patched is smaller than before? Because
>               In this patch series, the freeing hugetlb is asynchronous (through
>               kworker).
>
> Although the overhead has increased. But the overhead is not on the
> allocating/freeing of each hugetlb page, it is only once when we reserve
> some hugetlb pages through /proc/sys/vm/nr_hugepages. Once the reservation
> is successful, the subsequent allocating, freeing and using are the same
> as before (not patched). So I think that the overhead is acceptable.

Thank you for benchmarking. There are still some instances where huge pages
are allocated 'on the fly' instead of being pulled from the pool. Michal
pointed out the case of page migration. It is also possible for someone to
use hugetlbfs without pre-allocating huge pages to the pool. I remember the
use case pointed out in commit 099730d67417. It says, "I have a hugetlbfs
user which is never explicitly allocating huge pages with 'nr_hugepages'.
They only set 'nr_overcommit_hugepages' and then let the pages be allocated
from the buddy allocator at fault time." In this case, I suspect they were
using 'page fault' allocation for initialization much like someone using
/proc/sys/vm/nr_hugepages. So, the overhead may not be as noticeable.
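
[Editor's note] As a rough cross-check of the "~2x slower" allocation summary quoted above, the following Python sketch estimates a mean latency from the two alloc_fresh_huge_page histograms and compares it with the ratio of the reported 'real' times. It assumes every sample sits at the midpoint of its power-of-two bucket, which is an approximation introduced here and not part of the original analysis. The names PATCHED, UNPATCHED and estimated_mean_ns are illustrative only.

```python
# Bucket lower bounds and counts copied from the bpftrace histograms quoted
# above for alloc_fresh_huge_page (10240 2MB pages in each run).
# "K" is taken as 1024 ns; the ratio below does not depend on that choice.
PATCHED = {8 * 1024: 8360, 16 * 1024: 1868, 32 * 1024: 10, 64 * 1024: 2}
UNPATCHED = {4 * 1024: 10176, 8 * 1024: 62, 16 * 1024: 2}


def estimated_mean_ns(hist):
    """Approximate mean latency, ASSUMING samples sit at each bucket's midpoint."""
    total = sum(hist.values())
    # A power-of-two bucket [lo, 2 * lo) has midpoint 1.5 * lo.
    return sum(1.5 * lo * count for lo, count in hist.items()) / total


# Ratio of estimated per-call latency, patched vs. unpatched.
alloc_ratio = estimated_mean_ns(PATCHED) / estimated_mean_ns(UNPATCHED)

# Ratio of the 'real' times reported by `time` for the same two runs.
wall_ratio = 0.166 / 0.066

print(f"histogram estimate: {alloc_ratio:.2f}x, wall clock: {wall_ratio:.2f}x")
```

Both ratios land between 2x and 3x, which is in line with the "about ~2x" summary, bearing in mind that log2 histograms only resolve latency to within a factor of two.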
On Wed, Nov 11, 2020 at 3:23 AM Mike Kravetz <mike.kravetz@oracle.com> wrote:
>
> Thanks for continuing to work on this, Muchun!
>
> On 11/8/20 6:10 AM, Muchun Song wrote:
> ...
> > For tail pages, the value of compound_head is the same. So we can reuse
> > first page of tail page structs. We map the virtual addresses of the
> > remaining 6 pages of tail page structs to the first tail page struct,
> > and then free these 6 pages. Therefore, we need to reserve at least 2
> > pages as vmemmap areas.
> >
> > When a hugetlbpage is freed to the buddy system, we should allocate six
> > pages for vmemmap pages and restore the previous mapping relationship.
> >
> > If we use the 1G hugetlbpage, we can save 4095 pages. This is a very
> > substantial gain.
>
> Is that 4095 number accurate? Are we not using two pages of struct pages
> as in the 2MB case?

Oh, yeah, this should be 4094, minus the page tables. For a 1GB
HugeTLB page, it should be 4086 pages. Thanks for pointing out this
problem.

>
> Also, because we are splitting the huge page mappings in the vmemmap,
> additional PTE pages will need to be allocated. Therefore, some additional
> page table pages may need to be allocated so that we can free the pages
> of struct pages. The net savings may be less than what is stated above.
>
> Perhaps this should mention that allocation of additional page table pages
> may be required?

Yeah, you are right. In a later version of the patch, I will rework the
analysis here to make it clearer and more accurate.

>
> ...
> > Because there are vmemmap page tables reconstruction on the freeing/allocating
> > path, it increases some overhead. Here are some overhead analysis.
> >
> > 1) Allocating 10240 2MB hugetlb pages.
> >
> >    a) With this patch series applied:
> >    # time echo 10240 > /proc/sys/vm/nr_hugepages
> >
> >    real     0m0.166s
> >    user     0m0.000s
> >    sys      0m0.166s
> >
> >    # bpftrace -e 'kprobe:alloc_fresh_huge_page { @start[tid] = nsecs; } kretprobe:alloc_fresh_huge_page /@start[tid]/ { @latency = hist(nsecs - @start[tid]); delete(@start[tid]); }'
> >    Attaching 2 probes...
> >
> >    @latency:
> >    [8K, 16K)           8360 |@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@|
> >    [16K, 32K)          1868 |@@@@@@@@@@@                                         |
> >    [32K, 64K)            10 |                                                    |
> >    [64K, 128K)            2 |                                                    |
> >
> >    b) Without this patch series:
> >    # time echo 10240 > /proc/sys/vm/nr_hugepages
> >
> >    real     0m0.066s
> >    user     0m0.000s
> >    sys      0m0.066s
> >
> >    # bpftrace -e 'kprobe:alloc_fresh_huge_page { @start[tid] = nsecs; } kretprobe:alloc_fresh_huge_page /@start[tid]/ { @latency = hist(nsecs - @start[tid]); delete(@start[tid]); }'
> >    Attaching 2 probes...
> >
> >    @latency:
> >    [4K, 8K)           10176 |@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@|
> >    [8K, 16K)             62 |                                                    |
> >    [16K, 32K)             2 |                                                    |
> >
> >    Summarize: this feature is about ~2x slower than before.
> >
> > 2) Freeing 10240 2MB hugetlb pages.
> >
> >    a) With this patch series applied:
> >    # time echo 0 > /proc/sys/vm/nr_hugepages
> >
> >    real     0m0.004s
> >    user     0m0.000s
> >    sys      0m0.002s
> >
> >    # bpftrace -e 'kprobe:__free_hugepage { @start[tid] = nsecs; } kretprobe:__free_hugepage /@start[tid]/ { @latency = hist(nsecs - @start[tid]); delete(@start[tid]); }'
> >    Attaching 2 probes...
> >
> >    @latency:
> >    [16K, 32K)         10240 |@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@|
> >
> >    b) Without this patch series:
> >    # time echo 0 > /proc/sys/vm/nr_hugepages
> >
> >    real     0m0.077s
> >    user     0m0.001s
> >    sys      0m0.075s
> >
> >    # bpftrace -e 'kprobe:__free_hugepage { @start[tid] = nsecs; } kretprobe:__free_hugepage /@start[tid]/ { @latency = hist(nsecs - @start[tid]); delete(@start[tid]); }'
> >    Attaching 2 probes...
> >
> >    @latency:
> >    [4K, 8K)            9950 |@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@|
> >    [8K, 16K)            287 |@                                                   |
> >    [16K, 32K)             3 |                                                    |
> >
> >    Summarize: The overhead of __free_hugepage is about ~2-4x slower than before.
> >               But according to the allocation test above, I think that here is
> >               also ~2x slower than before.
> >
> >               But why the 'real' time of patched is smaller than before? Because
> >               In this patch series, the freeing hugetlb is asynchronous (through
> >               kworker).
> >
> > Although the overhead has increased. But the overhead is not on the
> > allocating/freeing of each hugetlb page, it is only once when we reserve
> > some hugetlb pages through /proc/sys/vm/nr_hugepages. Once the reservation
> > is successful, the subsequent allocating, freeing and using are the same
> > as before (not patched). So I think that the overhead is acceptable.
>
> Thank you for benchmarking. There are still some instances where huge pages
> are allocated 'on the fly' instead of being pulled from the pool. Michal
> pointed out the case of page migration. It is also possible for someone to
> use hugetlbfs without pre-allocating huge pages to the pool. I remember the
> use case pointed out in commit 099730d67417. It says, "I have a hugetlbfs
> user which is never explicitly allocating huge pages with 'nr_hugepages'.
> They only set 'nr_overcommit_hugepages' and then let the pages be allocated
> from the buddy allocator at fault time." In this case, I suspect they were
> using 'page fault' allocation for initialization much like someone using
> /proc/sys/vm/nr_hugepages. So, the overhead may not be as noticeable.

Thanks for pointing out this use case.

>
> --
> Mike Kravetz
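
[Editor's note] The corrected savings figures in this exchange (4094 pages freed, 4086 net for a 1GB page) can be reproduced with a short Python sketch. It rests on assumptions that the thread does not spell out: 4KB base pages, a 64-byte struct page, a vmemmap originally mapped with 2MB PMD-level mappings, and one new PTE page for each such mapping that has to be split. The function and constant names are illustrative, not kernel identifiers.

```python
PAGE_SIZE = 4096             # ASSUMPTION: 4KB base pages
STRUCT_PAGE_SIZE = 64        # ASSUMPTION: sizeof(struct page) == 64 bytes
PMD_SIZE = 2 * 1024 * 1024   # ASSUMPTION: vmemmap is mapped with 2MB PMD entries
RESERVED_VMEMMAP_PAGES = 2   # from the thread: at least 2 pages stay mapped


def vmemmap_pages(hugepage_size):
    """Pages of struct page needed to describe one huge page."""
    base_pages = hugepage_size // PAGE_SIZE
    return base_pages * STRUCT_PAGE_SIZE // PAGE_SIZE


def freed_vmemmap_pages(hugepage_size):
    """vmemmap pages that can be freed after remapping onto the reserved ones."""
    return vmemmap_pages(hugepage_size) - RESERVED_VMEMMAP_PAGES


def pte_pages_needed(hugepage_size):
    """Upper bound on PTE pages allocated when splitting the PMD mappings.

    ASSUMPTION: one PTE page per 2MB of vmemmap, rounded up. For small huge
    pages a single PTE page may be shared by neighbouring huge pages, so the
    real per-page cost can be lower than this bound.
    """
    vmemmap_bytes = vmemmap_pages(hugepage_size) * PAGE_SIZE
    return (vmemmap_bytes + PMD_SIZE - 1) // PMD_SIZE


def net_saved_pages(hugepage_size):
    """Freed vmemmap pages minus the page table pages spent to free them."""
    return freed_vmemmap_pages(hugepage_size) - pte_pages_needed(hugepage_size)


for name, size in (("2MB", 2 << 20), ("1GB", 1 << 30)):
    print(name, freed_vmemmap_pages(size), pte_pages_needed(size),
          net_saved_pages(size))
```

Under these assumptions a 1GB huge page has 4096 vmemmap pages, of which 4094 can be freed at the cost of 8 PTE pages, which matches the 4086 figure given in the reply.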