Message ID | 20230622144210.2623299-1-ryan.roberts@arm.com
Series     | Transparent Contiguous PTEs for User Mappings
On Thu, Jun 22, 2023 at 11:00 PM Ryan Roberts <ryan.roberts@arm.com> wrote:
>
> Hi All,
>
> This is a series to opportunistically and transparently use contpte mappings
> (set the contiguous bit in ptes) for user memory when those mappings meet the
> requirements. It is part of a wider effort to improve performance of the 4K
> kernel with the aim of approaching the performance of the 16K kernel, but
> without breaking compatibility and without the associated increase in memory.
> It also benefits the 16K and 64K kernels by enabling 2M THP, since this is
> the contpte size for those kernels.
>
> Of course this is only one half of the change. We require the mapped physical
> memory to be the correct size and alignment for this to actually be useful
> (i.e. 64K for 4K pages, or 2M for 16K/64K pages). Fortunately folios are
> solving this problem for us. Filesystems that support it (XFS, AFS, EROFS,
> tmpfs) will allocate large folios up to the PMD size today, and more
> filesystems are coming. And the other half of my work, to enable the use of
> large folios for anonymous memory, aims to make contpte-sized folios
> prevalent for anonymous memory too.
>
>
> Dependencies
> ------------
>
> While this patch set has a complicated set of hard and soft dependencies, I
> wanted to split it out as best I could and kick off proper review
> independently.
>
> The series applies on top of these other patch sets, with a tree at:
> https://gitlab.arm.com/linux-arm/linux-rr/-/tree/features/granule_perf/contpte-lkml_v1
>
> v6.4-rc6
>   - base
>
> set_ptes()
>   - hard dependency
>   - Patch set from Matthew Wilcox to set multiple ptes with a single API call
>   - Allows the arch backend to apply contpte mappings more optimally
>   - https://lore.kernel.org/linux-mm/20230315051444.3229621-1-willy@infradead.org/
>
> ptep_get() pte encapsulation
>   - hard dependency
>   - Enabler series from me to ensure none of the core code ever directly
>     dereferences a pte_t that lies within a live page table.
>   - Enables gathering access/dirty bits from across the whole contpte range
>   - In mm-stable and linux-next at time of writing
>   - https://lore.kernel.org/linux-mm/d38dc237-6093-d4c5-993e-e8ffdd6cb6fa@arm.com/
>
> Report on physically contiguous memory in smaps
>   - soft dependency
>   - Enables visibility on how much memory is physically contiguous and how
>     much is contpte-mapped - useful for debug
>   - https://lore.kernel.org/linux-mm/20230613160950.3554675-1-ryan.roberts@arm.com/
>
> Additionally there are a couple of other dependencies:
>
> anonfolio
>   - soft dependency
>   - Ensures more anonymous memory is allocated in contpte-sized folios, so
>     needed to realize the performance improvements (this is the "other half"
>     mentioned above).
>   - RFC: https://lore.kernel.org/linux-mm/20230414130303.2345383-1-ryan.roberts@arm.com/
>   - Intending to post v1 shortly.
>
> exefolio
>   - soft dependency
>   - Tweaks readahead to ensure executable memory is in 64K-sized folios, so
>     needed to see the reduction in iTLB pressure.
>   - Don't intend to post this until we are further down the track with
>     contpte and anonfolio.
>
> Arm ARM clarification
>   - hard dependency
>   - The current wording disallows the fork() optimization in the final patch.
>   - Arm (ATG) have proposed tightening the wording to permit it.
>   - In conversation with partners to check this wouldn't cause problems for
>     any existing HW deployments.
>
> All of the _hard_ dependencies need to be resolved before this can be
> considered for merging.
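To make the relationship between the set_ptes() dependency above and contpte
mappings more concrete, here is a rough, illustrative sketch (kernel-style C,
not code from this series) of the kind of eligibility check an arch backend
could apply before setting the contiguous bit on a batch of ptes. The helper
name contpte_range_eligible() is hypothetical; CONT_PTES, CONT_PTE_SIZE,
pte_pfn() and IS_ALIGNED() follow existing arm64/kernel conventions.

```c
/*
 * Illustrative sketch only, not code from this series. Within a single
 * set_ptes() batch the pfns are consecutive by definition, so the check
 * reduces to size and alignment: the batch must cover whole, naturally
 * aligned contpte blocks in both virtual and physical address space.
 */
static bool contpte_range_eligible(unsigned long addr, pte_t pte,
				   unsigned int nr)
{
	unsigned long pfn = pte_pfn(pte);

	/* A whole number of contpte blocks (e.g. 16 x 4K = 64K)... */
	if (nr % CONT_PTES)
		return false;

	/* ...starting at a virtual address aligned to the block size... */
	if (!IS_ALIGNED(addr, CONT_PTE_SIZE))
		return false;

	/* ...and backed by physical memory with the same alignment. */
	if (!IS_ALIGNED(pfn, CONT_PTES))
		return false;

	return true;
}
```

An arm64 set_ptes() backend could then apply pte_mkcont() to each entry of an
eligible block and fall back to ordinary ptes otherwise, which is what makes
the scheme transparent to core mm code.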
>
>
> Performance
> -----------
>
> Below results show 2 benchmarks; kernel compilation and Speedometer 2.0 (a
> javascript benchmark running in Chromium). Both cases are running on Ampere
> Altra with 1 NUMA node enabled, Ubuntu 22.04 and XFS filesystem. Each
> benchmark is repeated 15 times over 5 reboots and averaged.
>
> All improvements are relative to baseline-4k. anonfolio and exefolio are as
> described above. contpte is this series. (Note that exefolio only gives an
> improvement because contpte is already in place).
>
> Kernel Compilation (smaller is better):
>
> | kernel       | real-time | kern-time | user-time |
> |:-------------|----------:|----------:|----------:|
> | baseline-4k  |      0.0% |      0.0% |      0.0% |
> | anonfolio    |     -5.4% |    -46.0% |     -0.3% |
> | contpte      |     -6.8% |    -45.7% |     -2.1% |
> | exefolio     |     -8.4% |    -46.4% |     -3.7% |

Sorry, I am a bit confused. In the exefolio case, is anonfolio included, or
does it only have large cont-pte folios for the exe code? In other words,
does the 8.4% improvement come from iTLB miss reduction only, or from both
dTLB and iTLB miss reduction?

> | baseline-16k |     -8.7% |    -49.2% |     -3.7% |
> | baseline-64k |    -10.5% |    -66.0% |     -3.5% |
>
> Speedometer 2.0 (bigger is better):
>
> | kernel       | runs_per_min |
> |:-------------|-------------:|
> | baseline-4k  |         0.0% |
> | anonfolio    |         1.2% |
> | contpte      |         3.1% |
> | exefolio     |         4.2% |

Same question as above.

> | baseline-16k |         5.3% |
>
> I've also run Speedometer 2.0 on Pixel 6 with an Ubuntu SW stack and see
> similar gains.
>
> I've also verified that running the contpte changes without anonfolio and
> exefolio does not cause any regression vs baseline-4k.
>
>
> Opens
> -----
>
> The only potential issue that I see right now is that due to there only
> being 1 access/dirty bit per contpte range, if a single page in the range is
> accessed/dirtied then all the adjacent pages are reported as accessed/dirtied
> too. Access/dirty is managed by the kernel per _folio_, so this information
> gets collapsed down anyway, and nothing changes there. However, the per
> _page_ access/dirty information is reported through pagemap to user space.
> I'm not sure if this would/should be considered a break? Thoughts?
>
> Thanks,
> Ryan

Thanks
Barry
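The "Opens" question above falls straight out of how a contpte-aware
ptep_get() has to behave: the hardware tracks one access/dirty state per
contiguous block, so the only safe value to report for any single entry is
the OR of the bits across the whole block. A rough sketch of that idea
follows (again illustrative, not the code from the series; __ptep_get()
stands in for the arch's raw pte read and contpte_align_down() for a
hypothetical helper returning the first entry of the block):

```c
/*
 * Illustrative sketch only. Fold the access/dirty information from every
 * entry of the contpte block into the value returned for this entry.
 */
static pte_t contpte_ptep_get(pte_t *ptep, pte_t orig_pte)
{
	pte_t *first = contpte_align_down(ptep);
	unsigned int i;

	for (i = 0; i < CONT_PTES; i++, first++) {
		pte_t pte = __ptep_get(first);

		if (pte_dirty(pte))
			orig_pte = pte_mkdirty(orig_pte);
		if (pte_young(pte))
			orig_pte = pte_mkyoung(orig_pte);
	}

	return orig_pte;
}
```

Because the same folded value is reported for every entry in the block, a
single accessed or dirtied page makes its neighbours look accessed or dirtied
as well once the information reaches user space via pagemap, which is exactly
the per-page reporting question Ryan raises.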
On 10/07/2023 13:05, Barry Song wrote:
> On Thu, Jun 22, 2023 at 11:00 PM Ryan Roberts <ryan.roberts@arm.com> wrote:
>>
>> Hi All,
>>
[...]
>>
>> Performance
>> -----------
>>
>> Below results show 2 benchmarks; kernel compilation and Speedometer 2.0 (a
>> javascript benchmark running in Chromium). Both cases are running on Ampere
>> Altra with 1 NUMA node enabled, Ubuntu 22.04 and XFS filesystem. Each
>> benchmark is repeated 15 times over 5 reboots and averaged.
>>
>> All improvements are relative to baseline-4k. anonfolio and exefolio are as
>> described above. contpte is this series. (Note that exefolio only gives an
>> improvement because contpte is already in place).
>>
>> Kernel Compilation (smaller is better):
>>
>> | kernel       | real-time | kern-time | user-time |
>> |:-------------|----------:|----------:|----------:|
>> | baseline-4k  |      0.0% |      0.0% |      0.0% |
>> | anonfolio    |     -5.4% |    -46.0% |     -0.3% |
>> | contpte      |     -6.8% |    -45.7% |     -2.1% |
>> | exefolio     |     -8.4% |    -46.4% |     -3.7% |
>
> Sorry, I am a bit confused. In the exefolio case, is anonfolio included, or
> does it only have large cont-pte folios for the exe code? In other words,
> does the 8.4% improvement come from iTLB miss reduction only, or from both
> dTLB and iTLB miss reduction?

The anonfolio -> contpte -> exefolio results are incremental. So:

  anonfolio: baseline-4k + anonfolio changes
  contpte:   anonfolio + contpte changes
  exefolio:  contpte + exefolio changes

So yes, exefolio includes anonfolio. Sorry for the confusion.

>
>> | baseline-16k |     -8.7% |    -49.2% |     -3.7% |
>> | baseline-64k |    -10.5% |    -66.0% |     -3.5% |
>>
>> Speedometer 2.0 (bigger is better):
>>
>> | kernel       | runs_per_min |
>> |:-------------|-------------:|
>> | baseline-4k  |         0.0% |
>> | anonfolio    |         1.2% |
>> | contpte      |         3.1% |
>> | exefolio     |         4.2% |
>
> Same question as above.

Same answer as above.

Thanks,
Ryan

>
>> | baseline-16k |         5.3% |
>>