Message ID: 20240829165627.2256514-1-david@redhat.com
Series: mm: MM owner tracking for large folios (!hugetlb) + CONFIG_NO_PAGE_MAPCOUNT
On 29.08.24 18:56, David Hildenbrand wrote:
> RMAP overhaul and optimizations, PTE batching, large mapcount,
> folio_likely_mapped_shared() introduction and optimizations, page_mapcount
> cleanups and preparations ... it's been quite some work to get to this
> point.
>
> Next up is being able to identify -- without false positives, without
> page-mapcounts and without page table/rmap scanning -- whether a
> large folio is "mapped exclusively" into a single MM, and using that
> information to implement Copy-on-Write reuse and to improve
> folio_likely_mapped_shared() for large folios.
>
> ... and based on that, finally introducing a kernel config option that
> lets us not use+maintain per-page mapcounts in large folios, improving
> performance of (un)map operations today, taking one step towards
> supporting large folios > PMD_SIZE, and preparing for the bright future
> where we might no longer have a mapcount per page at all.
>
> The bigger picture was presented at LSF/MM [1].
>
> This series is effectively a follow-up on my early work from last
> year [2], which proposed a precise way to identify whether a large folio is
> "mapped shared" into multiple MMs or "mapped exclusively" into a single MM.
>
> While that advanced approach has been simplified and optimized in the
> meantime, let's start with something simpler first -- "certainly mapped
> exclusive" vs. "maybe mapped shared" -- so we can start learning about
> the effects and TODOs that some of the implied changes of losing
> per-page mapcounts have.
>
> I have plans to exchange the simple approach used in this series at some
> point with the advanced approach, but one important thing to learn is
> whether the imprecision in the simple approach is relevant in practice.
>
> 64BIT only, and unless enabled in kconfig, this series should for now
> not have any impact.
>
> 1) Patch Organization
> =====================
>
> Patch #1 -> #4: make more room on 64BIT in order-1 folios
>
> Patch #5 -> #7: prepare for MM owner tracking of large folios
>
> Patch #8: implement a simple MM owner tracking approach for large folios
>
> Patch #9: simple optimization
>
> Patch #10: COW reuse for PTE-mapped anon THP
>
> Patch #11 -> #17: introduce and implement CONFIG_NO_PAGE_MAPCOUNT
>
>
> 2) MM owner tracking
> ====================
>
> Similar to my advanced approach [2], we assign each MM a unique 20-bit ID
> ("MM ID"), to be able to squeeze more information into our folios.
>
> Each large folio can store two MM-ID+mapcount combinations:
> * mm0_id + mm0_mapcount
> * mm1_id + mm1_mapcount
>
> Combined with the large mapcount, we can reliably identify whether one
> of these MMs is the current owner (-> owns all mappings) or even holds
> all folio references (-> owns all mappings, and all references are from
> mappings).
>
> Stored MM IDs can only change if the corresponding mapcount is logically
> 0, and if the folio is currently "mapped exclusively".
>
> As long as only two MMs map folio pages at a time, we can reliably identify
> whether a large folio is "mapped shared" or "mapped exclusively"; the
> approach is precise.
>
> Any MM mapping the folio while two other MMs are already mapping it
> will lead to a "mapped shared" detection, even after all other MMs stopped
> mapping the folio and it is actually "mapped exclusively": we can have
> false positives but never false negatives when detecting "mapped shared".
>
> That is where the approach gets imprecise.
>
> For now, we use a bit-spinlock to sync the large mapcount + MM IDs + MM
> mapcounts, and make sure we keep the machinery fast so as not to degrade
> (un)map performance too much: for example, we make sure to only use a
> single atomic (when grabbing the bit-spinlock), like we already do
> when updating the large mapcount.
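[Editor's note: the two-slot scheme quoted above can be modeled in plain userspace C. This is only a sketch of the idea, not the kernel implementation; all names (struct folio_track, folio_track_map(), folio_maybe_mapped_shared()) and the locking-free structure are invented for illustration, and the sticky "maybe shared" flag stands in for what the real bits in the folio encode.]

```c
/*
 * Userspace model of the two-slot MM-ID tracking idea: a folio tracks at
 * most two MMs (ID + per-MM mapcount) plus a large mapcount. With <= 2
 * concurrent MMs the "mapped exclusively" answer is precise; a third
 * concurrent MM makes the answer sticky-imprecise ("maybe mapped shared",
 * false positives but never false negatives). Names are hypothetical.
 */
#include <stdbool.h>
#include <stdint.h>

struct folio_track {
	int large_mapcount;	/* total number of page mappings */
	uint32_t mm_id[2];	/* the two tracked (20-bit) MM IDs */
	int mm_mapcount[2];	/* per-MM mapcounts for the two slots */
	bool maybe_shared;	/* sticky once a third MM mapped concurrently */
};

/* MM 'mm_id' (nonzero) maps 'nr' pages of the folio. */
static void folio_track_map(struct folio_track *f, uint32_t mm_id, int nr)
{
	f->large_mapcount += nr;
	for (int i = 0; i < 2; i++) {
		if (f->mm_id[i] == mm_id) {
			f->mm_mapcount[i] += nr;
			return;
		}
	}
	/* Reuse a slot whose mapcount dropped to 0 (simplified rule). */
	for (int i = 0; i < 2; i++) {
		if (f->mm_mapcount[i] == 0) {
			f->mm_id[i] = mm_id;
			f->mm_mapcount[i] = nr;
			return;
		}
	}
	/* Third concurrent MM: we can no longer track it precisely. */
	f->maybe_shared = true;
}

static void folio_track_unmap(struct folio_track *f, uint32_t mm_id, int nr)
{
	f->large_mapcount -= nr;
	for (int i = 0; i < 2; i++)
		if (f->mm_id[i] == mm_id)
			f->mm_mapcount[i] -= nr;
}

/* "Mapped exclusively" iff one tracked MM owns all mappings. */
static bool folio_maybe_mapped_shared(const struct folio_track *f)
{
	if (f->maybe_shared)
		return true;
	for (int i = 0; i < 2; i++)
		if (f->large_mapcount > 0 &&
		    f->mm_mapcount[i] == f->large_mapcount)
			return false;
	return true;
}
```

In the real series, these counters live packed in the folio and are serialized by the bit-spinlock mentioned above; the model only shows why two slots suffice for precision with two MMs and why a third concurrent MM produces a permanent false positive.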
>
> In the future, we might be able to use an arch_spin_lock(), but that's
> future work.
>
>
> 3) CONFIG_NO_PAGE_MAPCOUNT
> ==========================
>
> Patches #11 -> #17 spell out and document what exactly is affected when
> not maintaining the per-page mapcounts in large folios anymore.
>
> For example, as we cannot maintain folio->_nr_pages_mapped anymore when
> (un)mapping pages, we'll account a complete folio as mapped as soon as a
> single page is mapped.
>
> As another example, we might now under-estimate the USS (Unique Set Size)
> of a process, but never over-estimate it.
>
> With a more elaborate approach for MM-owner tracking like #1, some things
> could be improved (e.g., USS to some degree), but some things just cannot
> be handled like we used to without these per-page mapcounts (e.g.,
> folio->_nr_pages_mapped).
>
>
> 4) Performance
> ==============
>
> The following kernel config combinations are possible:
>
> * Base: CONFIG_PAGE_MAPCOUNT
>   -> (existing) page-mapcount tracking
> * MM-ID: CONFIG_MM_ID && CONFIG_PAGE_MAPCOUNT
>   -> page-mapcount + MM-ID tracking
> * No-Mapcount: CONFIG_MM_ID && CONFIG_NO_PAGE_MAPCOUNT
>   -> MM-ID tracking
>
>
> I ran my PTE-mapped-THP microbenchmarks [3] and vm-scalability on a machine
> with two NUMA nodes, each with a 10-core Intel(R) Xeon(R) Silver 4210R CPU @
> 2.40GHz and 16 GiB of memory.
>
> 4.1) PTE-mapped-THP microbenchmarks
> -----------------------------------
>
> All benchmarks allocate 1 GiB of THPs of a given size, to then fork()/
> munmap()/... PMD-sized THPs are mapped by PTEs first.
>
> Numbers are increase (+) / reduction (-) in runtime. Reduction (-) is
> good. "Base" is the baseline.
>
> munmap: munmap() the allocated memory.
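[Editor's note: the "account a complete folio as mapped if a single page is mapped" consequence can be illustrated with a toy accounting model. This is a sketch with invented names (accounted_pages_per_page(), accounted_pages_folio_level()); the real kernel accounting works on NR_ANON_MAPPED/NR_FILE_MAPPED counters, not these helpers.]

```c
/*
 * Toy model of the accounting difference: with per-page mapcounts we can
 * account exactly the pages that are mapped; without them, a folio is
 * accounted in full as soon as any of its pages is mapped, so "mapped"
 * statistics can only over-account, and USS can only be under-estimated.
 */
#define FOLIO_NR_PAGES 16	/* e.g., a 64 KiB folio of 4 KiB pages */

/* CONFIG_PAGE_MAPCOUNT-style: count pages with a nonzero mapcount. */
static int accounted_pages_per_page(const int page_mapcount[FOLIO_NR_PAGES])
{
	int nr = 0;

	for (int i = 0; i < FOLIO_NR_PAGES; i++)
		if (page_mapcount[i] > 0)
			nr++;
	return nr;
}

/* CONFIG_NO_PAGE_MAPCOUNT-style: all-or-nothing per folio. */
static int accounted_pages_folio_level(int large_mapcount)
{
	return large_mapcount > 0 ? FOLIO_NR_PAGES : 0;
}
```

A folio with a single mapped page is accounted as 1 page in the first model but as all 16 pages in the second, which is the over-accounting (and USS under-estimation) trade-off described above.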
>
> Folio Size | MM-ID | No-Mapcount
> --------------------------------
>   16 KiB   |  2 %  |   -8 %
>   32 KiB   |  3 %  |   -9 %
>   64 KiB   |  4 %  |  -16 %
>  128 KiB   |  3 %  |  -17 %
>  256 KiB   |  1 %  |  -23 %
>  512 KiB   |  1 %  |  -26 %
> 1024 KiB   |  0 %  |  -29 %
> 2048 KiB   |  0 %  |  -31 %
>
> -> 32-128 KiB with MM-ID are a bit unexpected: we would expect to see the
>    worst case with the smallest size (16 KiB). But for these sizes the
>    STDEV is also between 1% and 2%, in contrast to the others (< 1 %).
>    Maybe some weird interaction with PCP/buddy.
>
> fork: fork()
>
> Folio Size | MM-ID | No-Mapcount
> --------------------------------
>   16 KiB   |  4 %  |   -9 %
>   32 KiB   |  1 %  |  -12 %
>   64 KiB   |  0 %  |  -15 %
>  128 KiB   |  0 %  |  -15 %
>  256 KiB   |  0 %  |  -16 %
>  512 KiB   |  0 %  |  -16 %
> 1024 KiB   |  0 %  |  -17 %
> 2048 KiB   | -1 %  |  -21 %
>
> -> Slight slowdown with MM-ID for the smallest folio size (more in line
>    with what we expect, in contrast to munmap()).
>
> cow-byte: fork() and keep the child running. Write one byte to each
> individual page, measuring the duration of all writes.
>
> Folio Size | MM-ID | No-Mapcount
> --------------------------------
>   16 KiB   |  0 %  |  0 %
>   32 KiB   |  0 %  |  0 %
>   64 KiB   |  0 %  |  0 %
>  128 KiB   |  0 %  |  0 %
>  256 KiB   |  0 %  |  0 %
>  512 KiB   |  0 %  |  0 %
> 1024 KiB   |  0 %  |  0 %
> 2048 KiB   |  0 %  |  0 %
>
> -> All other overhead dominates, even when effectively unmapping
>    single pages of large folios when replacing them by a copy during write
>    faults. No change, which is great!
>
> reuse-byte: fork() and wait until the child quits. Write one byte to each
> individual page, measuring the duration of all writes.
>
> Folio Size | MM-ID | No-Mapcount
> --------------------------------
>   16 KiB   | -66 % |  -66 %
>   32 KiB   | -65 % |  -65 %
>   64 KiB   | -64 % |  -64 %
>  128 KiB   | -64 % |  -64 %
>  256 KiB   | -64 % |  -64 %
>  512 KiB   | -64 % |  -64 %
> 1024 KiB   | -64 % |  -64 %
> 2048 KiB   | -64 % |  -64 %
>
> -> No surprise, we reuse all pages instead of copying them.
>
> child-reuse-byte: fork() and unmap the memory in the parent. Write one
> byte to each individual page in the child, measuring the duration of all
> writes.
>
> Folio Size | MM-ID | No-Mapcount
> --------------------------------
>   16 KiB   | -66 % |  -66 %
>   32 KiB   | -65 % |  -65 %
>   64 KiB   | -64 % |  -64 %
>  128 KiB   | -64 % |  -64 %
>  256 KiB   | -64 % |  -64 %
>  512 KiB   | -64 % |  -64 %
> 1024 KiB   | -64 % |  -64 %
> 2048 KiB   | -64 % |  -64 %
>
> -> Same thing, we reuse all pages instead of copying them.
>
>
> For 4 KiB, there is no change in any benchmark, as expected.
>
>
> 4.2) vm-scalability
> -------------------
>
> For now I only ran anon COW tests. I use 1 GiB per child process and
> one child per core (-> 20).
>
> case-anon-cow-rand: random writes
>
> There is effectively no change (< 0.6% throughput difference).
>
> case-anon-cow-seq: sequential writes
>
> MM-ID has up to 2% *lower* throughput than Base, not really correlating
> with folio size. The difference is almost as large as the STDEV (1% - 2%),
> though. It looks like there is a very slight effective slowdown.
>
> No-Mapcount has up to 3% *higher* throughput than Base, also not really
> correlating with folio size. However, here too the difference is almost
> as large as the STDEV (up to 2%). It looks like there is a very slight
> effective speedup.
>
> In summary, no earth-shattering slowdown with MM-ID (and we just recently
> optimized folio->_nr_pages_mapped to give us some speedup :) ), and
> another nice improvement with No-Mapcount.
>
>
> I did a bunch of cross-compiles, and the build bots turned out very helpful
> over the last months. I did quite some testing with LTP and selftests,
> but x86-64 only.

Gentle ping. I might soon have capacity to continue working on this. If
there is no further feedback I'll rebase and resend.