Message ID | 20231001005659.2185316-1-riel@surriel.com (mailing list archive) |
---|---|
Headers | show |
Series | hugetlbfs: close race between MADV_DONTNEED and page fault | expand |
On Sat, 30 Sep 2023 20:55:47 -0400 riel@surriel.com wrote: > v5: somehow a __vma_private_lock(vma) test failed to make it from my tree into the v4 series, fix that > v4: fix unmap_vmas locking issue pointed out by Mike Kravetz, and resulting lockdep fallout > v3: fix compile error w/ lockdep and test case errors with patch 3 > v2: fix the locking bug found with the libhugetlbfs tests. > > Malloc libraries, like jemalloc and tcalloc, take decisions on when > to call madvise independently from the code in the main application. > > This sometimes results in the application page faulting on an address, > right after the malloc library has shot down the backing memory with > MADV_DONTNEED. > > Usually this is harmless, because we always have some 4kB pages > sitting around to satisfy a page fault. However, with hugetlbfs > systems often allocate only the exact number of huge pages that > the application wants. > > Due to TLB batching, hugetlbfs MADV_DONTNEED will free pages outside of > any lock taken on the page fault path, which can open up the following > race condition: > > CPU 1 CPU 2 > > MADV_DONTNEED > unmap page > shoot down TLB entry > page fault > fail to allocate a huge page > killed with SIGBUS > free page > > Fix that race by extending the hugetlb_vma_lock locking scheme to also > cover private hugetlb mappings (with resv_map), and pulling the locking > from __unmap_hugepage_final_range into helper functions called from > zap_page_range_single. This ensures page faults stay locked out of > the MADV_DONTNEED VMA until the huge pages have actually been freed. Didn't we decide that [1/3] and [2/3] should be cc:stable? > The third patch in the series is more of an RFC. Using the > invalidate_lock instead of the hugetlb_vma_lock greatly simplifies > the code, but at the cost of turning a per-VMA lock into a lock > per backing hugetlbfs file, which could slow things down when > multiple processes are mapping the same hugetlbfs file. "could slow things down" is testable-for? This third one I'd queue up for testing for a 6.7-rc1 merge, so I'll split the series apart. Not a problem, but it would be a little better if things were originally packaged that way.