Message ID | 20220113180308.15610-1-mike.kravetz@oracle.com (mailing list archive) |
---|---|
Headers | show |
Series | Add hugetlb MADV_DONTNEED support | expand |
On 13.01.22 19:03, Mike Kravetz wrote: > Userfaultfd selftests for hugetlb does not perform UFFD_EVENT_REMAP > testing. However, mremap support was recently added in commit > 550a7d60bd5e ("mm, hugepages: add mremap() support for hugepage backed > vma"). While attempting to enable mremap support in the test, it was > discovered that the mremap test indirectly depends on MADV_DONTNEED. > > hugetlb does not support MADV_DONTNEED. However, the only thing > preventing support is a check in can_madv_lru_vma(). Simply removing > the check will enable support. > > This is sent as a RFC because there is no existing use case calling > for hugetlb MADV_DONTNEED support except possibly the userfaultfd test. > However, adding support makes sense as it is fairly trivial and brings > hugetlb functionality more in line with 'normal' memory. > Just a note: QEMU doesn't use huge anonymous memory directly (MAP_ANON | MAP_HUGE...) but instead always goes either via hugetlbfs or via memfd. For MAP_PRIVATE hugetlb mappings, fallocate(FALLOC_FL_PUNCH_HOLE) seems to get the job done (IOW: also discards private anon pages). See the comments in the QEMU code below. I remember that that is somewhat inconsistent. For ordinary MAP_PRIVATE mapped files I remember that we always need fallocate(FALLOC_FL_PUNCH_HOLE) + madvise(QEMU_MADV_DONTNEED) to make sure a) All file pages are removed b) All private anon pages are removed IIRC hugetlbfs really is different in that regard, but maybe other fs behave similarly. That's why QEMU was able to live for now without MADV_DONTNEED support for hugetlbfs and most probably won't ever need it. ... /* The logic here is messy; * madvise DONTNEED fails for hugepages * fallocate works on hugepages and shmem * shared anonymous memory requires madvise REMOVE */ need_madvise = (rb->page_size == qemu_host_page_size); need_fallocate = rb->fd != -1; if (need_fallocate) { /* For a file, this causes the area of the file to be zero'd * if read, and for hugetlbfs also causes it to be unmapped * so a userfault will trigger. */ #ifdef CONFIG_FALLOCATE_PUNCH_HOLE ret = fallocate(rb->fd, FALLOC_FL_PUNCH_HOLE | FALLOC_FL_KEEP_SIZE, start, length); if (ret) { ret = -errno; error_report("ram_block_discard_range: Failed to fallocate " "%s:%" PRIx64 " +%zx (%d)", rb->idstr, start, length, ret); goto err; } #else ret = -ENOSYS; error_report("ram_block_discard_range: fallocate not available/file" "%s:%" PRIx64 " +%zx (%d)", rb->idstr, start, length, ret); goto err; #endif } if (need_madvise) { /* For normal RAM this causes it to be unmapped, * for shared memory it causes the local mapping to disappear * and to fall back on the file contents (which we just * fallocate'd away). */ #if defined(CONFIG_MADVISE) if (qemu_ram_is_shared(rb) && rb->fd < 0) { ret = madvise(host_startaddr, length, QEMU_MADV_REMOVE); } else { ret = madvise(host_startaddr, length, QEMU_MADV_DONTNEED); } if (ret) { ret = -errno; error_report("ram_block_discard_range: Failed to discard range " "%s:%" PRIx64 " +%zx (%d)", rb->idstr, start, length, ret); goto err; } #else ...
On Thu, Jan 27, 2022 at 3:57 AM David Hildenbrand <david@redhat.com> wrote: > > On 13.01.22 19:03, Mike Kravetz wrote: > > Userfaultfd selftests for hugetlb does not perform UFFD_EVENT_REMAP > > testing. However, mremap support was recently added in commit > > 550a7d60bd5e ("mm, hugepages: add mremap() support for hugepage backed > > vma"). While attempting to enable mremap support in the test, it was > > discovered that the mremap test indirectly depends on MADV_DONTNEED. > > > > hugetlb does not support MADV_DONTNEED. However, the only thing > > preventing support is a check in can_madv_lru_vma(). Simply removing > > the check will enable support. > > > > This is sent as a RFC because there is no existing use case calling > > for hugetlb MADV_DONTNEED support except possibly the userfaultfd test. > > However, adding support makes sense as it is fairly trivial and brings > > hugetlb functionality more in line with 'normal' memory. > > > > Just a note: > > QEMU doesn't use huge anonymous memory directly (MAP_ANON | MAP_HUGE...) > but instead always goes either via hugetlbfs or via memfd. > > For MAP_PRIVATE hugetlb mappings, fallocate(FALLOC_FL_PUNCH_HOLE) seems > to get the job done (IOW: also discards private anon pages). See the > comments in the QEMU code below. I remember that that is somewhat > inconsistent. For ordinary MAP_PRIVATE mapped files I remember that we > always need fallocate(FALLOC_FL_PUNCH_HOLE) + madvise(QEMU_MADV_DONTNEED) > to make sure > > a) All file pages are removed > b) All private anon pages are removed > > IIRC hugetlbfs really is different in that regard, but maybe other fs > behave similarly. > > That's why QEMU was able to live for now without MADV_DONTNEED support > for hugetlbfs and most probably won't ever need it. Agreed, all of the production use cases I'm aware of use hugetlbfs, not MAP_HUGE... But, I would say this is convenient for testing purposes. It's slightly more convenient to not have to mount hugetlbfs / perform the associated setup for tests. Perhaps that's only a small motivation for enabling this, but then again Mike's patch to do so is likewise very small. :) > > > ... > /* The logic here is messy; > * madvise DONTNEED fails for hugepages > * fallocate works on hugepages and shmem > * shared anonymous memory requires madvise REMOVE > */ > need_madvise = (rb->page_size == qemu_host_page_size); > need_fallocate = rb->fd != -1; > if (need_fallocate) { > /* For a file, this causes the area of the file to be zero'd > * if read, and for hugetlbfs also causes it to be unmapped > * so a userfault will trigger. > */ > #ifdef CONFIG_FALLOCATE_PUNCH_HOLE > ret = fallocate(rb->fd, FALLOC_FL_PUNCH_HOLE | FALLOC_FL_KEEP_SIZE, > start, length); > if (ret) { > ret = -errno; > error_report("ram_block_discard_range: Failed to fallocate " > "%s:%" PRIx64 " +%zx (%d)", > rb->idstr, start, length, ret); > goto err; > } > #else > ret = -ENOSYS; > error_report("ram_block_discard_range: fallocate not available/file" > "%s:%" PRIx64 " +%zx (%d)", > rb->idstr, start, length, ret); > goto err; > #endif > } > if (need_madvise) { > /* For normal RAM this causes it to be unmapped, > * for shared memory it causes the local mapping to disappear > * and to fall back on the file contents (which we just > * fallocate'd away). > */ > #if defined(CONFIG_MADVISE) > if (qemu_ram_is_shared(rb) && rb->fd < 0) { > ret = madvise(host_startaddr, length, QEMU_MADV_REMOVE); > } else { > ret = madvise(host_startaddr, length, QEMU_MADV_DONTNEED); > } > if (ret) { > ret = -errno; > error_report("ram_block_discard_range: Failed to discard range " > "%s:%" PRIx64 " +%zx (%d)", > rb->idstr, start, length, ret); > goto err; > } > #else > ... > > -- > Thanks, > > David / dhildenb >
On 1/27/22 03:57, David Hildenbrand wrote: > On 13.01.22 19:03, Mike Kravetz wrote: >> Userfaultfd selftests for hugetlb does not perform UFFD_EVENT_REMAP >> testing. However, mremap support was recently added in commit >> 550a7d60bd5e ("mm, hugepages: add mremap() support for hugepage backed >> vma"). While attempting to enable mremap support in the test, it was >> discovered that the mremap test indirectly depends on MADV_DONTNEED. >> >> hugetlb does not support MADV_DONTNEED. However, the only thing >> preventing support is a check in can_madv_lru_vma(). Simply removing >> the check will enable support. >> >> This is sent as a RFC because there is no existing use case calling >> for hugetlb MADV_DONTNEED support except possibly the userfaultfd test. >> However, adding support makes sense as it is fairly trivial and brings >> hugetlb functionality more in line with 'normal' memory. >> > > Just a note: > > QEMU doesn't use huge anonymous memory directly (MAP_ANON | MAP_HUGE...) > but instead always goes either via hugetlbfs or via memfd. > > For MAP_PRIVATE hugetlb mappings, fallocate(FALLOC_FL_PUNCH_HOLE) seems > to get the job done (IOW: also discards private anon pages). See the > comments in the QEMU code below. I remember that that is somewhat > inconsistent. For ordinary MAP_PRIVATE mapped files I remember that we > always need fallocate(FALLOC_FL_PUNCH_HOLE) + madvise(QEMU_MADV_DONTNEED) > to make sure > > a) All file pages are removed > b) All private anon pages are removed > > IIRC hugetlbfs really is different in that regard, but maybe other fs > behave similarly. Yes it is really different. And, some might even consider that a bug? Imagine if those private anon (COW) pages contain important data. They could be unmapped/freed by some other process that has write access to the hugetlb file on which the private mapping is based. I believe this same issue exists for hugetlbfs ftruncate. When fallocate hole punch support was added, it was based on the ftruncate functionality. I am hesitant to change the behavior of hugetlb hole punch or truncate as people may be relying on that behavior today. Your QEMU example is one such case. Thanks,
On 27.01.22 18:52, Axel Rasmussen wrote: > On Thu, Jan 27, 2022 at 3:57 AM David Hildenbrand <david@redhat.com> wrote: >> >> On 13.01.22 19:03, Mike Kravetz wrote: >>> Userfaultfd selftests for hugetlb does not perform UFFD_EVENT_REMAP >>> testing. However, mremap support was recently added in commit >>> 550a7d60bd5e ("mm, hugepages: add mremap() support for hugepage backed >>> vma"). While attempting to enable mremap support in the test, it was >>> discovered that the mremap test indirectly depends on MADV_DONTNEED. >>> >>> hugetlb does not support MADV_DONTNEED. However, the only thing >>> preventing support is a check in can_madv_lru_vma(). Simply removing >>> the check will enable support. >>> >>> This is sent as a RFC because there is no existing use case calling >>> for hugetlb MADV_DONTNEED support except possibly the userfaultfd test. >>> However, adding support makes sense as it is fairly trivial and brings >>> hugetlb functionality more in line with 'normal' memory. >>> >> >> Just a note: >> >> QEMU doesn't use huge anonymous memory directly (MAP_ANON | MAP_HUGE...) >> but instead always goes either via hugetlbfs or via memfd. >> >> For MAP_PRIVATE hugetlb mappings, fallocate(FALLOC_FL_PUNCH_HOLE) seems >> to get the job done (IOW: also discards private anon pages). See the >> comments in the QEMU code below. I remember that that is somewhat >> inconsistent. For ordinary MAP_PRIVATE mapped files I remember that we >> always need fallocate(FALLOC_FL_PUNCH_HOLE) + madvise(QEMU_MADV_DONTNEED) >> to make sure >> >> a) All file pages are removed >> b) All private anon pages are removed >> >> IIRC hugetlbfs really is different in that regard, but maybe other fs >> behave similarly. >> >> That's why QEMU was able to live for now without MADV_DONTNEED support >> for hugetlbfs and most probably won't ever need it. > > Agreed, all of the production use cases I'm aware of use hugetlbfs, > not MAP_HUGE... > > But, I would say this is convenient for testing purposes. It's > slightly more convenient to not have to mount hugetlbfs / perform the > associated setup for tests. Creating a memfd is not too hard, but yes, not a single-liner. Maybe the uffd test should go via memfds for hugetlb instead. But maybe that limits the mremap functionality? No expert. > > Perhaps that's only a small motivation for enabling this, but then > again Mike's patch to do so is likewise very small. :) ... and apparently buggy :P
On 27.01.22 18:55, Mike Kravetz wrote: > On 1/27/22 03:57, David Hildenbrand wrote: >> On 13.01.22 19:03, Mike Kravetz wrote: >>> Userfaultfd selftests for hugetlb does not perform UFFD_EVENT_REMAP >>> testing. However, mremap support was recently added in commit >>> 550a7d60bd5e ("mm, hugepages: add mremap() support for hugepage backed >>> vma"). While attempting to enable mremap support in the test, it was >>> discovered that the mremap test indirectly depends on MADV_DONTNEED. >>> >>> hugetlb does not support MADV_DONTNEED. However, the only thing >>> preventing support is a check in can_madv_lru_vma(). Simply removing >>> the check will enable support. >>> >>> This is sent as a RFC because there is no existing use case calling >>> for hugetlb MADV_DONTNEED support except possibly the userfaultfd test. >>> However, adding support makes sense as it is fairly trivial and brings >>> hugetlb functionality more in line with 'normal' memory. >>> >> >> Just a note: >> >> QEMU doesn't use huge anonymous memory directly (MAP_ANON | MAP_HUGE...) >> but instead always goes either via hugetlbfs or via memfd. >> >> For MAP_PRIVATE hugetlb mappings, fallocate(FALLOC_FL_PUNCH_HOLE) seems >> to get the job done (IOW: also discards private anon pages). See the >> comments in the QEMU code below. I remember that that is somewhat >> inconsistent. For ordinary MAP_PRIVATE mapped files I remember that we >> always need fallocate(FALLOC_FL_PUNCH_HOLE) + madvise(QEMU_MADV_DONTNEED) >> to make sure >> >> a) All file pages are removed >> b) All private anon pages are removed >> >> IIRC hugetlbfs really is different in that regard, but maybe other fs >> behave similarly. > > Yes it is really different. And, some might even consider that a bug? > Imagine if those private anon (COW) pages contain important data. They > could be unmapped/freed by some other process that has write access to > the hugetlb file on which the private mapping is based. Right, that's also what I once worried about in QEMU code. But then I realized that any kind of fallocate(FALLOC_FL_PUNCH_HOLE) on a file shared by multiple parties mapped MAP_PRIVATE might be bogus already. Assume you have a VM running with MAP_SHARED on a file. The file contains the VM memory state. Assume you pause the VM and want to convert it into 2 instances that will continue running independently based on the captured file state. You'd have to MAP_PRIVATE the file such that both VMs start with the original state and only see their modifications. ... but if one process decides to fallocate(FALLOC_FL_PUNCH_HOLE), for example, due to memory ballooning, even a page that's still shared by both processes (!COW), you'd corrupt the other VM. So my assumption is that MAP_PRIVATE in combination with fallocate(FALLOC_FL_PUNCH_HOLE) on a file mapped by more than one process is just bogus already. > > I believe this same issue exists for hugetlbfs ftruncate. When fallocate > hole punch support was added, it was based on the ftruncate functionality. > > I am hesitant to change the behavior of hugetlb hole punch or truncate > as people may be relying on that behavior today. Your QEMU example is > one such case. Yes, I assume we're stuck with that.