Message ID | 20221108011910.350887-3-mike.kravetz@oracle.com (mailing list archive)
---|---
State | New
Series | hugetlb MADV_DONTNEED fix and zap_page_range cleanup
On Nov 7, 2022, at 5:19 PM, Mike Kravetz <mike.kravetz@oracle.com> wrote:

> zap_page_range was originally designed to unmap pages within an address
> range that could span multiple vmas.  However, today all callers of
> zap_page_range pass a range entirely within a single vma.  In addition,
> the mmu notification call within zap_page_range is not correct as it
> should be vma specific.
>
> Instead of fixing zap_page_range, change all callers to use zap_vma_range
> as it is designed for ranges within a single vma.

I understand the argument about mmu notifiers being broken (which is of
course fixable). But, are the callers really able to guarantee that the
ranges are all in a single VMA? I am not familiar with the users, but how
can tcp_zerocopy_receive(), for instance, guarantee that no one did some
mprotect() of sorts that caused the original VMA to be split?
Hi, Nadav,

On Thu, Nov 10, 2022 at 01:09:43PM -0800, Nadav Amit wrote:
> But, are the callers really able to guarantee that the ranges are all in
> a single VMA? I am not familiar with the users, but how can
> tcp_zerocopy_receive(), for instance, guarantee that no one did some
> mprotect() of sorts that caused the original VMA to be split?

Let me try to answer this one for Mike.  We have two callers in the tcp
zerocopy code for this function:

	tcp_zerocopy_vm_insert_batch_error[2095]  zap_page_range(vma, *address, maybe_zap_len);
	tcp_zerocopy_receive[2237]                zap_page_range(vma, address, total_bytes_to_map);

Both of them take the mmap lock for read, so first of all a concurrent
mprotect() is not possible.

The tcp_zerocopy_receive() call has:

	mmap_read_lock(current->mm);

	vma = vma_lookup(current->mm, address);
	if (!vma || vma->vm_ops != &tcp_vm_ops) {
		mmap_read_unlock(current->mm);
		return -EINVAL;
	}
	vma_len = min_t(unsigned long, zc->length, vma->vm_end - address);
	avail_len = min_t(u32, vma_len, inq);
	total_bytes_to_map = avail_len & ~(PAGE_SIZE - 1);
	if (total_bytes_to_map) {
		if (!(zc->flags & TCP_RECEIVE_ZEROCOPY_FLAG_TLB_CLEAN_HINT))
			zap_page_range(vma, address, total_bytes_to_map);

Here total_bytes_to_map comes from avail_len <--- vma_len, which is a min()
against the rest of the vma range.  So total_bytes_to_map will never go
beyond the vma.

The tcp_zerocopy_vm_insert_batch_error() call uses maybe_zap_len as the
length; we need to look two layers up the call chain, but ultimately it is
something smaller than the total_bytes_to_map we discussed.  Hopefully that
proves 100% safety on the tcp zerocopy side.
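For readers less familiar with that path, the clamping argument can be boiled down to a tiny standalone sketch. This is plain userspace C with hypothetical values; the function name clamp_to_vma and the constants are made up for illustration, only the min()-style logic mirrors the kernel code quoted above:

/*
 * Standalone illustration of the length clamping discussed above.
 * Not kernel code: clamp_to_vma() and the values in main() are invented
 * for this sketch; only the clamping logic follows the quoted snippet.
 */
#include <stdio.h>

#define PAGE_SIZE	4096UL

static unsigned long clamp_to_vma(unsigned long address, unsigned long vm_end,
				  unsigned long requested, unsigned long inq)
{
	unsigned long vma_len, avail_len;

	/* min_t(unsigned long, zc->length, vma->vm_end - address) */
	vma_len = requested < (vm_end - address) ? requested : (vm_end - address);
	/* min_t(u32, vma_len, inq) */
	avail_len = vma_len < inq ? vma_len : inq;

	/* Round down to whole pages, like total_bytes_to_map. */
	return avail_len & ~(PAGE_SIZE - 1);
}

int main(void)
{
	unsigned long address = 0x7f0000000000UL;
	unsigned long vm_end  = address + 16 * PAGE_SIZE;

	/* Even a huge request cannot produce a range past vm_end. */
	unsigned long len = clamp_to_vma(address, vm_end, 1UL << 30, 1UL << 30);

	printf("len = %lu pages\n", len / PAGE_SIZE);	/* prints 16 */
	return 0;
}

Because len is clamped to vma->vm_end - address before the zap, address + len can never cross the end of the looked-up vma, which is the invariant the single-vma zap relies on.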
On Nov 10, 2022, at 1:27 PM, Peter Xu <peterx@redhat.com> wrote:

> Hi, Nadav,
>
> On Thu, Nov 10, 2022 at 01:09:43PM -0800, Nadav Amit wrote:
>> But, are the callers really able to guarantee that the ranges are all in
>> a single VMA? I am not familiar with the users, but how can
>> tcp_zerocopy_receive(), for instance, guarantee that no one did some
>> mprotect() of sorts that caused the original VMA to be split?
>
> Let me try to answer this one for Mike.  We have two callers in the tcp
> zerocopy code for this function:
>
> 	tcp_zerocopy_vm_insert_batch_error[2095]  zap_page_range(vma, *address, maybe_zap_len);
> 	tcp_zerocopy_receive[2237]                zap_page_range(vma, address, total_bytes_to_map);
>
> Both of them take the mmap lock for read, so first of all a concurrent
> mprotect() is not possible.
>
> The tcp_zerocopy_receive() call has:
>
> 	mmap_read_lock(current->mm);
>
> 	vma = vma_lookup(current->mm, address);
> 	if (!vma || vma->vm_ops != &tcp_vm_ops) {
> 		mmap_read_unlock(current->mm);
> 		return -EINVAL;
> 	}
> 	vma_len = min_t(unsigned long, zc->length, vma->vm_end - address);
> 	avail_len = min_t(u32, vma_len, inq);
> 	total_bytes_to_map = avail_len & ~(PAGE_SIZE - 1);
> 	if (total_bytes_to_map) {
> 		if (!(zc->flags & TCP_RECEIVE_ZEROCOPY_FLAG_TLB_CLEAN_HINT))
> 			zap_page_range(vma, address, total_bytes_to_map);
>
> Here total_bytes_to_map comes from avail_len <--- vma_len, which is a min()
> against the rest of the vma range.  So total_bytes_to_map will never go
> beyond the vma.
>
> The tcp_zerocopy_vm_insert_batch_error() call uses maybe_zap_len as the
> length; we need to look two layers up the call chain, but ultimately it is
> something smaller than the total_bytes_to_map we discussed.  Hopefully that
> proves 100% safety on the tcp zerocopy side.

Thanks Peter for the detailed explanation.

I had another look at the code and indeed it should not break. I am not sure
whether users who zero-copy receive and mprotect() part of the memory would
not be surprised, but I guess that’s a different story, which I should
further study at some point.
On 11/10/22 14:02, Nadav Amit wrote:
> On Nov 10, 2022, at 1:27 PM, Peter Xu <peterx@redhat.com> wrote:
>
> > Hi, Nadav,
> >
> > On Thu, Nov 10, 2022 at 01:09:43PM -0800, Nadav Amit wrote:
> >> But, are the callers really able to guarantee that the ranges are all in
> >> a single VMA? I am not familiar with the users, but how can
> >> tcp_zerocopy_receive(), for instance, guarantee that no one did some
> >> mprotect() of sorts that caused the original VMA to be split?
> >
> > Let me try to answer this one for Mike.  We have two callers in the tcp
> > zerocopy code for this function:
> >
> > 	tcp_zerocopy_vm_insert_batch_error[2095]  zap_page_range(vma, *address, maybe_zap_len);
> > 	tcp_zerocopy_receive[2237]                zap_page_range(vma, address, total_bytes_to_map);
> >
> > Both of them take the mmap lock for read, so first of all a concurrent
> > mprotect() is not possible.
> >
> > The tcp_zerocopy_receive() call has:
> >
> > 	mmap_read_lock(current->mm);
> >
> > 	vma = vma_lookup(current->mm, address);
> > 	if (!vma || vma->vm_ops != &tcp_vm_ops) {
> > 		mmap_read_unlock(current->mm);
> > 		return -EINVAL;
> > 	}
> > 	vma_len = min_t(unsigned long, zc->length, vma->vm_end - address);
> > 	avail_len = min_t(u32, vma_len, inq);
> > 	total_bytes_to_map = avail_len & ~(PAGE_SIZE - 1);
> > 	if (total_bytes_to_map) {
> > 		if (!(zc->flags & TCP_RECEIVE_ZEROCOPY_FLAG_TLB_CLEAN_HINT))
> > 			zap_page_range(vma, address, total_bytes_to_map);
> >
> > Here total_bytes_to_map comes from avail_len <--- vma_len, which is a min()
> > against the rest of the vma range.  So total_bytes_to_map will never go
> > beyond the vma.
> >
> > The tcp_zerocopy_vm_insert_batch_error() call uses maybe_zap_len as the
> > length; we need to look two layers up the call chain, but ultimately it is
> > something smaller than the total_bytes_to_map we discussed.  Hopefully that
> > proves 100% safety on the tcp zerocopy side.
>
> Thanks Peter for the detailed explanation.
>
> I had another look at the code and indeed it should not break. I am not sure
> whether users who zero-copy receive and mprotect() part of the memory would
> not be surprised, but I guess that’s a different story, which I should
> further study at some point.

I did audit all calling sites and am fairly certain the passed ranges are
within a single vma.  Because of this, Peter suggested removing
zap_page_range.  If there is concern, we can instead just fix up the mmu
notifiers in zap_page_range and leave it in place.  That is what is done in
the patch currently in mm-hotfixes-unstable.
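For context, the fix Mike refers to amounts to scoping the notifier range to the one vma being unmapped. Below is a minimal sketch of such a single-vma zap, assembled from the calls visible in the removed zap_page_range() in the diff that follows; it is a simplified illustration, not the actual patch in mm-hotfixes-unstable:

/*
 * Sketch only: a single-vma zap where the mmu notifier range is
 * initialized against the one vma being unmapped, instead of being
 * reused across a loop over multiple vmas.  Built from the calls in the
 * removed zap_page_range() below; not the actual hotfix patch.
 */
static void zap_single_vma_range(struct vm_area_struct *vma,
				 unsigned long start, unsigned long size)
{
	struct mmu_notifier_range range;
	struct mmu_gather tlb;

	lru_add_drain();
	/* The notifier range covers exactly [start, start + size) of this vma. */
	mmu_notifier_range_init(&range, MMU_NOTIFY_CLEAR, 0, vma, vma->vm_mm,
				start, start + size);
	tlb_gather_mmu(&tlb, vma->vm_mm);
	update_hiwater_rss(vma->vm_mm);
	mmu_notifier_invalidate_range_start(&range);
	/* No maple-tree walk: only this vma is unmapped. */
	unmap_single_vma(&tlb, vma, start, range.end, NULL);
	mmu_notifier_invalidate_range_end(&range);
	tlb_finish_mmu(&tlb);
}

The difference from the removed zap_page_range() is the absence of the mas_find() loop, so the notifier range initialized from the first vma can never be applied to a different vma.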
diff --git a/arch/arm64/kernel/vdso.c b/arch/arm64/kernel/vdso.c
index 99ae81ab91a7..05aa0c68b609 100644
--- a/arch/arm64/kernel/vdso.c
+++ b/arch/arm64/kernel/vdso.c
@@ -141,10 +141,10 @@ int vdso_join_timens(struct task_struct *task, struct time_namespace *ns)
 		unsigned long size = vma->vm_end - vma->vm_start;
 
 		if (vma_is_special_mapping(vma, vdso_info[VDSO_ABI_AA64].dm))
-			zap_page_range(vma, vma->vm_start, size);
+			zap_vma_range(vma, vma->vm_start, size);
 #ifdef CONFIG_COMPAT_VDSO
 		if (vma_is_special_mapping(vma, vdso_info[VDSO_ABI_AA32].dm))
-			zap_page_range(vma, vma->vm_start, size);
+			zap_vma_range(vma, vma->vm_start, size);
 #endif
 	}
 
diff --git a/arch/powerpc/kernel/vdso.c b/arch/powerpc/kernel/vdso.c
index 4abc01949702..69210ca35dc8 100644
--- a/arch/powerpc/kernel/vdso.c
+++ b/arch/powerpc/kernel/vdso.c
@@ -123,7 +123,7 @@ int vdso_join_timens(struct task_struct *task, struct time_namespace *ns)
 		unsigned long size = vma->vm_end - vma->vm_start;
 
 		if (vma_is_special_mapping(vma, &vvar_spec))
-			zap_page_range(vma, vma->vm_start, size);
+			zap_vma_range(vma, vma->vm_start, size);
 	}
 
 	mmap_read_unlock(mm);
diff --git a/arch/powerpc/platforms/book3s/vas-api.c b/arch/powerpc/platforms/book3s/vas-api.c
index 40f5ae5e1238..475925723981 100644
--- a/arch/powerpc/platforms/book3s/vas-api.c
+++ b/arch/powerpc/platforms/book3s/vas-api.c
@@ -414,7 +414,7 @@ static vm_fault_t vas_mmap_fault(struct vm_fault *vmf)
 	/*
 	 * When the LPAR lost credits due to core removal or during
 	 * migration, invalidate the existing mapping for the current
-	 * paste addresses and set windows in-active (zap_page_range in
+	 * paste addresses and set windows in-active (zap_vma_range in
 	 * reconfig_close_windows()).
 	 * New mapping will be done later after migration or new credits
 	 * available. So continue to receive faults if the user space
diff --git a/arch/powerpc/platforms/pseries/vas.c b/arch/powerpc/platforms/pseries/vas.c
index 4ad6e510d405..b70afaa5e399 100644
--- a/arch/powerpc/platforms/pseries/vas.c
+++ b/arch/powerpc/platforms/pseries/vas.c
@@ -760,7 +760,7 @@ static int reconfig_close_windows(struct vas_caps *vcap, int excess_creds,
 		 * is done before the original mmap() and after the ioctl.
 		 */
 		if (vma)
-			zap_page_range(vma, vma->vm_start,
+			zap_vma_range(vma, vma->vm_start,
 					vma->vm_end - vma->vm_start);
 
 		mmap_write_unlock(task_ref->mm);
diff --git a/arch/riscv/kernel/vdso.c b/arch/riscv/kernel/vdso.c
index 123d05255fcf..47b767215d15 100644
--- a/arch/riscv/kernel/vdso.c
+++ b/arch/riscv/kernel/vdso.c
@@ -127,10 +127,10 @@ int vdso_join_timens(struct task_struct *task, struct time_namespace *ns)
 		unsigned long size = vma->vm_end - vma->vm_start;
 
 		if (vma_is_special_mapping(vma, vdso_info.dm))
-			zap_page_range(vma, vma->vm_start, size);
+			zap_vma_range(vma, vma->vm_start, size);
 #ifdef CONFIG_COMPAT
 		if (vma_is_special_mapping(vma, compat_vdso_info.dm))
-			zap_page_range(vma, vma->vm_start, size);
+			zap_vma_range(vma, vma->vm_start, size);
 #endif
 	}
 
diff --git a/arch/s390/kernel/vdso.c b/arch/s390/kernel/vdso.c
index 119328e1e2b3..af50c3cefe45 100644
--- a/arch/s390/kernel/vdso.c
+++ b/arch/s390/kernel/vdso.c
@@ -78,7 +78,7 @@ int vdso_join_timens(struct task_struct *task, struct time_namespace *ns)
 
 		if (!vma_is_special_mapping(vma, &vvar_mapping))
 			continue;
-		zap_page_range(vma, vma->vm_start, size);
+		zap_vma_range(vma, vma->vm_start, size);
 		break;
 	}
 	mmap_read_unlock(mm);
diff --git a/arch/s390/mm/gmap.c b/arch/s390/mm/gmap.c
index 02d15c8dc92e..32f1d4a3d241 100644
--- a/arch/s390/mm/gmap.c
+++ b/arch/s390/mm/gmap.c
@@ -723,7 +723,7 @@ void gmap_discard(struct gmap *gmap, unsigned long from, unsigned long to)
 		if (is_vm_hugetlb_page(vma))
 			continue;
 		size = min(to - gaddr, PMD_SIZE - (gaddr & ~PMD_MASK));
-		zap_page_range(vma, vmaddr, size);
+		zap_vma_range(vma, vmaddr, size);
 	}
 	mmap_read_unlock(gmap->mm);
 }
diff --git a/arch/x86/entry/vdso/vma.c b/arch/x86/entry/vdso/vma.c
index d45c5fcfeac2..b3c269cf28d0 100644
--- a/arch/x86/entry/vdso/vma.c
+++ b/arch/x86/entry/vdso/vma.c
@@ -134,7 +134,7 @@ int vdso_join_timens(struct task_struct *task, struct time_namespace *ns)
 		unsigned long size = vma->vm_end - vma->vm_start;
 
 		if (vma_is_special_mapping(vma, &vvar_mapping))
-			zap_page_range(vma, vma->vm_start, size);
+			zap_vma_range(vma, vma->vm_start, size);
 	}
 
 	mmap_read_unlock(mm);
diff --git a/drivers/android/binder_alloc.c b/drivers/android/binder_alloc.c
index 1c39cfce32fa..063a9b4a6c02 100644
--- a/drivers/android/binder_alloc.c
+++ b/drivers/android/binder_alloc.c
@@ -1012,7 +1012,7 @@ enum lru_status binder_alloc_free_page(struct list_head *item,
 	if (vma) {
 		trace_binder_unmap_user_start(alloc, index);
 
-		zap_page_range(vma, page_addr, PAGE_SIZE);
+		zap_vma_range(vma, page_addr, PAGE_SIZE);
 
 		trace_binder_unmap_user_end(alloc, index);
 	}
diff --git a/include/linux/mm.h b/include/linux/mm.h
index d205bcd9cd2e..16052a628ab2 100644
--- a/include/linux/mm.h
+++ b/include/linux/mm.h
@@ -1838,8 +1838,6 @@ struct page *vm_normal_page_pmd(struct vm_area_struct *vma, unsigned long addr,
 
 void zap_vma_ptes(struct vm_area_struct *vma, unsigned long address,
 		  unsigned long size);
-void zap_page_range(struct vm_area_struct *vma, unsigned long address,
-		    unsigned long size);
 void zap_vma_range(struct vm_area_struct *vma, unsigned long address,
 		    unsigned long size);
 void unmap_vmas(struct mmu_gather *tlb, struct maple_tree *mt,
diff --git a/mm/memory.c b/mm/memory.c
index af3a4724b464..a9b2aa1149b2 100644
--- a/mm/memory.c
+++ b/mm/memory.c
@@ -1686,36 +1686,6 @@ void unmap_vmas(struct mmu_gather *tlb, struct maple_tree *mt,
 	mmu_notifier_invalidate_range_end(&range);
 }
 
-/**
- * zap_page_range - remove user pages in a given range
- * @vma: vm_area_struct holding the applicable pages
- * @start: starting address of pages to zap
- * @size: number of bytes to zap
- *
- * Caller must protect the VMA list
- */
-void zap_page_range(struct vm_area_struct *vma, unsigned long start,
-		unsigned long size)
-{
-	struct maple_tree *mt = &vma->vm_mm->mm_mt;
-	unsigned long end = start + size;
-	struct mmu_notifier_range range;
-	struct mmu_gather tlb;
-	MA_STATE(mas, mt, vma->vm_end, vma->vm_end);
-
-	lru_add_drain();
-	mmu_notifier_range_init(&range, MMU_NOTIFY_CLEAR, 0, vma, vma->vm_mm,
-				start, start + size);
-	tlb_gather_mmu(&tlb, vma->vm_mm);
-	update_hiwater_rss(vma->vm_mm);
-	mmu_notifier_invalidate_range_start(&range);
-	do {
-		unmap_single_vma(&tlb, vma, start, range.end, NULL);
-	} while ((vma = mas_find(&mas, end - 1)) != NULL);
-	mmu_notifier_invalidate_range_end(&range);
-	tlb_finish_mmu(&tlb);
-}
-
 /**
  * __zap_page_range_single - remove user pages in a given range
  * @vma: vm_area_struct holding the applicable pages
diff --git a/mm/page-writeback.c b/mm/page-writeback.c
index 7e9d8d857ecc..dbfa8b2062fc 100644
--- a/mm/page-writeback.c
+++ b/mm/page-writeback.c
@@ -2601,7 +2601,7 @@ void folio_account_cleaned(struct folio *folio, struct bdi_writeback *wb)
  *
  * The caller must hold lock_page_memcg().  Most callers have the folio
  * locked.  A few have the folio blocked from truncation through other
- * means (eg zap_page_range() has it mapped and is holding the page table
+ * means (eg zap_vma_range() has it mapped and is holding the page table
  * lock).  This can also be called from mark_buffer_dirty(), which I
  * cannot prove is always protected against truncate.
  */
diff --git a/net/ipv4/tcp.c b/net/ipv4/tcp.c
index de8f0cd7cb32..dea1d72ae4e2 100644
--- a/net/ipv4/tcp.c
+++ b/net/ipv4/tcp.c
@@ -2092,7 +2092,7 @@ static int tcp_zerocopy_vm_insert_batch_error(struct vm_area_struct *vma,
 		maybe_zap_len = total_bytes_to_map -  /* All bytes to map */
 				*length +             /* Mapped or pending */
 				(pages_remaining * PAGE_SIZE); /* Failed map. */
-		zap_page_range(vma, *address, maybe_zap_len);
+		zap_vma_range(vma, *address, maybe_zap_len);
 		err = 0;
 	}
 
@@ -2100,7 +2100,7 @@ static int tcp_zerocopy_vm_insert_batch_error(struct vm_area_struct *vma,
 		unsigned long leftover_pages = pages_remaining;
 		int bytes_mapped;
 
-		/* We called zap_page_range, try to reinsert. */
+		/* We called zap_vma_range, try to reinsert. */
 		err = vm_insert_pages(vma, *address,
 				      pending_pages,
 				      &pages_remaining);
@@ -2234,7 +2234,7 @@ static int tcp_zerocopy_receive(struct sock *sk,
 	total_bytes_to_map = avail_len & ~(PAGE_SIZE - 1);
 	if (total_bytes_to_map) {
 		if (!(zc->flags & TCP_RECEIVE_ZEROCOPY_FLAG_TLB_CLEAN_HINT))
-			zap_page_range(vma, address, total_bytes_to_map);
+			zap_vma_range(vma, address, total_bytes_to_map);
 		zc->length = total_bytes_to_map;
 		zc->recv_skip_hint = 0;
 	} else {
zap_page_range was originally designed to unmap pages within an address
range that could span multiple vmas.  However, today all callers of
zap_page_range pass a range entirely within a single vma.  In addition,
the mmu notification call within zap_page_range is not correct as it
should be vma specific.

Instead of fixing zap_page_range, change all callers to use zap_vma_range
as it is designed for ranges within a single vma.

Signed-off-by: Mike Kravetz <mike.kravetz@oracle.com>
---
 arch/arm64/kernel/vdso.c                |  4 ++--
 arch/powerpc/kernel/vdso.c              |  2 +-
 arch/powerpc/platforms/book3s/vas-api.c |  2 +-
 arch/powerpc/platforms/pseries/vas.c    |  2 +-
 arch/riscv/kernel/vdso.c                |  4 ++--
 arch/s390/kernel/vdso.c                 |  2 +-
 arch/s390/mm/gmap.c                     |  2 +-
 arch/x86/entry/vdso/vma.c               |  2 +-
 drivers/android/binder_alloc.c          |  2 +-
 include/linux/mm.h                      |  2 --
 mm/memory.c                             | 30 -------------------------
 mm/page-writeback.c                     |  2 +-
 net/ipv4/tcp.c                          |  6 ++---
 13 files changed, 15 insertions(+), 47 deletions(-)
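To make the single-vma contract from the commit message concrete, a hypothetical caller would look up the vma under the mmap lock and clamp its length before calling zap_vma_range(). Illustration only: zap_user_range() is a made-up name, and zap_vma_range() is assumed to have the prototype added to include/linux/mm.h above.

/*
 * Hypothetical caller pattern, for illustration only: the range passed
 * to zap_vma_range() must lie entirely inside one vma, so the length is
 * clamped to vma->vm_end before the call.
 */
static int zap_user_range(struct mm_struct *mm, unsigned long address,
			  unsigned long len)
{
	struct vm_area_struct *vma;

	mmap_read_lock(mm);
	vma = vma_lookup(mm, address);
	if (!vma) {
		mmap_read_unlock(mm);
		return -EINVAL;
	}

	/* Never cross the end of this vma. */
	if (len > vma->vm_end - address)
		len = vma->vm_end - address;

	if (len)
		zap_vma_range(vma, address, len);

	mmap_read_unlock(mm);
	return 0;
}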