Message ID: 20230126193752.297968-1-surenb@google.com (mailing list archive)
Series: introduce vm_flags modifier functions
On Thu, Jan 26, 2023 at 11:37:51AM -0800, Suren Baghdasaryan wrote:
> There are scenarios when vm_flags can be modified without exclusive
> mmap_lock, such as:
> - after VMA was isolated and mmap_lock was downgraded or dropped
> - in exit_mmap when there are no other mm users and locking is unnecessary
> Introduce __vm_flags_mod to avoid assertions when the caller takes
> responsibility for the required locking.
> Pass a hint to untrack_pfn to conditionally use __vm_flags_mod for
> flags modification to avoid assertion.
>
> Signed-off-by: Suren Baghdasaryan <surenb@google.com>
> Acked-by: Michal Hocko <mhocko@suse.com>

Acked-by: Mike Rapoport (IBM) <rppt@kernel.org>

> ---
>  arch/x86/mm/pat/memtype.c | 10 +++++++---
>  include/linux/mm.h        | 14 ++++++++++++--
>  include/linux/pgtable.h   |  5 +++--
>  mm/memory.c               | 13 +++++++------
>  mm/memremap.c             |  4 ++--
>  mm/mmap.c                 | 16 ++++++++++------
>  6 files changed, 41 insertions(+), 21 deletions(-)
>
> diff --git a/arch/x86/mm/pat/memtype.c b/arch/x86/mm/pat/memtype.c
> index 6ca51b1aa5d9..691bf8934b6f 100644
> --- a/arch/x86/mm/pat/memtype.c
> +++ b/arch/x86/mm/pat/memtype.c
> @@ -1046,7 +1046,7 @@ void track_pfn_insert(struct vm_area_struct *vma, pgprot_t *prot, pfn_t pfn)
>   * can be for the entire vma (in which case pfn, size are zero).
>   */
>  void untrack_pfn(struct vm_area_struct *vma, unsigned long pfn,
> -		 unsigned long size)
> +		 unsigned long size, bool mm_wr_locked)
>  {
>  	resource_size_t paddr;
>  	unsigned long prot;
>
> @@ -1065,8 +1065,12 @@ void untrack_pfn(struct vm_area_struct *vma, unsigned long pfn,
>  		size = vma->vm_end - vma->vm_start;
>  	}
>  	free_pfn_range(paddr, size);
> -	if (vma)
> -		vm_flags_clear(vma, VM_PAT);
> +	if (vma) {
> +		if (mm_wr_locked)
> +			vm_flags_clear(vma, VM_PAT);
> +		else
> +			__vm_flags_mod(vma, 0, VM_PAT);
> +	}
>  }
>
>  /*
> diff --git a/include/linux/mm.h b/include/linux/mm.h
> index 3c7fc3ecaece..a00fdeb4492d 100644
> --- a/include/linux/mm.h
> +++ b/include/linux/mm.h
> @@ -656,6 +656,16 @@ static inline void vm_flags_clear(struct vm_area_struct *vma,
>  	ACCESS_PRIVATE(vma, __vm_flags) &= ~flags;
>  }
>
> +/*
> + * Use only if VMA is not part of the VMA tree or has no other users and
> + * therefore needs no locking.
> + */
> +static inline void __vm_flags_mod(struct vm_area_struct *vma,
> +				  vm_flags_t set, vm_flags_t clear)
> +{
> +	vm_flags_init(vma, (vma->vm_flags | set) & ~clear);
> +}
> +
>  /*
>   * Use only when the order of set/clear operations is unimportant, otherwise
>   * use vm_flags_{set|clear} explicitly.
> @@ -664,7 +674,7 @@ static inline void vm_flags_mod(struct vm_area_struct *vma,
>  				vm_flags_t set, vm_flags_t clear)
>  {
>  	mmap_assert_write_locked(vma->vm_mm);
> -	vm_flags_init(vma, (vma->vm_flags | set) & ~clear);
> +	__vm_flags_mod(vma, set, clear);
>  }
>
>  static inline void vma_set_anonymous(struct vm_area_struct *vma)
> @@ -2102,7 +2112,7 @@ static inline void zap_vma_pages(struct vm_area_struct *vma)
>  }
>  void unmap_vmas(struct mmu_gather *tlb, struct maple_tree *mt,
>  		struct vm_area_struct *start_vma, unsigned long start,
> -		unsigned long end);
> +		unsigned long end, bool mm_wr_locked);
>
>  struct mmu_notifier_range;
>
> diff --git a/include/linux/pgtable.h b/include/linux/pgtable.h
> index 5fd45454c073..c63cd44777ec 100644
> --- a/include/linux/pgtable.h
> +++ b/include/linux/pgtable.h
> @@ -1185,7 +1185,8 @@ static inline int track_pfn_copy(struct vm_area_struct *vma)
>   * can be for the entire vma (in which case pfn, size are zero).
>   */
>  static inline void untrack_pfn(struct vm_area_struct *vma,
> -			       unsigned long pfn, unsigned long size)
> +			       unsigned long pfn, unsigned long size,
> +			       bool mm_wr_locked)
>  {
>  }
>
> @@ -1203,7 +1204,7 @@ extern void track_pfn_insert(struct vm_area_struct *vma, pgprot_t *prot,
>  			     pfn_t pfn);
>  extern int track_pfn_copy(struct vm_area_struct *vma);
>  extern void untrack_pfn(struct vm_area_struct *vma, unsigned long pfn,
> -			unsigned long size);
> +			unsigned long size, bool mm_wr_locked);
>  extern void untrack_pfn_moved(struct vm_area_struct *vma);
>  #endif
>
> diff --git a/mm/memory.c b/mm/memory.c
> index a6316cda0e87..7a04a1130ec1 100644
> --- a/mm/memory.c
> +++ b/mm/memory.c
> @@ -1613,7 +1613,7 @@ void unmap_page_range(struct mmu_gather *tlb,
>  static void unmap_single_vma(struct mmu_gather *tlb,
>  		struct vm_area_struct *vma, unsigned long start_addr,
>  		unsigned long end_addr,
> -		struct zap_details *details)
> +		struct zap_details *details, bool mm_wr_locked)
>  {
>  	unsigned long start = max(vma->vm_start, start_addr);
>  	unsigned long end;
> @@ -1628,7 +1628,7 @@ static void unmap_single_vma(struct mmu_gather *tlb,
>  		uprobe_munmap(vma, start, end);
>
>  	if (unlikely(vma->vm_flags & VM_PFNMAP))
> -		untrack_pfn(vma, 0, 0);
> +		untrack_pfn(vma, 0, 0, mm_wr_locked);
>
>  	if (start != end) {
>  		if (unlikely(is_vm_hugetlb_page(vma))) {
> @@ -1675,7 +1675,7 @@ static void unmap_single_vma(struct mmu_gather *tlb,
>   */
>  void unmap_vmas(struct mmu_gather *tlb, struct maple_tree *mt,
>  		struct vm_area_struct *vma, unsigned long start_addr,
> -		unsigned long end_addr)
> +		unsigned long end_addr, bool mm_wr_locked)
>  {
>  	struct mmu_notifier_range range;
>  	struct zap_details details = {
> @@ -1689,7 +1689,8 @@ void unmap_vmas(struct mmu_gather *tlb, struct maple_tree *mt,
>  				start_addr, end_addr);
>  	mmu_notifier_invalidate_range_start(&range);
>  	do {
> -		unmap_single_vma(tlb, vma, start_addr, end_addr, &details);
> +		unmap_single_vma(tlb, vma, start_addr, end_addr, &details,
> +				 mm_wr_locked);
>  	} while ((vma = mas_find(&mas, end_addr - 1)) != NULL);
>  	mmu_notifier_invalidate_range_end(&range);
>  }
> @@ -1723,7 +1724,7 @@ void zap_page_range_single(struct vm_area_struct *vma, unsigned long address,
>  	 * unmap 'address-end' not 'range.start-range.end' as range
>  	 * could have been expanded for hugetlb pmd sharing.
>  	 */
> -	unmap_single_vma(&tlb, vma, address, end, details);
> +	unmap_single_vma(&tlb, vma, address, end, details, false);
>  	mmu_notifier_invalidate_range_end(&range);
>  	tlb_finish_mmu(&tlb);
>  }
> @@ -2492,7 +2493,7 @@ int remap_pfn_range(struct vm_area_struct *vma, unsigned long addr,
>
>  	err = remap_pfn_range_notrack(vma, addr, pfn, size, prot);
>  	if (err)
> -		untrack_pfn(vma, pfn, PAGE_ALIGN(size));
> +		untrack_pfn(vma, pfn, PAGE_ALIGN(size), true);
>  	return err;
>  }
>  EXPORT_SYMBOL(remap_pfn_range);
> diff --git a/mm/memremap.c b/mm/memremap.c
> index 08cbf54fe037..2f88f43d4a01 100644
> --- a/mm/memremap.c
> +++ b/mm/memremap.c
> @@ -129,7 +129,7 @@ static void pageunmap_range(struct dev_pagemap *pgmap, int range_id)
>  	}
>  	mem_hotplug_done();
>
> -	untrack_pfn(NULL, PHYS_PFN(range->start), range_len(range));
> +	untrack_pfn(NULL, PHYS_PFN(range->start), range_len(range), true);
>  	pgmap_array_delete(range);
>  }
>
> @@ -276,7 +276,7 @@ static int pagemap_range(struct dev_pagemap *pgmap, struct mhp_params *params,
>  	if (!is_private)
>  		kasan_remove_zero_shadow(__va(range->start), range_len(range));
>  err_kasan:
> -	untrack_pfn(NULL, PHYS_PFN(range->start), range_len(range));
> +	untrack_pfn(NULL, PHYS_PFN(range->start), range_len(range), true);
>  err_pfn_remap:
>  	pgmap_array_delete(range);
>  	return error;
> diff --git a/mm/mmap.c b/mm/mmap.c
> index 3d9b14d5f933..429e42c8fccc 100644
> --- a/mm/mmap.c
> +++ b/mm/mmap.c
> @@ -78,7 +78,7 @@ core_param(ignore_rlimit_data, ignore_rlimit_data, bool, 0644);
>  static void unmap_region(struct mm_struct *mm, struct maple_tree *mt,
>  		struct vm_area_struct *vma, struct vm_area_struct *prev,
>  		struct vm_area_struct *next, unsigned long start,
> -		unsigned long end);
> +		unsigned long end, bool mm_wr_locked);
>
>  static pgprot_t vm_pgprot_modify(pgprot_t oldprot, unsigned long vm_flags)
>  {
> @@ -2136,14 +2136,14 @@ static inline void remove_mt(struct mm_struct *mm, struct ma_state *mas)
>  static void unmap_region(struct mm_struct *mm, struct maple_tree *mt,
>  		struct vm_area_struct *vma, struct vm_area_struct *prev,
>  		struct vm_area_struct *next,
> -		unsigned long start, unsigned long end)
> +		unsigned long start, unsigned long end, bool mm_wr_locked)
>  {
>  	struct mmu_gather tlb;
>
>  	lru_add_drain();
>  	tlb_gather_mmu(&tlb, mm);
>  	update_hiwater_rss(mm);
> -	unmap_vmas(&tlb, mt, vma, start, end);
> +	unmap_vmas(&tlb, mt, vma, start, end, mm_wr_locked);
>  	free_pgtables(&tlb, mt, vma, prev ? prev->vm_end : FIRST_USER_ADDRESS,
>  		      next ? next->vm_start : USER_PGTABLES_CEILING);
>  	tlb_finish_mmu(&tlb);
> @@ -2391,7 +2391,11 @@ do_vmi_align_munmap(struct vma_iterator *vmi, struct vm_area_struct *vma,
>  		mmap_write_downgrade(mm);
>  	}
>
> -	unmap_region(mm, &mt_detach, vma, prev, next, start, end);
> +	/*
> +	 * We can free page tables without write-locking mmap_lock because VMAs
> +	 * were isolated before we downgraded mmap_lock.
> +	 */
> +	unmap_region(mm, &mt_detach, vma, prev, next, start, end, !downgrade);
>  	/* Statistics and freeing VMAs */
>  	mas_set(&mas_detach, start);
>  	remove_mt(mm, &mas_detach);
> @@ -2704,7 +2708,7 @@ unsigned long mmap_region(struct file *file, unsigned long addr,
>
>  		/* Undo any partial mapping done by a device driver. */
>  		unmap_region(mm, &mm->mm_mt, vma, prev, next, vma->vm_start,
> -			     vma->vm_end);
> +			     vma->vm_end, true);
>  	}
>  	if (file && (vm_flags & VM_SHARED))
>  		mapping_unmap_writable(file->f_mapping);
> @@ -3031,7 +3035,7 @@ void exit_mmap(struct mm_struct *mm)
>  	tlb_gather_mmu_fullmm(&tlb, mm);
>  	/* update_hiwater_rss(mm) here? but nobody should be looking */
>  	/* Use ULONG_MAX here to ensure all VMAs in the mm are unmapped */
> -	unmap_vmas(&tlb, &mm->mm_mt, vma, 0, ULONG_MAX);
> +	unmap_vmas(&tlb, &mm->mm_mt, vma, 0, ULONG_MAX, false);
>  	mmap_read_unlock(mm);
>
>  	/*
> --
> 2.39.1
>
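[A minimal sketch of the locking contract the two helpers establish. The callers below are hypothetical; only the helper names, their signatures, and VM_PAT are taken from the patch above, and VM_PAT itself is x86-only:]

/* mmap_lock held for write: the asserting helpers are the default. */
static void clear_pat_locked(struct vm_area_struct *vma)
{
        mmap_assert_write_locked(vma->vm_mm);
        vm_flags_clear(vma, VM_PAT);
}

/*
 * VMA already isolated from the VMA tree, or the mm has no other users
 * (as in exit_mmap): the caller owns the required exclusion, so the
 * variant without the assertion is safe.
 */
static void clear_pat_unlocked(struct vm_area_struct *vma)
{
        __vm_flags_mod(vma, 0, VM_PAT);
}

[This mirrors the untrack_pfn() change itself, where the new mm_wr_locked hint selects between the two paths.]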
On Thu, 26 Jan 2023 11:37:45 -0800
Suren Baghdasaryan <surenb@google.com> wrote:

> This patchset was originally published as a part of per-VMA locking [1] and
> was split after suggestion that it's viable on its own and to facilitate
> the review process. It is now a prerequisite for the next version of per-VMA
> lock patchset, which reuses vm_flags modifier functions to lock the VMA when
> vm_flags are being updated.
>
> VMA vm_flags modifications are usually done under exclusive mmap_lock
> protection because this attribute affects other decisions like VMA merging
> or splitting and races should be prevented. Introduce vm_flags modifier
> functions to enforce correct locking.
>
> The patchset applies cleanly over mm-unstable branch of mm tree.

With this series, vfio-pci developed a bunch of warnings around not
holding the mmap_lock write semaphore while calling
io_remap_pfn_range() from our fault handler, vfio_pci_mmap_fault().

I suspect vdpa has the same issue for their use of remap_pfn_range()
from their fault handler, JasonW, MST, FYI.

It also looks like gru_fault() would have the same issue, Dimitri.

In all cases, we're preemptively setting vm_flags to what
remap_pfn_range_notrack() uses, so I thought we were safe here as I
specifically remember trying to avoid changing vm_flags from the
fault handler. But apparently that doesn't take into account
track_pfn_remap() where VM_PAT comes into play.

The reason for using remap_pfn_range() on fault in vfio-pci is that
we're mapping device MMIO to userspace, where that MMIO can be disabled
and we'd rather zap the mapping when that occurs so that we can sigbus
the user rather than allow the user to trigger potentially fatal bus
errors on the host.

Peter Xu has suggested offline that a non-lazy approach to reinsert the
mappings might be more in line with mm expectations relative to touching
vm_flags during fault. What's the right solution here? Can the fault
handling be salvaged, is proactive remapping the right approach, or is
there something better? Thanks,

Alex
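[To make the warning concrete, a hedged sketch of the fault-handler pattern described above; this is heavily simplified, not the actual vfio code, and base_pfn is a placeholder for the device MMIO pfn:]

#include <linux/mm.h>

static unsigned long base_pfn;  /* placeholder: set up at mmap() time */

static vm_fault_t mmio_fault(struct vm_fault *vmf)
{
        struct vm_area_struct *vma = vmf->vma;

        /*
         * The fault path holds mmap_lock for read only. On x86,
         * io_remap_pfn_range() can reach track_pfn_remap(), which sets
         * VM_PAT through vm_flags_set(), and vm_flags_set() now asserts
         * that mmap_lock is held for write.
         */
        if (io_remap_pfn_range(vma, vma->vm_start, base_pfn,
                               vma->vm_end - vma->vm_start,
                               vma->vm_page_prot))
                return VM_FAULT_SIGBUS;

        return VM_FAULT_NOPAGE;
}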
On Tue, Mar 14, 2023 at 1:11 PM Alex Williamson
<alex.williamson@redhat.com> wrote:
>
> On Thu, 26 Jan 2023 11:37:45 -0800
> Suren Baghdasaryan <surenb@google.com> wrote:
>
> > This patchset was originally published as a part of per-VMA locking [1] and
> > was split after suggestion that it's viable on its own and to facilitate
> > the review process. It is now a prerequisite for the next version of per-VMA
> > lock patchset, which reuses vm_flags modifier functions to lock the VMA when
> > vm_flags are being updated.
> >
> > VMA vm_flags modifications are usually done under exclusive mmap_lock
> > protection because this attribute affects other decisions like VMA merging
> > or splitting and races should be prevented. Introduce vm_flags modifier
> > functions to enforce correct locking.
> >
> > The patchset applies cleanly over mm-unstable branch of mm tree.
>
> With this series, vfio-pci developed a bunch of warnings around not
> holding the mmap_lock write semaphore while calling
> io_remap_pfn_range() from our fault handler, vfio_pci_mmap_fault().
>
> I suspect vdpa has the same issue for their use of remap_pfn_range()
> from their fault handler, JasonW, MST, FYI.
>
> It also looks like gru_fault() would have the same issue, Dimitri.
>
> In all cases, we're preemptively setting vm_flags to what
> remap_pfn_range_notrack() uses, so I thought we were safe here as I
> specifically remember trying to avoid changing vm_flags from the
> fault handler. But apparently that doesn't take into account
> track_pfn_remap() where VM_PAT comes into play.
>
> The reason for using remap_pfn_range() on fault in vfio-pci is that
> we're mapping device MMIO to userspace, where that MMIO can be disabled
> and we'd rather zap the mapping when that occurs so that we can sigbus
> the user rather than allow the user to trigger potentially fatal bus
> errors on the host.
>
> Peter Xu has suggested offline that a non-lazy approach to reinsert the
> mappings might be more in line with mm expectations relative to touching
> vm_flags during fault. What's the right solution here? Can the fault
> handling be salvaged, is proactive remapping the right approach, or is
> there something better? Thanks,

Hi Alex,
If in your case it's safe to change vm_flags without holding exclusive
mmap_lock, maybe you can use __vm_flags_mod() the way I used it in
https://lore.kernel.org/all/20230126193752.297968-7-surenb@google.com,
while explaining why this should be safe?

>
> Alex
>
On Fri, 17 Mar 2023 12:08:32 -0700
Suren Baghdasaryan <surenb@google.com> wrote:

> On Tue, Mar 14, 2023 at 1:11 PM Alex Williamson
> <alex.williamson@redhat.com> wrote:
> >
> > On Thu, 26 Jan 2023 11:37:45 -0800
> > Suren Baghdasaryan <surenb@google.com> wrote:
> >
> > > This patchset was originally published as a part of per-VMA locking [1] and
> > > was split after suggestion that it's viable on its own and to facilitate
> > > the review process. It is now a prerequisite for the next version of per-VMA
> > > lock patchset, which reuses vm_flags modifier functions to lock the VMA when
> > > vm_flags are being updated.
> > >
> > > VMA vm_flags modifications are usually done under exclusive mmap_lock
> > > protection because this attribute affects other decisions like VMA merging
> > > or splitting and races should be prevented. Introduce vm_flags modifier
> > > functions to enforce correct locking.
> > >
> > > The patchset applies cleanly over mm-unstable branch of mm tree.
> >
> > With this series, vfio-pci developed a bunch of warnings around not
> > holding the mmap_lock write semaphore while calling
> > io_remap_pfn_range() from our fault handler, vfio_pci_mmap_fault().
> >
> > I suspect vdpa has the same issue for their use of remap_pfn_range()
> > from their fault handler, JasonW, MST, FYI.
> >
> > It also looks like gru_fault() would have the same issue, Dimitri.
> >
> > In all cases, we're preemptively setting vm_flags to what
> > remap_pfn_range_notrack() uses, so I thought we were safe here as I
> > specifically remember trying to avoid changing vm_flags from the
> > fault handler. But apparently that doesn't take into account
> > track_pfn_remap() where VM_PAT comes into play.
> >
> > The reason for using remap_pfn_range() on fault in vfio-pci is that
> > we're mapping device MMIO to userspace, where that MMIO can be disabled
> > and we'd rather zap the mapping when that occurs so that we can sigbus
> > the user rather than allow the user to trigger potentially fatal bus
> > errors on the host.
> >
> > Peter Xu has suggested offline that a non-lazy approach to reinsert the
> > mappings might be more in line with mm expectations relative to touching
> > vm_flags during fault. What's the right solution here? Can the fault
> > handling be salvaged, is proactive remapping the right approach, or is
> > there something better? Thanks,
>
> Hi Alex,
> If in your case it's safe to change vm_flags without holding exclusive
> mmap_lock, maybe you can use __vm_flags_mod() the way I used it in
> https://lore.kernel.org/all/20230126193752.297968-7-surenb@google.com,
> while explaining why this should be safe?

Hi Suren,

Thanks for the reply, but I'm not sure I'm following. Are you
suggesting a bool arg added to io_remap_pfn_range(), or some new
variant of that function to conditionally use __vm_flags_mod() in place
of vm_flags_set() across the call chain? Thanks,

Alex
On Fri, Mar 17, 2023 at 3:41 PM Alex Williamson
<alex.williamson@redhat.com> wrote:
>
> On Fri, 17 Mar 2023 12:08:32 -0700
> Suren Baghdasaryan <surenb@google.com> wrote:
>
> > On Tue, Mar 14, 2023 at 1:11 PM Alex Williamson
> > <alex.williamson@redhat.com> wrote:
> > >
> > > On Thu, 26 Jan 2023 11:37:45 -0800
> > > Suren Baghdasaryan <surenb@google.com> wrote:
> > >
> > > > This patchset was originally published as a part of per-VMA locking [1] and
> > > > was split after suggestion that it's viable on its own and to facilitate
> > > > the review process. It is now a prerequisite for the next version of per-VMA
> > > > lock patchset, which reuses vm_flags modifier functions to lock the VMA when
> > > > vm_flags are being updated.
> > > >
> > > > VMA vm_flags modifications are usually done under exclusive mmap_lock
> > > > protection because this attribute affects other decisions like VMA merging
> > > > or splitting and races should be prevented. Introduce vm_flags modifier
> > > > functions to enforce correct locking.
> > > >
> > > > The patchset applies cleanly over mm-unstable branch of mm tree.
> > >
> > > With this series, vfio-pci developed a bunch of warnings around not
> > > holding the mmap_lock write semaphore while calling
> > > io_remap_pfn_range() from our fault handler, vfio_pci_mmap_fault().
> > >
> > > I suspect vdpa has the same issue for their use of remap_pfn_range()
> > > from their fault handler, JasonW, MST, FYI.
> > >
> > > It also looks like gru_fault() would have the same issue, Dimitri.
> > >
> > > In all cases, we're preemptively setting vm_flags to what
> > > remap_pfn_range_notrack() uses, so I thought we were safe here as I
> > > specifically remember trying to avoid changing vm_flags from the
> > > fault handler. But apparently that doesn't take into account
> > > track_pfn_remap() where VM_PAT comes into play.
> > >
> > > The reason for using remap_pfn_range() on fault in vfio-pci is that
> > > we're mapping device MMIO to userspace, where that MMIO can be disabled
> > > and we'd rather zap the mapping when that occurs so that we can sigbus
> > > the user rather than allow the user to trigger potentially fatal bus
> > > errors on the host.
> > >
> > > Peter Xu has suggested offline that a non-lazy approach to reinsert the
> > > mappings might be more in line with mm expectations relative to touching
> > > vm_flags during fault. What's the right solution here? Can the fault
> > > handling be salvaged, is proactive remapping the right approach, or is
> > > there something better? Thanks,
> >
> > Hi Alex,
> > If in your case it's safe to change vm_flags without holding exclusive
> > mmap_lock, maybe you can use __vm_flags_mod() the way I used it in
> > https://lore.kernel.org/all/20230126193752.297968-7-surenb@google.com,
> > while explaining why this should be safe?
>
> Hi Suren,
>
> Thanks for the reply, but I'm not sure I'm following. Are you
> suggesting a bool arg added to io_remap_pfn_range(), or some new
> variant of that function to conditionally use __vm_flags_mod() in place
> of vm_flags_set() across the call chain? Thanks,

I think either way could work, but after taking a closer look, both
ways would be quite ugly. If we could somehow identify that we are
handling a page fault and use __vm_flags_mod() without additional
parameters it would be more palatable IMHO... Peter's suggestion to
avoid touching vm_flags during fault would be much cleaner, but I'm
not sure how easily that can be done.
>
> Alex
>
On Tue, Mar 14, 2023 at 02:11:44PM -0600, Alex Williamson wrote:
> On Thu, 26 Jan 2023 11:37:45 -0800
> Suren Baghdasaryan <surenb@google.com> wrote:
>
> > This patchset was originally published as a part of per-VMA locking [1] and
> > was split after suggestion that it's viable on its own and to facilitate
> > the review process. It is now a prerequisite for the next version of per-VMA
> > lock patchset, which reuses vm_flags modifier functions to lock the VMA when
> > vm_flags are being updated.
> >
> > VMA vm_flags modifications are usually done under exclusive mmap_lock
> > protection because this attribute affects other decisions like VMA merging
> > or splitting and races should be prevented. Introduce vm_flags modifier
> > functions to enforce correct locking.
> >
> > The patchset applies cleanly over mm-unstable branch of mm tree.
>
> With this series, vfio-pci developed a bunch of warnings around not
> holding the mmap_lock write semaphore while calling
> io_remap_pfn_range() from our fault handler, vfio_pci_mmap_fault().
>
> I suspect vdpa has the same issue for their use of remap_pfn_range()
> from their fault handler, JasonW, MST, FYI.

Yeah, IMHO this whole approach has always been a bit sketchy; it was
never intended that remap_pfn would be called from a fault handler. You
are supposed to use a vmf_insert_pfn() type API from fault handlers.

> The reason for using remap_pfn_range() on fault in vfio-pci is that
> we're mapping device MMIO to userspace, where that MMIO can be disabled
> and we'd rather zap the mapping when that occurs so that we can sigbus
> the user rather than allow the user to trigger potentially fatal bus
> errors on the host.
> Peter Xu has suggested offline that a non-lazy approach to reinsert the
> mappings might be more in line with mm expectations relative to touching
> vm_flags during fault.

Yes, I feel the same - along with completing the address_space
conversion you had started. IIRC that was part of the reason we needed
this design in vfio.

Jason
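[For comparison with the earlier sketch, a hedged sketch of the direction suggested here: vmf_insert_pfn() populates a single PTE from the fault handler and never touches vm_flags, so no write-lock assertion can trigger. base_pfn is again a placeholder, and VM_PFNMAP is assumed to be set at mmap() time:]

static vm_fault_t mmio_fault_pte(struct vm_fault *vmf)
{
        struct vm_area_struct *vma = vmf->vma;
        unsigned long pgoff = (vmf->address - vma->vm_start) >> PAGE_SHIFT;

        /* Returns VM_FAULT_NOPAGE on success or a VM_FAULT_* error. */
        return vmf_insert_pfn(vma, vmf->address, base_pfn + pgoff);
}

[Zapping on MMIO disable would then work as before, with each subsequent fault reinserting only the faulting page.]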