Message ID | 20230202112915.867409-4-usama.anjum@collabora.com (mailing list archive) |
---|---|
State | New |
Headers | show |
Series | Implement IOCTL to get and/or the clear info about PTEs | expand |
On Thu, Feb 02, 2023 at 04:29:12PM +0500, Muhammad Usama Anjum wrote: > This IOCTL, PAGEMAP_SCAN on pagemap file can be used to get and/or clear > the info about page table entries. The following operations are supported > in this ioctl: > - Get the information if the pages have been written-to (PAGE_IS_WRITTEN), > file mapped (PAGE_IS_FILE), present (PAGE_IS_PRESENT) or swapped > (PAGE_IS_SWAPPED). > - Write-protect the pages (PAGEMAP_WP_ENGAGE) to start finding which > pages have been written-to. > - Find pages which have been written-to and write protect the pages > (atomic PAGE_IS_WRITTEN + PAGEMAP_WP_ENGAGE) > > To get information about which pages have been written-to and/or write > protect the pages, following must be performed first in order: > - The userfaultfd file descriptor is created with userfaultfd syscall. > - The UFFD_FEATURE_WP_ASYNC feature is set by UFFDIO_API IOCTL. > - The memory range is registered with UFFDIO_REGISTER_MODE_WP mode > through UFFDIO_REGISTER IOCTL. > Then the any part of the registered memory or the whole memory region > can be write protected using the UFFDIO_WRITEPROTECT IOCTL or > PAGEMAP_SCAN IOCTL. > > struct pagemap_scan_args is used as the argument of the IOCTL. In this > struct: > - The range is specified through start and len. > - The output buffer of struct page_region array and size is specified as > vec and vec_len. > - The optional maximum requested pages are specified in the max_pages. > - The flags can be specified in the flags field. The PAGEMAP_WP_ENGAGE > is the only added flag at this time. > - The masks are specified in required_mask, anyof_mask, excluded_ mask > and return_mask. > > This IOCTL can be extended to get information about more PTE bits. This > IOCTL doesn't support hugetlbs at the moment. No information about > hugetlb can be obtained. This patch has evolved from a basic patch from > Gabriel Krisman Bertazi. > > Signed-off-by: Muhammad Usama Anjum <usama.anjum@collabora.com> > --- > Changes in v10: > - move changes in tools/include/uapi/linux/fs.h to separate patch > - update commit message > > Change in v8: > - Correct is_pte_uffd_wp() > - Improve readability and error checks > - Remove some un-needed code > > Changes in v7: > - Rebase on top of latest next > - Fix some corner cases > - Base soft-dirty on the uffd wp async > - Update the terminologies > - Optimize the memory usage inside the ioctl > > Changes in v6: > - Rename variables and update comments > - Make IOCTL independent of soft_dirty config > - Change masks and bitmap type to _u64 > - Improve code quality > > Changes in v5: > - Remove tlb flushing even for clear operation > > Changes in v4: > - Update the interface and implementation > > Changes in v3: > - Tighten the user-kernel interface by using explicit types and add more > error checking > > Changes in v2: > - Convert the interface from syscall to ioctl > - Remove pidfd support as it doesn't make sense in ioctl > --- > fs/proc/task_mmu.c | 290 ++++++++++++++++++++++++++++++++++++++++ > include/uapi/linux/fs.h | 50 +++++++ > 2 files changed, 340 insertions(+) > > diff --git a/fs/proc/task_mmu.c b/fs/proc/task_mmu.c > index e35a0398db63..c6bde19d63d9 100644 > --- a/fs/proc/task_mmu.c > +++ b/fs/proc/task_mmu.c > @@ -19,6 +19,7 @@ > #include <linux/shmem_fs.h> > #include <linux/uaccess.h> > #include <linux/pkeys.h> > +#include <linux/minmax.h> > > #include <asm/elf.h> > #include <asm/tlb.h> > @@ -1135,6 +1136,22 @@ static inline void clear_soft_dirty(struct vm_area_struct *vma, > } > #endif > > +static inline bool is_pte_uffd_wp(pte_t pte) > +{ > + if ((pte_present(pte) && pte_uffd_wp(pte)) || > + (pte_swp_uffd_wp_any(pte))) > + return true; > + return false; Sorry I should have mentioned this earlier: you can directly return here. return (pte_present(pte) && pte_uffd_wp(pte)) || pte_swp_uffd_wp_any(pte); > +} > + > +static inline bool is_pmd_uffd_wp(pmd_t pmd) > +{ > + if ((pmd_present(pmd) && pmd_uffd_wp(pmd)) || > + (is_swap_pmd(pmd) && pmd_swp_uffd_wp(pmd))) > + return true; > + return false; Same here. > +} > + > #if defined(CONFIG_MEM_SOFT_DIRTY) && defined(CONFIG_TRANSPARENT_HUGEPAGE) > static inline void clear_soft_dirty_pmd(struct vm_area_struct *vma, > unsigned long addr, pmd_t *pmdp) > @@ -1763,11 +1780,284 @@ static int pagemap_release(struct inode *inode, struct file *file) > return 0; > } > > +#define PAGEMAP_BITS_ALL (PAGE_IS_WRITTEN | PAGE_IS_FILE | \ > + PAGE_IS_PRESENT | PAGE_IS_SWAPPED) > +#define PAGEMAP_NON_WRITTEN_BITS (PAGE_IS_FILE | PAGE_IS_PRESENT | PAGE_IS_SWAPPED) > +#define IS_WP_ENGAGE_OP(a) (a->flags & PAGEMAP_WP_ENGAGE) > +#define IS_GET_OP(a) (a->vec) > +#define HAS_NO_SPACE(p) (p->max_pages && (p->found_pages == p->max_pages)) > + > +#define PAGEMAP_SCAN_BITMAP(wt, file, present, swap) \ > + (wt | file << 1 | present << 2 | swap << 3) > +#define IS_WT_REQUIRED(a) \ > + ((a->required_mask & PAGE_IS_WRITTEN) || \ > + (a->anyof_mask & PAGE_IS_WRITTEN)) > + > +struct pagemap_scan_private { > + struct page_region *vec; > + struct page_region prev; > + unsigned long vec_len, vec_index; > + unsigned int max_pages, found_pages, flags; > + unsigned long required_mask, anyof_mask, excluded_mask, return_mask; > +}; > + > +static int pagemap_scan_test_walk(unsigned long start, unsigned long end, struct mm_walk *walk) > +{ > + struct pagemap_scan_private *p = walk->private; > + struct vm_area_struct *vma = walk->vma; > + > + if (IS_WT_REQUIRED(p) && !userfaultfd_wp(vma) && !userfaultfd_wp_async(vma)) Should this be: (IS_WT_REQUIRED(p) && (!userfaultfd_wp(vma) || !userfaultfd_wp_async(vma))) Instead? > + return -EPERM; > + if (vma->vm_flags & VM_PFNMAP) > + return 1; > + return 0; > +} > + > +static inline int pagemap_scan_output(bool wt, bool file, bool pres, bool swap, > + struct pagemap_scan_private *p, unsigned long addr, > + unsigned int len) > +{ > + unsigned long bitmap, cur = PAGEMAP_SCAN_BITMAP(wt, file, pres, swap); > + bool cpy = true; > + struct page_region *prev = &p->prev; Nit: switch the above two lines? > + > + if (HAS_NO_SPACE(p)) > + return -ENOSPC; > + > + if (p->max_pages && p->found_pages + len >= p->max_pages) > + len = p->max_pages - p->found_pages; If "p->found_pages + len >= p->max_pages", shouldn't this already return -ENOSPC? > + if (!len) > + return -EINVAL; > + > + if (p->required_mask) > + cpy = ((p->required_mask & cur) == p->required_mask); > + if (cpy && p->anyof_mask) > + cpy = (p->anyof_mask & cur); > + if (cpy && p->excluded_mask) > + cpy = !(p->excluded_mask & cur); > + bitmap = cur & p->return_mask; > + if (cpy && bitmap) { > + if ((prev->len) && (prev->bitmap == bitmap) && > + (prev->start + prev->len * PAGE_SIZE == addr)) { > + prev->len += len; > + p->found_pages += len; > + } else if (p->vec_index < p->vec_len) { > + if (prev->len) { > + memcpy(&p->vec[p->vec_index], prev, sizeof(struct page_region)); > + p->vec_index++; > + } IIUC you can have: int pagemap_scan_deposit(p) { if (p->vec_index >= p->vec_len) return -ENOSPC; if (p->prev->len) { memcpy(&p->vec[p->vec_index], prev, sizeof(struct page_region)); p->vec_index++; } return 0; } Then call it here. I think it can also be called below to replace export_prev_to_out(). > + prev->start = addr; > + prev->len = len; > + prev->bitmap = bitmap; > + p->found_pages += len; > + } else { > + return -ENOSPC; > + } > + } > + return 0; > +} > + > +static inline int export_prev_to_out(struct pagemap_scan_private *p, struct page_region __user *vec, > + unsigned long *vec_index) > +{ > + struct page_region *prev = &p->prev; > + > + if (prev->len) { > + if (copy_to_user(&vec[*vec_index], prev, sizeof(struct page_region))) > + return -EFAULT; > + p->vec_index++; > + (*vec_index)++; > + prev->len = 0; > + } > + return 0; > +} > + > +static inline int pagemap_scan_pmd_entry(pmd_t *pmd, unsigned long start, > + unsigned long end, struct mm_walk *walk) > +{ > + struct pagemap_scan_private *p = walk->private; > + struct vm_area_struct *vma = walk->vma; > + unsigned long addr = end; This assignment is useless? > + spinlock_t *ptl; > + int ret = 0; > + pte_t *pte; > + > +#ifdef CONFIG_TRANSPARENT_HUGEPAGE > + ptl = pmd_trans_huge_lock(pmd, vma); > + if (ptl) { > + bool pmd_wt; > + > + pmd_wt = !is_pmd_uffd_wp(*pmd); > + /* > + * Break huge page into small pages if operation needs to be performed is > + * on a portion of the huge page. > + */ > + if (pmd_wt && IS_WP_ENGAGE_OP(p) && (end - start < HPAGE_SIZE)) { > + spin_unlock(ptl); > + split_huge_pmd(vma, pmd, start); > + goto process_smaller_pages; > + } > + if (IS_GET_OP(p)) > + ret = pagemap_scan_output(pmd_wt, vma->vm_file, pmd_present(*pmd), > + is_swap_pmd(*pmd), p, start, > + (end - start)/PAGE_SIZE); > + spin_unlock(ptl); > + if (!ret) { > + if (pmd_wt && IS_WP_ENGAGE_OP(p)) > + uffd_wp_range(walk->mm, vma, start, HPAGE_SIZE, true); > + } > + return ret; > + } > +process_smaller_pages: > + if (pmd_trans_unstable(pmd)) > + return 0; > +#endif /* CONFIG_TRANSPARENT_HUGEPAGE */ > + > + pte = pte_offset_map_lock(vma->vm_mm, pmd, start, &ptl); > + if (IS_GET_OP(p)) { > + for (addr = start; addr < end; pte++, addr += PAGE_SIZE) { > + ret = pagemap_scan_output(!is_pte_uffd_wp(*pte), vma->vm_file, > + pte_present(*pte), is_swap_pte(*pte), p, addr, 1); > + if (ret) > + break; > + } > + } > + pte_unmap_unlock(pte - 1, ptl); > + if ((!ret || ret == -ENOSPC) && IS_WP_ENGAGE_OP(p) && (addr - start)) > + uffd_wp_range(walk->mm, vma, start, addr - start, true); > + > + cond_resched(); > + return ret; > +} > + > +static int pagemap_scan_pte_hole(unsigned long addr, unsigned long end, int depth, > + struct mm_walk *walk) > +{ > + struct pagemap_scan_private *p = walk->private; > + struct vm_area_struct *vma = walk->vma; > + int ret = 0; > + > + if (vma) > + ret = pagemap_scan_output(false, vma->vm_file, false, false, p, addr, > + (end - addr)/PAGE_SIZE); > + return ret; > +} > + > +/* No hugetlb support is present. */ > +static const struct mm_walk_ops pagemap_scan_ops = { > + .test_walk = pagemap_scan_test_walk, > + .pmd_entry = pagemap_scan_pmd_entry, > + .pte_hole = pagemap_scan_pte_hole, > +}; > + > +static long do_pagemap_cmd(struct mm_struct *mm, struct pagemap_scan_arg *arg) > +{ > + unsigned long empty_slots, vec_index = 0; > + unsigned long __user start, end; > + unsigned long __start, __end; > + struct page_region __user *vec; > + struct pagemap_scan_private p; > + int ret = 0; > + > + start = (unsigned long)untagged_addr(arg->start); > + vec = (struct page_region *)(unsigned long)untagged_addr(arg->vec); > + > + /* Validate memory ranges */ > + if ((!IS_ALIGNED(start, PAGE_SIZE)) || (!access_ok((void __user *)start, arg->len))) > + return -EINVAL; > + if (IS_GET_OP(arg) && ((arg->vec_len == 0) || > + (!access_ok((void __user *)vec, arg->vec_len * sizeof(struct page_region))))) > + return -EINVAL; > + > + /* Detect illegal flags and masks */ > + if ((arg->flags & ~PAGEMAP_WP_ENGAGE) || (arg->required_mask & ~PAGEMAP_BITS_ALL) || > + (arg->anyof_mask & ~PAGEMAP_BITS_ALL) || (arg->excluded_mask & ~PAGEMAP_BITS_ALL) || > + (arg->return_mask & ~PAGEMAP_BITS_ALL)) > + return -EINVAL; > + if (IS_GET_OP(arg) && ((!arg->required_mask && !arg->anyof_mask && !arg->excluded_mask) || > + !arg->return_mask)) > + return -EINVAL; > + /* The non-WT flags cannot be obtained if PAGEMAP_WP_ENGAGE is also specified. */ > + if (IS_WP_ENGAGE_OP(arg) && ((arg->required_mask & PAGEMAP_NON_WRITTEN_BITS) || > + (arg->anyof_mask & PAGEMAP_NON_WRITTEN_BITS))) > + return -EINVAL; I think you said you'll clean this up a bit. I don't think so.. > + > + end = start + arg->len; > + p.max_pages = arg->max_pages; > + p.found_pages = 0; > + p.flags = arg->flags; > + p.required_mask = arg->required_mask; > + p.anyof_mask = arg->anyof_mask; > + p.excluded_mask = arg->excluded_mask; > + p.return_mask = arg->return_mask; > + p.prev.len = 0; > + p.vec_len = (PAGEMAP_WALK_SIZE >> PAGE_SHIFT); > + > + if (IS_GET_OP(arg)) { > + p.vec = kmalloc_array(p.vec_len, sizeof(struct page_region), GFP_KERNEL); > + if (!p.vec) > + return -ENOMEM; > + } else { > + p.vec = NULL; > + } > + __start = __end = start; > + while (!ret && __end < end) { > + p.vec_index = 0; > + empty_slots = arg->vec_len - vec_index; > + if (p.vec_len > empty_slots) > + p.vec_len = empty_slots; > + > + __end = (__start + PAGEMAP_WALK_SIZE) & PAGEMAP_WALK_MASK; > + if (__end > end) > + __end = end; > + > + mmap_read_lock(mm); > + ret = walk_page_range(mm, __start, __end, &pagemap_scan_ops, &p); > + mmap_read_unlock(mm); > + if (!(!ret || ret == -ENOSPC)) > + goto free_data; > + > + __start = __end; > + if (IS_GET_OP(arg) && p.vec_index) { > + if (copy_to_user(&vec[vec_index], p.vec, > + p.vec_index * sizeof(struct page_region))) { > + ret = -EFAULT; > + goto free_data; > + } > + vec_index += p.vec_index; > + } I think you can move copy_to_user() to outside the loop, then call pagemap_scan_deposit() before copy_to_user(), then I think you can drop the ugly export_prev_to_out().. > + } > + ret = export_prev_to_out(&p, vec, &vec_index); > + if (!ret) > + ret = vec_index; > +free_data: > + if (IS_GET_OP(arg)) > + kfree(p.vec); > + > + return ret; > +} > + > +static long pagemap_scan_ioctl(struct file *file, unsigned int cmd, unsigned long arg) > +{ > + struct pagemap_scan_arg __user *uarg = (struct pagemap_scan_arg __user *)arg; > + struct mm_struct *mm = file->private_data; > + struct pagemap_scan_arg argument; > + > + if (cmd == PAGEMAP_SCAN) { > + if (copy_from_user(&argument, uarg, sizeof(struct pagemap_scan_arg))) > + return -EFAULT; > + return do_pagemap_cmd(mm, &argument); > + } > + return -EINVAL; > +} > + > const struct file_operations proc_pagemap_operations = { > .llseek = mem_lseek, /* borrow this */ > .read = pagemap_read, > .open = pagemap_open, > .release = pagemap_release, > + .unlocked_ioctl = pagemap_scan_ioctl, > + .compat_ioctl = pagemap_scan_ioctl, > }; > #endif /* CONFIG_PROC_PAGE_MONITOR */ > > diff --git a/include/uapi/linux/fs.h b/include/uapi/linux/fs.h > index b7b56871029c..1ae9a8684b48 100644 > --- a/include/uapi/linux/fs.h > +++ b/include/uapi/linux/fs.h > @@ -305,4 +305,54 @@ typedef int __bitwise __kernel_rwf_t; > #define RWF_SUPPORTED (RWF_HIPRI | RWF_DSYNC | RWF_SYNC | RWF_NOWAIT |\ > RWF_APPEND) > > +/* Pagemap ioctl */ > +#define PAGEMAP_SCAN _IOWR('f', 16, struct pagemap_scan_arg) > + > +/* Bits are set in the bitmap of the page_region and masks in pagemap_scan_args */ > +#define PAGE_IS_WRITTEN (1 << 0) > +#define PAGE_IS_FILE (1 << 1) > +#define PAGE_IS_PRESENT (1 << 2) > +#define PAGE_IS_SWAPPED (1 << 3) > + > +/* > + * struct page_region - Page region with bitmap flags > + * @start: Start of the region > + * @len: Length of the region > + * bitmap: Bits sets for the region > + */ > +struct page_region { > + __u64 start; > + __u64 len; > + __u64 bitmap; > +}; > + > +/* > + * struct pagemap_scan_arg - Pagemap ioctl argument > + * @start: Starting address of the region > + * @len: Length of the region (All the pages in this length are included) > + * @vec: Address of page_region struct array for output > + * @vec_len: Length of the page_region struct array > + * @max_pages: Optional max return pages > + * @flags: Flags for the IOCTL > + * @required_mask: Required mask - All of these bits have to be set in the PTE > + * @anyof_mask: Any mask - Any of these bits are set in the PTE > + * @excluded_mask: Exclude mask - None of these bits are set in the PTE > + * @return_mask: Bits that are to be reported in page_region > + */ > +struct pagemap_scan_arg { > + __u64 start; > + __u64 len; > + __u64 vec; > + __u64 vec_len; > + __u32 max_pages; > + __u32 flags; > + __u64 required_mask; > + __u64 anyof_mask; > + __u64 excluded_mask; > + __u64 return_mask; > +}; > + > +/* Special flags */ > +#define PAGEMAP_WP_ENGAGE (1 << 0) > + > #endif /* _UAPI_LINUX_FS_H */ > -- > 2.30.2 >
On Thu, Feb 02, 2023 at 04:29:12PM +0500, Muhammad Usama Anjum wrote: ... Hi Muhammad! I'm really sorry for not commenting this code, just out of time and i fear cant look with precise care at least for some time, hopefully other CRIU guys pick it up. Anyway, here a few comment from a glance. > + > +static inline int pagemap_scan_output(bool wt, bool file, bool pres, bool swap, > + struct pagemap_scan_private *p, unsigned long addr, > + unsigned int len) > +{ This is a big function and usually it's a flag to not declare it as "inline" until there very serious reson to. > + unsigned long bitmap, cur = PAGEMAP_SCAN_BITMAP(wt, file, pres, swap); > + bool cpy = true; > + struct page_region *prev = &p->prev; > + > + if (HAS_NO_SPACE(p)) > + return -ENOSPC; > + > + if (p->max_pages && p->found_pages + len >= p->max_pages) > + len = p->max_pages - p->found_pages; > + if (!len) > + return -EINVAL; > + > + if (p->required_mask) > + cpy = ((p->required_mask & cur) == p->required_mask); > + if (cpy && p->anyof_mask) > + cpy = (p->anyof_mask & cur); > + if (cpy && p->excluded_mask) > + cpy = !(p->excluded_mask & cur); > + bitmap = cur & p->return_mask; > + if (cpy && bitmap) { You can exit early here simply if (!cpy || !bitmap) return 0; saving one tab for the code below. > + if ((prev->len) && (prev->bitmap == bitmap) && > + (prev->start + prev->len * PAGE_SIZE == addr)) { > + prev->len += len; > + p->found_pages += len; > + } else if (p->vec_index < p->vec_len) { > + if (prev->len) { > + memcpy(&p->vec[p->vec_index], prev, sizeof(struct page_region)); > + p->vec_index++; > + } > + prev->start = addr; > + prev->len = len; > + prev->bitmap = bitmap; > + p->found_pages += len; > + } else { > + return -ENOSPC; > + } > + } > + return 0; > +} > + > +static inline int export_prev_to_out(struct pagemap_scan_private *p, struct page_region __user *vec, > + unsigned long *vec_index) > +{ No need for inline either. > + struct page_region *prev = &p->prev; > + > + if (prev->len) { > + if (copy_to_user(&vec[*vec_index], prev, sizeof(struct page_region))) > + return -EFAULT; > + p->vec_index++; > + (*vec_index)++; > + prev->len = 0; > + } > + return 0; > +} > + > +static inline int pagemap_scan_pmd_entry(pmd_t *pmd, unsigned long start, > + unsigned long end, struct mm_walk *walk) > +{ Same, no need for inline. I've a few comments more in my mind will try to collect them tomorrow.
Hi Cyrill, Thank you for your time and review. On 2/9/23 3:22 AM, Cyrill Gorcunov wrote: > On Thu, Feb 02, 2023 at 04:29:12PM +0500, Muhammad Usama Anjum wrote: > ... > Hi Muhammad! I'm really sorry for not commenting this code, just out of time and i > fear cant look with precise care at least for some time, hopefully other CRIU guys > pick it up. Anyway, here a few comment from a glance. > >> + >> +static inline int pagemap_scan_output(bool wt, bool file, bool pres, bool swap, >> + struct pagemap_scan_private *p, unsigned long addr, >> + unsigned int len) >> +{ > > This is a big function and usually it's a flag to not declare it as "inline" until > there very serious reson to. I'll remove all these inline in next revision. > >> + unsigned long bitmap, cur = PAGEMAP_SCAN_BITMAP(wt, file, pres, swap); >> + bool cpy = true; >> + struct page_region *prev = &p->prev; >> + >> + if (HAS_NO_SPACE(p)) >> + return -ENOSPC; >> + >> + if (p->max_pages && p->found_pages + len >= p->max_pages) >> + len = p->max_pages - p->found_pages; >> + if (!len) >> + return -EINVAL; >> + >> + if (p->required_mask) >> + cpy = ((p->required_mask & cur) == p->required_mask); >> + if (cpy && p->anyof_mask) >> + cpy = (p->anyof_mask & cur); >> + if (cpy && p->excluded_mask) >> + cpy = !(p->excluded_mask & cur); >> + bitmap = cur & p->return_mask; >> + if (cpy && bitmap) { > > You can exit early here simply > > if (!cpy || !bitmap) > return 0; I'm avoiding an extra return here. > > saving one tab for the code below. > >> + if ((prev->len) && (prev->bitmap == bitmap) && >> + (prev->start + prev->len * PAGE_SIZE == addr)) { >> + prev->len += len; >> + p->found_pages += len; >> + } else if (p->vec_index < p->vec_len) { >> + if (prev->len) { >> + memcpy(&p->vec[p->vec_index], prev, sizeof(struct page_region)); >> + p->vec_index++; >> + } >> + prev->start = addr; >> + prev->len = len; >> + prev->bitmap = bitmap; >> + p->found_pages += len; >> + } else { >> + return -ENOSPC; >> + } >> + } >> + return 0; >> +} >> + >> +static inline int export_prev_to_out(struct pagemap_scan_private *p, struct page_region __user *vec, >> + unsigned long *vec_index) >> +{ > > No need for inline either. > >> + struct page_region *prev = &p->prev; >> + >> + if (prev->len) { >> + if (copy_to_user(&vec[*vec_index], prev, sizeof(struct page_region))) >> + return -EFAULT; >> + p->vec_index++; >> + (*vec_index)++; >> + prev->len = 0; >> + } >> + return 0; >> +} >> + >> +static inline int pagemap_scan_pmd_entry(pmd_t *pmd, unsigned long start, >> + unsigned long end, struct mm_walk *walk) >> +{ > > Same, no need for inline. I've a few comments more in my mind will try to > collect them tomorrow. Your review would be much appreciated.
On 2/9/23 3:15 AM, Peter Xu wrote: > On Thu, Feb 02, 2023 at 04:29:12PM +0500, Muhammad Usama Anjum wrote: >> This IOCTL, PAGEMAP_SCAN on pagemap file can be used to get and/or clear >> the info about page table entries. The following operations are supported >> in this ioctl: >> - Get the information if the pages have been written-to (PAGE_IS_WRITTEN), >> file mapped (PAGE_IS_FILE), present (PAGE_IS_PRESENT) or swapped >> (PAGE_IS_SWAPPED). >> - Write-protect the pages (PAGEMAP_WP_ENGAGE) to start finding which >> pages have been written-to. >> - Find pages which have been written-to and write protect the pages >> (atomic PAGE_IS_WRITTEN + PAGEMAP_WP_ENGAGE) >> >> To get information about which pages have been written-to and/or write >> protect the pages, following must be performed first in order: >> - The userfaultfd file descriptor is created with userfaultfd syscall. >> - The UFFD_FEATURE_WP_ASYNC feature is set by UFFDIO_API IOCTL. >> - The memory range is registered with UFFDIO_REGISTER_MODE_WP mode >> through UFFDIO_REGISTER IOCTL. >> Then the any part of the registered memory or the whole memory region >> can be write protected using the UFFDIO_WRITEPROTECT IOCTL or >> PAGEMAP_SCAN IOCTL. >> >> struct pagemap_scan_args is used as the argument of the IOCTL. In this >> struct: >> - The range is specified through start and len. >> - The output buffer of struct page_region array and size is specified as >> vec and vec_len. >> - The optional maximum requested pages are specified in the max_pages. >> - The flags can be specified in the flags field. The PAGEMAP_WP_ENGAGE >> is the only added flag at this time. >> - The masks are specified in required_mask, anyof_mask, excluded_ mask >> and return_mask. >> >> This IOCTL can be extended to get information about more PTE bits. This >> IOCTL doesn't support hugetlbs at the moment. No information about >> hugetlb can be obtained. This patch has evolved from a basic patch from >> Gabriel Krisman Bertazi. >> >> Signed-off-by: Muhammad Usama Anjum <usama.anjum@collabora.com> >> --- >> Changes in v10: >> - move changes in tools/include/uapi/linux/fs.h to separate patch >> - update commit message >> >> Change in v8: >> - Correct is_pte_uffd_wp() >> - Improve readability and error checks >> - Remove some un-needed code >> >> Changes in v7: >> - Rebase on top of latest next >> - Fix some corner cases >> - Base soft-dirty on the uffd wp async >> - Update the terminologies >> - Optimize the memory usage inside the ioctl >> >> Changes in v6: >> - Rename variables and update comments >> - Make IOCTL independent of soft_dirty config >> - Change masks and bitmap type to _u64 >> - Improve code quality >> >> Changes in v5: >> - Remove tlb flushing even for clear operation >> >> Changes in v4: >> - Update the interface and implementation >> >> Changes in v3: >> - Tighten the user-kernel interface by using explicit types and add more >> error checking >> >> Changes in v2: >> - Convert the interface from syscall to ioctl >> - Remove pidfd support as it doesn't make sense in ioctl >> --- >> fs/proc/task_mmu.c | 290 ++++++++++++++++++++++++++++++++++++++++ >> include/uapi/linux/fs.h | 50 +++++++ >> 2 files changed, 340 insertions(+) >> >> diff --git a/fs/proc/task_mmu.c b/fs/proc/task_mmu.c >> index e35a0398db63..c6bde19d63d9 100644 >> --- a/fs/proc/task_mmu.c >> +++ b/fs/proc/task_mmu.c >> @@ -19,6 +19,7 @@ >> #include <linux/shmem_fs.h> >> #include <linux/uaccess.h> >> #include <linux/pkeys.h> >> +#include <linux/minmax.h> >> >> #include <asm/elf.h> >> #include <asm/tlb.h> >> @@ -1135,6 +1136,22 @@ static inline void clear_soft_dirty(struct vm_area_struct *vma, >> } >> #endif >> >> +static inline bool is_pte_uffd_wp(pte_t pte) >> +{ >> + if ((pte_present(pte) && pte_uffd_wp(pte)) || >> + (pte_swp_uffd_wp_any(pte))) >> + return true; >> + return false; > > Sorry I should have mentioned this earlier: you can directly return here. No problem at all. I'm replacing these two helper functions with following in next version so that !present pages don't show as dirty: static inline bool is_pte_written(pte_t pte) { if ((pte_present(pte) && pte_uffd_wp(pte)) || (pte_swp_uffd_wp_any(pte))) return false; return (pte_present(pte) || is_swap_pte(pte)); } static inline bool is_pmd_written(pmd_t pmd) { if ((pmd_present(pmd) && pmd_uffd_wp(pmd)) || (is_swap_pmd(pmd) && pmd_swp_uffd_wp(pmd))) return false; return (pmd_present(pmd) || is_swap_pmd(pmd)); } > > return (pte_present(pte) && pte_uffd_wp(pte)) || pte_swp_uffd_wp_any(pte); > >> +} >> + >> +static inline bool is_pmd_uffd_wp(pmd_t pmd) >> +{ >> + if ((pmd_present(pmd) && pmd_uffd_wp(pmd)) || >> + (is_swap_pmd(pmd) && pmd_swp_uffd_wp(pmd))) >> + return true; >> + return false; > > Same here. > >> +} >> + >> #if defined(CONFIG_MEM_SOFT_DIRTY) && defined(CONFIG_TRANSPARENT_HUGEPAGE) >> static inline void clear_soft_dirty_pmd(struct vm_area_struct *vma, >> unsigned long addr, pmd_t *pmdp) >> @@ -1763,11 +1780,284 @@ static int pagemap_release(struct inode *inode, struct file *file) >> return 0; >> } >> >> +#define PAGEMAP_BITS_ALL (PAGE_IS_WRITTEN | PAGE_IS_FILE | \ >> + PAGE_IS_PRESENT | PAGE_IS_SWAPPED) >> +#define PAGEMAP_NON_WRITTEN_BITS (PAGE_IS_FILE | PAGE_IS_PRESENT | PAGE_IS_SWAPPED) >> +#define IS_WP_ENGAGE_OP(a) (a->flags & PAGEMAP_WP_ENGAGE) >> +#define IS_GET_OP(a) (a->vec) >> +#define HAS_NO_SPACE(p) (p->max_pages && (p->found_pages == p->max_pages)) >> + >> +#define PAGEMAP_SCAN_BITMAP(wt, file, present, swap) \ >> + (wt | file << 1 | present << 2 | swap << 3) >> +#define IS_WT_REQUIRED(a) \ >> + ((a->required_mask & PAGE_IS_WRITTEN) || \ >> + (a->anyof_mask & PAGE_IS_WRITTEN)) >> + >> +struct pagemap_scan_private { >> + struct page_region *vec; >> + struct page_region prev; >> + unsigned long vec_len, vec_index; >> + unsigned int max_pages, found_pages, flags; >> + unsigned long required_mask, anyof_mask, excluded_mask, return_mask; >> +}; >> + >> +static int pagemap_scan_test_walk(unsigned long start, unsigned long end, struct mm_walk *walk) >> +{ >> + struct pagemap_scan_private *p = walk->private; >> + struct vm_area_struct *vma = walk->vma; >> + >> + if (IS_WT_REQUIRED(p) && !userfaultfd_wp(vma) && !userfaultfd_wp_async(vma)) > > Should this be: > > (IS_WT_REQUIRED(p) && (!userfaultfd_wp(vma) || !userfaultfd_wp_async(vma))) > > Instead? Correct. I'll fix this. > >> + return -EPERM; >> + if (vma->vm_flags & VM_PFNMAP) >> + return 1; >> + return 0; >> +} >> + >> +static inline int pagemap_scan_output(bool wt, bool file, bool pres, bool swap, >> + struct pagemap_scan_private *p, unsigned long addr, >> + unsigned int len) >> +{ >> + unsigned long bitmap, cur = PAGEMAP_SCAN_BITMAP(wt, file, pres, swap); >> + bool cpy = true; >> + struct page_region *prev = &p->prev; > > Nit: switch the above two lines? I'll fix this. > >> + >> + if (HAS_NO_SPACE(p)) >> + return -ENOSPC; >> + >> + if (p->max_pages && p->found_pages + len >= p->max_pages) >> + len = p->max_pages - p->found_pages; > > If "p->found_pages + len >= p->max_pages", shouldn't this already return -ENOSPC? Length calculation is happening in the funtions calling this function. I'll move this out of here to make things logically better. > >> + if (!len) >> + return -EINVAL; >> + >> + if (p->required_mask) >> + cpy = ((p->required_mask & cur) == p->required_mask); >> + if (cpy && p->anyof_mask) >> + cpy = (p->anyof_mask & cur); >> + if (cpy && p->excluded_mask) >> + cpy = !(p->excluded_mask & cur); >> + bitmap = cur & p->return_mask; >> + if (cpy && bitmap) { >> + if ((prev->len) && (prev->bitmap == bitmap) && >> + (prev->start + prev->len * PAGE_SIZE == addr)) { >> + prev->len += len; >> + p->found_pages += len; >> + } else if (p->vec_index < p->vec_len) { >> + if (prev->len) { >> + memcpy(&p->vec[p->vec_index], prev, sizeof(struct page_region)); >> + p->vec_index++; >> + } > > IIUC you can have: > > int pagemap_scan_deposit(p) > { > if (p->vec_index >= p->vec_len) > return -ENOSPC; > > if (p->prev->len) { > memcpy(&p->vec[p->vec_index], prev, sizeof(struct page_region)); > p->vec_index++; > } > > return 0; > } > > Then call it here. I think it can also be called below to replace > export_prev_to_out(). No this isn't possible. We fill up prev until the next range doesn't merge with it. At that point, we put prev into the output buffer and new range is put into prev. Now that we have shifted to smaller page walks of <= 512 entries. We want to visit all ranges before finally putting the prev to output. Sorry to have this some what complex method. The problem is that we want to merge the consective matching regions into one entry in the output. So to achieve this among multiple different page walks, the prev is being used. Lets suppose we want to visit memory from 0x7FFF00000000 to 7FFF00400000 having length of 1024 pages and all of the memory has been written. walk_page_range() will be called 2 times. In the first call, prev will be set having length of 512. In second call, prev will be updated to 1024 as the previous range stored in prev could be extended. After this, the prev will be stored to the user output buffer consuming only 1 struct of page_range. If we store prev back to output memory in every walk_page_range() call, we wouldn't get 1 struct of page_range with length 1024. Instead we would get 2 elements of page_range structs with half the length. > >> + prev->start = addr; >> + prev->len = len; >> + prev->bitmap = bitmap; >> + p->found_pages += len; >> + } else { >> + return -ENOSPC; >> + } >> + } >> + return 0; >> +} >> + >> +static inline int export_prev_to_out(struct pagemap_scan_private *p, struct page_region __user *vec, >> + unsigned long *vec_index) >> +{ >> + struct page_region *prev = &p->prev; >> + >> + if (prev->len) { >> + if (copy_to_user(&vec[*vec_index], prev, sizeof(struct page_region))) >> + return -EFAULT; >> + p->vec_index++; >> + (*vec_index)++; >> + prev->len = 0; >> + } >> + return 0; >> +} >> + >> +static inline int pagemap_scan_pmd_entry(pmd_t *pmd, unsigned long start, >> + unsigned long end, struct mm_walk *walk) >> +{ >> + struct pagemap_scan_private *p = walk->private; >> + struct vm_area_struct *vma = walk->vma; >> + unsigned long addr = end; > > This assignment is useless? No, this assignement gets used when only the WP_ENGAGE operation is used on normal size pages. > >> + spinlock_t *ptl; >> + int ret = 0; >> + pte_t *pte; >> + >> +#ifdef CONFIG_TRANSPARENT_HUGEPAGE >> + ptl = pmd_trans_huge_lock(pmd, vma); >> + if (ptl) { >> + bool pmd_wt; >> + >> + pmd_wt = !is_pmd_uffd_wp(*pmd); >> + /* >> + * Break huge page into small pages if operation needs to be performed is >> + * on a portion of the huge page. >> + */ >> + if (pmd_wt && IS_WP_ENGAGE_OP(p) && (end - start < HPAGE_SIZE)) { >> + spin_unlock(ptl); >> + split_huge_pmd(vma, pmd, start); >> + goto process_smaller_pages; >> + } >> + if (IS_GET_OP(p)) >> + ret = pagemap_scan_output(pmd_wt, vma->vm_file, pmd_present(*pmd), >> + is_swap_pmd(*pmd), p, start, >> + (end - start)/PAGE_SIZE); >> + spin_unlock(ptl); >> + if (!ret) { >> + if (pmd_wt && IS_WP_ENGAGE_OP(p)) >> + uffd_wp_range(walk->mm, vma, start, HPAGE_SIZE, true); >> + } >> + return ret; >> + } >> +process_smaller_pages: >> + if (pmd_trans_unstable(pmd)) >> + return 0; >> +#endif /* CONFIG_TRANSPARENT_HUGEPAGE */ >> + >> + pte = pte_offset_map_lock(vma->vm_mm, pmd, start, &ptl); >> + if (IS_GET_OP(p)) { >> + for (addr = start; addr < end; pte++, addr += PAGE_SIZE) { >> + ret = pagemap_scan_output(!is_pte_uffd_wp(*pte), vma->vm_file, >> + pte_present(*pte), is_swap_pte(*pte), p, addr, 1); >> + if (ret) >> + break; >> + } >> + } >> + pte_unmap_unlock(pte - 1, ptl); >> + if ((!ret || ret == -ENOSPC) && IS_WP_ENGAGE_OP(p) && (addr - start)) >> + uffd_wp_range(walk->mm, vma, start, addr - start, true); >> + >> + cond_resched(); >> + return ret; >> +} >> + >> +static int pagemap_scan_pte_hole(unsigned long addr, unsigned long end, int depth, >> + struct mm_walk *walk) >> +{ >> + struct pagemap_scan_private *p = walk->private; >> + struct vm_area_struct *vma = walk->vma; >> + int ret = 0; >> + >> + if (vma) >> + ret = pagemap_scan_output(false, vma->vm_file, false, false, p, addr, >> + (end - addr)/PAGE_SIZE); >> + return ret; >> +} >> + >> +/* No hugetlb support is present. */ >> +static const struct mm_walk_ops pagemap_scan_ops = { >> + .test_walk = pagemap_scan_test_walk, >> + .pmd_entry = pagemap_scan_pmd_entry, >> + .pte_hole = pagemap_scan_pte_hole, >> +}; >> + >> +static long do_pagemap_cmd(struct mm_struct *mm, struct pagemap_scan_arg *arg) >> +{ >> + unsigned long empty_slots, vec_index = 0; >> + unsigned long __user start, end; >> + unsigned long __start, __end; >> + struct page_region __user *vec; >> + struct pagemap_scan_private p; >> + int ret = 0; >> + >> + start = (unsigned long)untagged_addr(arg->start); >> + vec = (struct page_region *)(unsigned long)untagged_addr(arg->vec); >> + >> + /* Validate memory ranges */ >> + if ((!IS_ALIGNED(start, PAGE_SIZE)) || (!access_ok((void __user *)start, arg->len))) >> + return -EINVAL; >> + if (IS_GET_OP(arg) && ((arg->vec_len == 0) || >> + (!access_ok((void __user *)vec, arg->vec_len * sizeof(struct page_region))))) >> + return -EINVAL; >> + >> + /* Detect illegal flags and masks */ >> + if ((arg->flags & ~PAGEMAP_WP_ENGAGE) || (arg->required_mask & ~PAGEMAP_BITS_ALL) || >> + (arg->anyof_mask & ~PAGEMAP_BITS_ALL) || (arg->excluded_mask & ~PAGEMAP_BITS_ALL) || >> + (arg->return_mask & ~PAGEMAP_BITS_ALL)) >> + return -EINVAL; >> + if (IS_GET_OP(arg) && ((!arg->required_mask && !arg->anyof_mask && !arg->excluded_mask) || >> + !arg->return_mask)) >> + return -EINVAL; >> + /* The non-WT flags cannot be obtained if PAGEMAP_WP_ENGAGE is also specified. */ >> + if (IS_WP_ENGAGE_OP(arg) && ((arg->required_mask & PAGEMAP_NON_WRITTEN_BITS) || >> + (arg->anyof_mask & PAGEMAP_NON_WRITTEN_BITS))) >> + return -EINVAL; > > I think you said you'll clean this up a bit. I don't think so.. You had showed a really clean way to put all these error checking conditions. But I wasn't able to put the current error checking conditions in that much nice way. I'd done at least something to make them look better. Sorry, I'll revisit and try to come up with easier to follow error checking conditions.
On Mon, Feb 13, 2023 at 05:55:19PM +0500, Muhammad Usama Anjum wrote: > On 2/9/23 3:15 AM, Peter Xu wrote: > > On Thu, Feb 02, 2023 at 04:29:12PM +0500, Muhammad Usama Anjum wrote: > >> This IOCTL, PAGEMAP_SCAN on pagemap file can be used to get and/or clear > >> the info about page table entries. The following operations are supported > >> in this ioctl: > >> - Get the information if the pages have been written-to (PAGE_IS_WRITTEN), > >> file mapped (PAGE_IS_FILE), present (PAGE_IS_PRESENT) or swapped > >> (PAGE_IS_SWAPPED). > >> - Write-protect the pages (PAGEMAP_WP_ENGAGE) to start finding which > >> pages have been written-to. > >> - Find pages which have been written-to and write protect the pages > >> (atomic PAGE_IS_WRITTEN + PAGEMAP_WP_ENGAGE) > >> > >> To get information about which pages have been written-to and/or write > >> protect the pages, following must be performed first in order: > >> - The userfaultfd file descriptor is created with userfaultfd syscall. > >> - The UFFD_FEATURE_WP_ASYNC feature is set by UFFDIO_API IOCTL. > >> - The memory range is registered with UFFDIO_REGISTER_MODE_WP mode > >> through UFFDIO_REGISTER IOCTL. > >> Then the any part of the registered memory or the whole memory region > >> can be write protected using the UFFDIO_WRITEPROTECT IOCTL or > >> PAGEMAP_SCAN IOCTL. > >> > >> struct pagemap_scan_args is used as the argument of the IOCTL. In this > >> struct: > >> - The range is specified through start and len. > >> - The output buffer of struct page_region array and size is specified as > >> vec and vec_len. > >> - The optional maximum requested pages are specified in the max_pages. > >> - The flags can be specified in the flags field. The PAGEMAP_WP_ENGAGE > >> is the only added flag at this time. > >> - The masks are specified in required_mask, anyof_mask, excluded_ mask > >> and return_mask. > >> > >> This IOCTL can be extended to get information about more PTE bits. This > >> IOCTL doesn't support hugetlbs at the moment. No information about > >> hugetlb can be obtained. This patch has evolved from a basic patch from > >> Gabriel Krisman Bertazi. > >> > >> Signed-off-by: Muhammad Usama Anjum <usama.anjum@collabora.com> > >> --- > >> Changes in v10: > >> - move changes in tools/include/uapi/linux/fs.h to separate patch > >> - update commit message > >> > >> Change in v8: > >> - Correct is_pte_uffd_wp() > >> - Improve readability and error checks > >> - Remove some un-needed code > >> > >> Changes in v7: > >> - Rebase on top of latest next > >> - Fix some corner cases > >> - Base soft-dirty on the uffd wp async > >> - Update the terminologies > >> - Optimize the memory usage inside the ioctl > >> > >> Changes in v6: > >> - Rename variables and update comments > >> - Make IOCTL independent of soft_dirty config > >> - Change masks and bitmap type to _u64 > >> - Improve code quality > >> > >> Changes in v5: > >> - Remove tlb flushing even for clear operation > >> > >> Changes in v4: > >> - Update the interface and implementation > >> > >> Changes in v3: > >> - Tighten the user-kernel interface by using explicit types and add more > >> error checking > >> > >> Changes in v2: > >> - Convert the interface from syscall to ioctl > >> - Remove pidfd support as it doesn't make sense in ioctl > >> --- > >> fs/proc/task_mmu.c | 290 ++++++++++++++++++++++++++++++++++++++++ > >> include/uapi/linux/fs.h | 50 +++++++ > >> 2 files changed, 340 insertions(+) > >> > >> diff --git a/fs/proc/task_mmu.c b/fs/proc/task_mmu.c > >> index e35a0398db63..c6bde19d63d9 100644 > >> --- a/fs/proc/task_mmu.c > >> +++ b/fs/proc/task_mmu.c > >> @@ -19,6 +19,7 @@ > >> #include <linux/shmem_fs.h> > >> #include <linux/uaccess.h> > >> #include <linux/pkeys.h> > >> +#include <linux/minmax.h> > >> > >> #include <asm/elf.h> > >> #include <asm/tlb.h> > >> @@ -1135,6 +1136,22 @@ static inline void clear_soft_dirty(struct vm_area_struct *vma, > >> } > >> #endif > >> > >> +static inline bool is_pte_uffd_wp(pte_t pte) > >> +{ > >> + if ((pte_present(pte) && pte_uffd_wp(pte)) || > >> + (pte_swp_uffd_wp_any(pte))) > >> + return true; > >> + return false; > > > > Sorry I should have mentioned this earlier: you can directly return here. > No problem at all. I'm replacing these two helper functions with following > in next version so that !present pages don't show as dirty: > > static inline bool is_pte_written(pte_t pte) > { > if ((pte_present(pte) && pte_uffd_wp(pte)) || > (pte_swp_uffd_wp_any(pte))) > return false; > return (pte_present(pte) || is_swap_pte(pte)); > } Could you explain why you don't want to return dirty for !present? A page can be written then swapped out. Don't you want to know that happened (from dirty tracking POV)? The code looks weird to me too.. We only have three types of ptes: (1) present, (2) swap, (3) none. Then, "(pte_present() || is_swap_pte())" is the same as !pte_none(). Is that what you're really looking for? > > static inline bool is_pmd_written(pmd_t pmd) > { > if ((pmd_present(pmd) && pmd_uffd_wp(pmd)) || > (is_swap_pmd(pmd) && pmd_swp_uffd_wp(pmd))) > return false; > return (pmd_present(pmd) || is_swap_pmd(pmd)); > } [...] > >> + bitmap = cur & p->return_mask; > >> + if (cpy && bitmap) { > >> + if ((prev->len) && (prev->bitmap == bitmap) && > >> + (prev->start + prev->len * PAGE_SIZE == addr)) { > >> + prev->len += len; > >> + p->found_pages += len; > >> + } else if (p->vec_index < p->vec_len) { > >> + if (prev->len) { > >> + memcpy(&p->vec[p->vec_index], prev, sizeof(struct page_region)); > >> + p->vec_index++; > >> + } > > > > IIUC you can have: > > > > int pagemap_scan_deposit(p) > > { > > if (p->vec_index >= p->vec_len) > > return -ENOSPC; > > > > if (p->prev->len) { > > memcpy(&p->vec[p->vec_index], prev, sizeof(struct page_region)); > > p->vec_index++; > > } > > > > return 0; > > } > > > > Then call it here. I think it can also be called below to replace > > export_prev_to_out(). > No this isn't possible. We fill up prev until the next range doesn't merge > with it. At that point, we put prev into the output buffer and new range is > put into prev. Now that we have shifted to smaller page walks of <= 512 > entries. We want to visit all ranges before finally putting the prev to > output. Sorry to have this some what complex method. The problem is that we > want to merge the consective matching regions into one entry in the output. > So to achieve this among multiple different page walks, the prev is being used. > > Lets suppose we want to visit memory from 0x7FFF00000000 to 7FFF00400000 > having length of 1024 pages and all of the memory has been written. > walk_page_range() will be called 2 times. In the first call, prev will be > set having length of 512. In second call, prev will be updated to 1024 as > the previous range stored in prev could be extended. After this, the prev > will be stored to the user output buffer consuming only 1 struct of page_range. > > If we store prev back to output memory in every walk_page_range() call, we > wouldn't get 1 struct of page_range with length 1024. Instead we would get > 2 elements of page_range structs with half the length. I didn't mean to merge PREV for each pgtable walk. What I meant is I think with such a pagemap_scan_deposit() you can rewrite it as: if (cpy && bitmap) { if ((prev->len) && (prev->bitmap == bitmap) && (prev->start + prev->len * PAGE_SIZE == addr)) { prev->len += len; p->found_pages += len; } else { if (pagemap_scan_deposit(p)) return -ENOSPC; prev->start = addr; prev->len = len; prev->bitmap = bitmap; p->found_pages += len; } } Then you can reuse pagemap_scan_deposit() when before returning to userspace, just to flush PREV to p->vec properly in a single helper. It also makes the code slightly easier to read.
On 2/14/23 2:42 AM, Peter Xu wrote: > On Mon, Feb 13, 2023 at 05:55:19PM +0500, Muhammad Usama Anjum wrote: >> On 2/9/23 3:15 AM, Peter Xu wrote: >>> On Thu, Feb 02, 2023 at 04:29:12PM +0500, Muhammad Usama Anjum wrote: >>>> This IOCTL, PAGEMAP_SCAN on pagemap file can be used to get and/or clear >>>> the info about page table entries. The following operations are supported >>>> in this ioctl: >>>> - Get the information if the pages have been written-to (PAGE_IS_WRITTEN), >>>> file mapped (PAGE_IS_FILE), present (PAGE_IS_PRESENT) or swapped >>>> (PAGE_IS_SWAPPED). >>>> - Write-protect the pages (PAGEMAP_WP_ENGAGE) to start finding which >>>> pages have been written-to. >>>> - Find pages which have been written-to and write protect the pages >>>> (atomic PAGE_IS_WRITTEN + PAGEMAP_WP_ENGAGE) >>>> >>>> To get information about which pages have been written-to and/or write >>>> protect the pages, following must be performed first in order: >>>> - The userfaultfd file descriptor is created with userfaultfd syscall. >>>> - The UFFD_FEATURE_WP_ASYNC feature is set by UFFDIO_API IOCTL. >>>> - The memory range is registered with UFFDIO_REGISTER_MODE_WP mode >>>> through UFFDIO_REGISTER IOCTL. >>>> Then the any part of the registered memory or the whole memory region >>>> can be write protected using the UFFDIO_WRITEPROTECT IOCTL or >>>> PAGEMAP_SCAN IOCTL. >>>> >>>> struct pagemap_scan_args is used as the argument of the IOCTL. In this >>>> struct: >>>> - The range is specified through start and len. >>>> - The output buffer of struct page_region array and size is specified as >>>> vec and vec_len. >>>> - The optional maximum requested pages are specified in the max_pages. >>>> - The flags can be specified in the flags field. The PAGEMAP_WP_ENGAGE >>>> is the only added flag at this time. >>>> - The masks are specified in required_mask, anyof_mask, excluded_ mask >>>> and return_mask. >>>> >>>> This IOCTL can be extended to get information about more PTE bits. This >>>> IOCTL doesn't support hugetlbs at the moment. No information about >>>> hugetlb can be obtained. This patch has evolved from a basic patch from >>>> Gabriel Krisman Bertazi. >>>> >>>> Signed-off-by: Muhammad Usama Anjum <usama.anjum@collabora.com> >>>> --- >>>> Changes in v10: >>>> - move changes in tools/include/uapi/linux/fs.h to separate patch >>>> - update commit message >>>> >>>> Change in v8: >>>> - Correct is_pte_uffd_wp() >>>> - Improve readability and error checks >>>> - Remove some un-needed code >>>> >>>> Changes in v7: >>>> - Rebase on top of latest next >>>> - Fix some corner cases >>>> - Base soft-dirty on the uffd wp async >>>> - Update the terminologies >>>> - Optimize the memory usage inside the ioctl >>>> >>>> Changes in v6: >>>> - Rename variables and update comments >>>> - Make IOCTL independent of soft_dirty config >>>> - Change masks and bitmap type to _u64 >>>> - Improve code quality >>>> >>>> Changes in v5: >>>> - Remove tlb flushing even for clear operation >>>> >>>> Changes in v4: >>>> - Update the interface and implementation >>>> >>>> Changes in v3: >>>> - Tighten the user-kernel interface by using explicit types and add more >>>> error checking >>>> >>>> Changes in v2: >>>> - Convert the interface from syscall to ioctl >>>> - Remove pidfd support as it doesn't make sense in ioctl >>>> --- >>>> fs/proc/task_mmu.c | 290 ++++++++++++++++++++++++++++++++++++++++ >>>> include/uapi/linux/fs.h | 50 +++++++ >>>> 2 files changed, 340 insertions(+) >>>> >>>> diff --git a/fs/proc/task_mmu.c b/fs/proc/task_mmu.c >>>> index e35a0398db63..c6bde19d63d9 100644 >>>> --- a/fs/proc/task_mmu.c >>>> +++ b/fs/proc/task_mmu.c >>>> @@ -19,6 +19,7 @@ >>>> #include <linux/shmem_fs.h> >>>> #include <linux/uaccess.h> >>>> #include <linux/pkeys.h> >>>> +#include <linux/minmax.h> >>>> >>>> #include <asm/elf.h> >>>> #include <asm/tlb.h> >>>> @@ -1135,6 +1136,22 @@ static inline void clear_soft_dirty(struct vm_area_struct *vma, >>>> } >>>> #endif >>>> >>>> +static inline bool is_pte_uffd_wp(pte_t pte) >>>> +{ >>>> + if ((pte_present(pte) && pte_uffd_wp(pte)) || >>>> + (pte_swp_uffd_wp_any(pte))) >>>> + return true; >>>> + return false; >>> >>> Sorry I should have mentioned this earlier: you can directly return here. >> No problem at all. I'm replacing these two helper functions with following >> in next version so that !present pages don't show as dirty: >> >> static inline bool is_pte_written(pte_t pte) >> { >> if ((pte_present(pte) && pte_uffd_wp(pte)) || >> (pte_swp_uffd_wp_any(pte))) >> return false; >> return (pte_present(pte) || is_swap_pte(pte)); >> } > > Could you explain why you don't want to return dirty for !present? A page > can be written then swapped out. Don't you want to know that happened > (from dirty tracking POV)? > > The code looks weird to me too.. We only have three types of ptes: (1) > present, (2) swap, (3) none. > > Then, "(pte_present() || is_swap_pte())" is the same as !pte_none(). Is > that what you're really looking for? Yes, this is what I've been trying to do. I'll use !pte_none() to make it simpler. > >> >> static inline bool is_pmd_written(pmd_t pmd) >> { >> if ((pmd_present(pmd) && pmd_uffd_wp(pmd)) || >> (is_swap_pmd(pmd) && pmd_swp_uffd_wp(pmd))) >> return false; >> return (pmd_present(pmd) || is_swap_pmd(pmd)); >> } > > [...] > >>>> + bitmap = cur & p->return_mask; >>>> + if (cpy && bitmap) { >>>> + if ((prev->len) && (prev->bitmap == bitmap) && >>>> + (prev->start + prev->len * PAGE_SIZE == addr)) { >>>> + prev->len += len; >>>> + p->found_pages += len; >>>> + } else if (p->vec_index < p->vec_len) { >>>> + if (prev->len) { >>>> + memcpy(&p->vec[p->vec_index], prev, sizeof(struct page_region)); >>>> + p->vec_index++; >>>> + } >>> >>> IIUC you can have: >>> >>> int pagemap_scan_deposit(p) >>> { >>> if (p->vec_index >= p->vec_len) >>> return -ENOSPC; >>> >>> if (p->prev->len) { >>> memcpy(&p->vec[p->vec_index], prev, sizeof(struct page_region)); >>> p->vec_index++; >>> } >>> >>> return 0; >>> } >>> >>> Then call it here. I think it can also be called below to replace >>> export_prev_to_out(). >> No this isn't possible. We fill up prev until the next range doesn't merge >> with it. At that point, we put prev into the output buffer and new range is >> put into prev. Now that we have shifted to smaller page walks of <= 512 >> entries. We want to visit all ranges before finally putting the prev to >> output. Sorry to have this some what complex method. The problem is that we >> want to merge the consective matching regions into one entry in the output. >> So to achieve this among multiple different page walks, the prev is being used. >> >> Lets suppose we want to visit memory from 0x7FFF00000000 to 7FFF00400000 >> having length of 1024 pages and all of the memory has been written. >> walk_page_range() will be called 2 times. In the first call, prev will be >> set having length of 512. In second call, prev will be updated to 1024 as >> the previous range stored in prev could be extended. After this, the prev >> will be stored to the user output buffer consuming only 1 struct of page_range. >> >> If we store prev back to output memory in every walk_page_range() call, we >> wouldn't get 1 struct of page_range with length 1024. Instead we would get >> 2 elements of page_range structs with half the length. > > I didn't mean to merge PREV for each pgtable walk. What I meant is I think > with such a pagemap_scan_deposit() you can rewrite it as: > > if (cpy && bitmap) { > if ((prev->len) && (prev->bitmap == bitmap) && > (prev->start + prev->len * PAGE_SIZE == addr)) { > prev->len += len; > p->found_pages += len; > } else { > if (pagemap_scan_deposit(p)) > return -ENOSPC; > prev->start = addr; > prev->len = len; > prev->bitmap = bitmap; > p->found_pages += len; > } > } > > Then you can reuse pagemap_scan_deposit() when before returning to > userspace, just to flush PREV to p->vec properly in a single helper. > It also makes the code slightly easier to read. Yeah, this would have worked as you have described. But in pagemap_scan_output(), we are flushing prev to p->vec. But later in export_prev_to_out() we need to flush prev to user_memory directly. >
On Tue, Feb 14, 2023 at 12:57:21PM +0500, Muhammad Usama Anjum wrote: > On 2/14/23 2:42 AM, Peter Xu wrote: > > On Mon, Feb 13, 2023 at 05:55:19PM +0500, Muhammad Usama Anjum wrote: > >> On 2/9/23 3:15 AM, Peter Xu wrote: > >>> On Thu, Feb 02, 2023 at 04:29:12PM +0500, Muhammad Usama Anjum wrote: > >>>> This IOCTL, PAGEMAP_SCAN on pagemap file can be used to get and/or clear > >>>> the info about page table entries. The following operations are supported > >>>> in this ioctl: > >>>> - Get the information if the pages have been written-to (PAGE_IS_WRITTEN), > >>>> file mapped (PAGE_IS_FILE), present (PAGE_IS_PRESENT) or swapped > >>>> (PAGE_IS_SWAPPED). > >>>> - Write-protect the pages (PAGEMAP_WP_ENGAGE) to start finding which > >>>> pages have been written-to. > >>>> - Find pages which have been written-to and write protect the pages > >>>> (atomic PAGE_IS_WRITTEN + PAGEMAP_WP_ENGAGE) > >>>> > >>>> To get information about which pages have been written-to and/or write > >>>> protect the pages, following must be performed first in order: > >>>> - The userfaultfd file descriptor is created with userfaultfd syscall. > >>>> - The UFFD_FEATURE_WP_ASYNC feature is set by UFFDIO_API IOCTL. > >>>> - The memory range is registered with UFFDIO_REGISTER_MODE_WP mode > >>>> through UFFDIO_REGISTER IOCTL. > >>>> Then the any part of the registered memory or the whole memory region > >>>> can be write protected using the UFFDIO_WRITEPROTECT IOCTL or > >>>> PAGEMAP_SCAN IOCTL. > >>>> > >>>> struct pagemap_scan_args is used as the argument of the IOCTL. In this > >>>> struct: > >>>> - The range is specified through start and len. > >>>> - The output buffer of struct page_region array and size is specified as > >>>> vec and vec_len. > >>>> - The optional maximum requested pages are specified in the max_pages. > >>>> - The flags can be specified in the flags field. The PAGEMAP_WP_ENGAGE > >>>> is the only added flag at this time. > >>>> - The masks are specified in required_mask, anyof_mask, excluded_ mask > >>>> and return_mask. > >>>> > >>>> This IOCTL can be extended to get information about more PTE bits. This > >>>> IOCTL doesn't support hugetlbs at the moment. No information about > >>>> hugetlb can be obtained. This patch has evolved from a basic patch from > >>>> Gabriel Krisman Bertazi. > >>>> > >>>> Signed-off-by: Muhammad Usama Anjum <usama.anjum@collabora.com> > >>>> --- > >>>> Changes in v10: > >>>> - move changes in tools/include/uapi/linux/fs.h to separate patch > >>>> - update commit message > >>>> > >>>> Change in v8: > >>>> - Correct is_pte_uffd_wp() > >>>> - Improve readability and error checks > >>>> - Remove some un-needed code > >>>> > >>>> Changes in v7: > >>>> - Rebase on top of latest next > >>>> - Fix some corner cases > >>>> - Base soft-dirty on the uffd wp async > >>>> - Update the terminologies > >>>> - Optimize the memory usage inside the ioctl > >>>> > >>>> Changes in v6: > >>>> - Rename variables and update comments > >>>> - Make IOCTL independent of soft_dirty config > >>>> - Change masks and bitmap type to _u64 > >>>> - Improve code quality > >>>> > >>>> Changes in v5: > >>>> - Remove tlb flushing even for clear operation > >>>> > >>>> Changes in v4: > >>>> - Update the interface and implementation > >>>> > >>>> Changes in v3: > >>>> - Tighten the user-kernel interface by using explicit types and add more > >>>> error checking > >>>> > >>>> Changes in v2: > >>>> - Convert the interface from syscall to ioctl > >>>> - Remove pidfd support as it doesn't make sense in ioctl > >>>> --- > >>>> fs/proc/task_mmu.c | 290 ++++++++++++++++++++++++++++++++++++++++ > >>>> include/uapi/linux/fs.h | 50 +++++++ > >>>> 2 files changed, 340 insertions(+) > >>>> > >>>> diff --git a/fs/proc/task_mmu.c b/fs/proc/task_mmu.c > >>>> index e35a0398db63..c6bde19d63d9 100644 > >>>> --- a/fs/proc/task_mmu.c > >>>> +++ b/fs/proc/task_mmu.c > >>>> @@ -19,6 +19,7 @@ > >>>> #include <linux/shmem_fs.h> > >>>> #include <linux/uaccess.h> > >>>> #include <linux/pkeys.h> > >>>> +#include <linux/minmax.h> > >>>> > >>>> #include <asm/elf.h> > >>>> #include <asm/tlb.h> > >>>> @@ -1135,6 +1136,22 @@ static inline void clear_soft_dirty(struct vm_area_struct *vma, > >>>> } > >>>> #endif > >>>> > >>>> +static inline bool is_pte_uffd_wp(pte_t pte) > >>>> +{ > >>>> + if ((pte_present(pte) && pte_uffd_wp(pte)) || > >>>> + (pte_swp_uffd_wp_any(pte))) > >>>> + return true; > >>>> + return false; > >>> > >>> Sorry I should have mentioned this earlier: you can directly return here. > >> No problem at all. I'm replacing these two helper functions with following > >> in next version so that !present pages don't show as dirty: > >> > >> static inline bool is_pte_written(pte_t pte) > >> { > >> if ((pte_present(pte) && pte_uffd_wp(pte)) || > >> (pte_swp_uffd_wp_any(pte))) > >> return false; > >> return (pte_present(pte) || is_swap_pte(pte)); > >> } > > > > Could you explain why you don't want to return dirty for !present? A page > > can be written then swapped out. Don't you want to know that happened > > (from dirty tracking POV)? > > > > The code looks weird to me too.. We only have three types of ptes: (1) > > present, (2) swap, (3) none. > > > > Then, "(pte_present() || is_swap_pte())" is the same as !pte_none(). Is > > that what you're really looking for? > Yes, this is what I've been trying to do. I'll use !pte_none() to make it > simpler. Ah I think I see what you wanted to do now.. But I'm afraid it won't work for all cases. So IIUC the problem is anon pte can be empty, but since uffd-wp bit doesn't persist on anon (but none) ptes, then we got it lost and we cannot identify it from pages being written. Your solution will solve problem for anonymous, but I think it'll break file memories. Example: Consider one shmem page that got mapped, write protected (using UFFDIO_WP ioctl), written again (removing uffd-wp bit automatically), then zapped. The pte will be pte_none() but it's actually written, afaiu. Maybe it's time we should introduce UFFD_FEATURE_WP_ZEROPAGE, so we'll need to install pte markers for anonymous too (then it will work similarly like shmem/hugetlbfs, that we'll report writting to zero pages), then you'll need to have the new UFFD_FEATURE_WP_ASYNC depend on it. With that I think you can keep using the old check and it should start to work. Please let me know if my understanding is correct above. I'll see whether I can quickly play with UFFD_FEATURE_WP_ZEROPAGE with some patch at the meantime. That's something we wanted before too, when the app cares about zero pages on anon. We used to populate the pages before doing ioctl(UFFDIO_WP) to make sure zero pages will be repoted too, but that flag should be more efficient. > > > > >> > >> static inline bool is_pmd_written(pmd_t pmd) > >> { > >> if ((pmd_present(pmd) && pmd_uffd_wp(pmd)) || > >> (is_swap_pmd(pmd) && pmd_swp_uffd_wp(pmd))) > >> return false; > >> return (pmd_present(pmd) || is_swap_pmd(pmd)); > >> } > > > > [...] > > > >>>> + bitmap = cur & p->return_mask; > >>>> + if (cpy && bitmap) { > >>>> + if ((prev->len) && (prev->bitmap == bitmap) && > >>>> + (prev->start + prev->len * PAGE_SIZE == addr)) { > >>>> + prev->len += len; > >>>> + p->found_pages += len; > >>>> + } else if (p->vec_index < p->vec_len) { > >>>> + if (prev->len) { > >>>> + memcpy(&p->vec[p->vec_index], prev, sizeof(struct page_region)); > >>>> + p->vec_index++; > >>>> + } > >>> > >>> IIUC you can have: > >>> > >>> int pagemap_scan_deposit(p) > >>> { > >>> if (p->vec_index >= p->vec_len) > >>> return -ENOSPC; > >>> > >>> if (p->prev->len) { > >>> memcpy(&p->vec[p->vec_index], prev, sizeof(struct page_region)); > >>> p->vec_index++; > >>> } > >>> > >>> return 0; > >>> } > >>> > >>> Then call it here. I think it can also be called below to replace > >>> export_prev_to_out(). > >> No this isn't possible. We fill up prev until the next range doesn't merge > >> with it. At that point, we put prev into the output buffer and new range is > >> put into prev. Now that we have shifted to smaller page walks of <= 512 > >> entries. We want to visit all ranges before finally putting the prev to > >> output. Sorry to have this some what complex method. The problem is that we > >> want to merge the consective matching regions into one entry in the output. > >> So to achieve this among multiple different page walks, the prev is being used. > >> > >> Lets suppose we want to visit memory from 0x7FFF00000000 to 7FFF00400000 > >> having length of 1024 pages and all of the memory has been written. > >> walk_page_range() will be called 2 times. In the first call, prev will be > >> set having length of 512. In second call, prev will be updated to 1024 as > >> the previous range stored in prev could be extended. After this, the prev > >> will be stored to the user output buffer consuming only 1 struct of page_range. > >> > >> If we store prev back to output memory in every walk_page_range() call, we > >> wouldn't get 1 struct of page_range with length 1024. Instead we would get > >> 2 elements of page_range structs with half the length. > > > > I didn't mean to merge PREV for each pgtable walk. What I meant is I think > > with such a pagemap_scan_deposit() you can rewrite it as: > > > > if (cpy && bitmap) { > > if ((prev->len) && (prev->bitmap == bitmap) && > > (prev->start + prev->len * PAGE_SIZE == addr)) { > > prev->len += len; > > p->found_pages += len; > > } else { > > if (pagemap_scan_deposit(p)) > > return -ENOSPC; > > prev->start = addr; > > prev->len = len; > > prev->bitmap = bitmap; > > p->found_pages += len; > > } > > } > > > > Then you can reuse pagemap_scan_deposit() when before returning to > > userspace, just to flush PREV to p->vec properly in a single helper. > > It also makes the code slightly easier to read. > Yeah, this would have worked as you have described. But in > pagemap_scan_output(), we are flushing prev to p->vec. But later in > export_prev_to_out() we need to flush prev to user_memory directly. I think there's a loop to copy_to_user(). Could you use the new helper so the copy_to_user() loop will work without export_prev_to_out()? I really hope we can get rid of export_prev_to_out(). Thanks,
On 2/15/23 1:59 AM, Peter Xu wrote: [..] >>>> static inline bool is_pte_written(pte_t pte) >>>> { >>>> if ((pte_present(pte) && pte_uffd_wp(pte)) || >>>> (pte_swp_uffd_wp_any(pte))) >>>> return false; >>>> return (pte_present(pte) || is_swap_pte(pte)); >>>> } >>> >>> Could you explain why you don't want to return dirty for !present? A page >>> can be written then swapped out. Don't you want to know that happened >>> (from dirty tracking POV)? >>> >>> The code looks weird to me too.. We only have three types of ptes: (1) >>> present, (2) swap, (3) none. >>> >>> Then, "(pte_present() || is_swap_pte())" is the same as !pte_none(). Is >>> that what you're really looking for? >> Yes, this is what I've been trying to do. I'll use !pte_none() to make it >> simpler. > > Ah I think I see what you wanted to do now.. But I'm afraid it won't work > for all cases. > > So IIUC the problem is anon pte can be empty, but since uffd-wp bit doesn't > persist on anon (but none) ptes, then we got it lost and we cannot identify > it from pages being written. Your solution will solve problem for > anonymous, but I think it'll break file memories. > > Example: > > Consider one shmem page that got mapped, write protected (using UFFDIO_WP > ioctl), written again (removing uffd-wp bit automatically), then zapped. > The pte will be pte_none() but it's actually written, afaiu. > > Maybe it's time we should introduce UFFD_FEATURE_WP_ZEROPAGE, so we'll need > to install pte markers for anonymous too (then it will work similarly like > shmem/hugetlbfs, that we'll report writting to zero pages), then you'll > need to have the new UFFD_FEATURE_WP_ASYNC depend on it. With that I think > you can keep using the old check and it should start to work. > > Please let me know if my understanding is correct above. Thank you for identifying it. Your understanding seems on point. I'll have research things up about PTE Markers. I'm looking at your patches about it [1]. Can you refer me to "mm alignment sessions" discussion in form of presentation or if any transcript is available? > > I'll see whether I can quickly play with UFFD_FEATURE_WP_ZEROPAGE with some > patch at the meantime. That's something we wanted before too, when the app > cares about zero pages on anon. We used to populate the pages before doing > ioctl(UFFDIO_WP) to make sure zero pages will be repoted too, but that flag > should be more efficient. Is this discussion public? For what application you were looking into this? I'll dig down to see how can I contribute to it. > >> >>> >>>> >>>> static inline bool is_pmd_written(pmd_t pmd) >>>> { >>>> if ((pmd_present(pmd) && pmd_uffd_wp(pmd)) || >>>> (is_swap_pmd(pmd) && pmd_swp_uffd_wp(pmd))) >>>> return false; >>>> return (pmd_present(pmd) || is_swap_pmd(pmd)); >>>> } >>> >>> [...] >>> >>>>>> + bitmap = cur & p->return_mask; >>>>>> + if (cpy && bitmap) { >>>>>> + if ((prev->len) && (prev->bitmap == bitmap) && >>>>>> + (prev->start + prev->len * PAGE_SIZE == addr)) { >>>>>> + prev->len += len; >>>>>> + p->found_pages += len; >>>>>> + } else if (p->vec_index < p->vec_len) { >>>>>> + if (prev->len) { >>>>>> + memcpy(&p->vec[p->vec_index], prev, sizeof(struct page_region)); >>>>>> + p->vec_index++; >>>>>> + } >>>>> >>>>> IIUC you can have: >>>>> >>>>> int pagemap_scan_deposit(p) >>>>> { >>>>> if (p->vec_index >= p->vec_len) >>>>> return -ENOSPC; >>>>> >>>>> if (p->prev->len) { >>>>> memcpy(&p->vec[p->vec_index], prev, sizeof(struct page_region)); >>>>> p->vec_index++; >>>>> } >>>>> >>>>> return 0; >>>>> } >>>>> >>>>> Then call it here. I think it can also be called below to replace >>>>> export_prev_to_out(). >>>> No this isn't possible. We fill up prev until the next range doesn't merge >>>> with it. At that point, we put prev into the output buffer and new range is >>>> put into prev. Now that we have shifted to smaller page walks of <= 512 >>>> entries. We want to visit all ranges before finally putting the prev to >>>> output. Sorry to have this some what complex method. The problem is that we >>>> want to merge the consective matching regions into one entry in the output. >>>> So to achieve this among multiple different page walks, the prev is being used. >>>> >>>> Lets suppose we want to visit memory from 0x7FFF00000000 to 7FFF00400000 >>>> having length of 1024 pages and all of the memory has been written. >>>> walk_page_range() will be called 2 times. In the first call, prev will be >>>> set having length of 512. In second call, prev will be updated to 1024 as >>>> the previous range stored in prev could be extended. After this, the prev >>>> will be stored to the user output buffer consuming only 1 struct of page_range. >>>> >>>> If we store prev back to output memory in every walk_page_range() call, we >>>> wouldn't get 1 struct of page_range with length 1024. Instead we would get >>>> 2 elements of page_range structs with half the length. >>> >>> I didn't mean to merge PREV for each pgtable walk. What I meant is I think >>> with such a pagemap_scan_deposit() you can rewrite it as: >>> >>> if (cpy && bitmap) { >>> if ((prev->len) && (prev->bitmap == bitmap) && >>> (prev->start + prev->len * PAGE_SIZE == addr)) { >>> prev->len += len; >>> p->found_pages += len; >>> } else { >>> if (pagemap_scan_deposit(p)) >>> return -ENOSPC; >>> prev->start = addr; >>> prev->len = len; >>> prev->bitmap = bitmap; >>> p->found_pages += len; >>> } >>> } >>> >>> Then you can reuse pagemap_scan_deposit() when before returning to >>> userspace, just to flush PREV to p->vec properly in a single helper. >>> It also makes the code slightly easier to read. >> Yeah, this would have worked as you have described. But in >> pagemap_scan_output(), we are flushing prev to p->vec. But later in >> export_prev_to_out() we need to flush prev to user_memory directly. > > I think there's a loop to copy_to_user(). Could you use the new helper so > the copy_to_user() loop will work without export_prev_to_out()? > > I really hope we can get rid of export_prev_to_out(). Thanks, I truly understand how you feel about export_prev_to_out(). It is really difficult to understand. Even I had to made a hard try to come up with the current code to avoid consuming a lot of kernel's memory while giving user the compact output. I can surely map both of these with a dirty looking macro. But I'm unable to find a decent macro to replace these. I think I'll put a comment some where to explain whats going-on.
On Wed, Feb 15, 2023 at 03:03:09PM +0500, Muhammad Usama Anjum wrote: > On 2/15/23 1:59 AM, Peter Xu wrote: > [..] > >>>> static inline bool is_pte_written(pte_t pte) > >>>> { > >>>> if ((pte_present(pte) && pte_uffd_wp(pte)) || > >>>> (pte_swp_uffd_wp_any(pte))) > >>>> return false; > >>>> return (pte_present(pte) || is_swap_pte(pte)); > >>>> } > >>> > >>> Could you explain why you don't want to return dirty for !present? A page > >>> can be written then swapped out. Don't you want to know that happened > >>> (from dirty tracking POV)? > >>> > >>> The code looks weird to me too.. We only have three types of ptes: (1) > >>> present, (2) swap, (3) none. > >>> > >>> Then, "(pte_present() || is_swap_pte())" is the same as !pte_none(). Is > >>> that what you're really looking for? > >> Yes, this is what I've been trying to do. I'll use !pte_none() to make it > >> simpler. > > > > Ah I think I see what you wanted to do now.. But I'm afraid it won't work > > for all cases. > > > > So IIUC the problem is anon pte can be empty, but since uffd-wp bit doesn't > > persist on anon (but none) ptes, then we got it lost and we cannot identify > > it from pages being written. Your solution will solve problem for > > anonymous, but I think it'll break file memories. > > > > Example: > > > > Consider one shmem page that got mapped, write protected (using UFFDIO_WP > > ioctl), written again (removing uffd-wp bit automatically), then zapped. > > The pte will be pte_none() but it's actually written, afaiu. > > > > Maybe it's time we should introduce UFFD_FEATURE_WP_ZEROPAGE, so we'll need > > to install pte markers for anonymous too (then it will work similarly like > > shmem/hugetlbfs, that we'll report writting to zero pages), then you'll > > need to have the new UFFD_FEATURE_WP_ASYNC depend on it. With that I think > > you can keep using the old check and it should start to work. > > > > Please let me know if my understanding is correct above. > Thank you for identifying it. Your understanding seems on point. I'll have > research things up about PTE Markers. I'm looking at your patches about it > [1]. Can you refer me to "mm alignment sessions" discussion in form of > presentation or if any transcript is available? No worry now, after a second thought I think zero page is better than pte markers, and I've got a patch that works for it here by injecting zero pages for anonymous: https://lore.kernel.org/all/20230215210257.224243-1-peterx@redhat.com/ I think we'd also better to enforce your new WP_ASYNC feature bit to depend on this one, so fail the UFFDIO_API if WP_ASYNC && !WP_ZEROPAGE. Could you please try by rebasing your work upon this one? Hope it'll work for you already. Note again that you'll need to go back to the old is_pte|pmd_written() to make things work always, I think. [...] > I truly understand how you feel about export_prev_to_out(). It is really > difficult to understand. Even I had to made a hard try to come up with the > current code to avoid consuming a lot of kernel's memory while giving user > the compact output. I can surely map both of these with a dirty looking > macro. But I'm unable to find a decent macro to replace these. I think I'll > put a comment some where to explain whats going-on. So maybe I still missed something? I'll read the new version when it comes. Thanks,
On Thu, Feb 02, 2023 at 04:29:12PM +0500, Muhammad Usama Anjum wrote: > This IOCTL, PAGEMAP_SCAN on pagemap file can be used to get and/or clear > the info about page table entries. The following operations are supported > in this ioctl: > - Get the information if the pages have been written-to (PAGE_IS_WRITTEN), > file mapped (PAGE_IS_FILE), present (PAGE_IS_PRESENT) or swapped > (PAGE_IS_SWAPPED). > - Write-protect the pages (PAGEMAP_WP_ENGAGE) to start finding which > pages have been written-to. > - Find pages which have been written-to and write protect the pages > (atomic PAGE_IS_WRITTEN + PAGEMAP_WP_ENGAGE) > > To get information about which pages have been written-to and/or write > protect the pages, following must be performed first in order: > - The userfaultfd file descriptor is created with userfaultfd syscall. > - The UFFD_FEATURE_WP_ASYNC feature is set by UFFDIO_API IOCTL. > - The memory range is registered with UFFDIO_REGISTER_MODE_WP mode > through UFFDIO_REGISTER IOCTL. > Then the any part of the registered memory or the whole memory region > can be write protected using the UFFDIO_WRITEPROTECT IOCTL or > PAGEMAP_SCAN IOCTL. > > struct pagemap_scan_args is used as the argument of the IOCTL. In this > struct: > - The range is specified through start and len. > - The output buffer of struct page_region array and size is specified as > vec and vec_len. > - The optional maximum requested pages are specified in the max_pages. > - The flags can be specified in the flags field. The PAGEMAP_WP_ENGAGE > is the only added flag at this time. > - The masks are specified in required_mask, anyof_mask, excluded_ mask > and return_mask. > > This IOCTL can be extended to get information about more PTE bits. This > IOCTL doesn't support hugetlbs at the moment. No information about > hugetlb can be obtained. This patch has evolved from a basic patch from > Gabriel Krisman Bertazi. > > Signed-off-by: Muhammad Usama Anjum <usama.anjum@collabora.com> > --- > Changes in v10: > - move changes in tools/include/uapi/linux/fs.h to separate patch > - update commit message > > Change in v8: > - Correct is_pte_uffd_wp() > - Improve readability and error checks > - Remove some un-needed code > > Changes in v7: > - Rebase on top of latest next > - Fix some corner cases > - Base soft-dirty on the uffd wp async > - Update the terminologies > - Optimize the memory usage inside the ioctl > > Changes in v6: > - Rename variables and update comments > - Make IOCTL independent of soft_dirty config > - Change masks and bitmap type to _u64 > - Improve code quality > > Changes in v5: > - Remove tlb flushing even for clear operation > > Changes in v4: > - Update the interface and implementation > > Changes in v3: > - Tighten the user-kernel interface by using explicit types and add more > error checking > > Changes in v2: > - Convert the interface from syscall to ioctl > - Remove pidfd support as it doesn't make sense in ioctl > --- > fs/proc/task_mmu.c | 290 ++++++++++++++++++++++++++++++++++++++++ > include/uapi/linux/fs.h | 50 +++++++ > 2 files changed, 340 insertions(+) > > diff --git a/fs/proc/task_mmu.c b/fs/proc/task_mmu.c > index e35a0398db63..c6bde19d63d9 100644 > --- a/fs/proc/task_mmu.c > +++ b/fs/proc/task_mmu.c > @@ -19,6 +19,7 @@ > #include <linux/shmem_fs.h> > #include <linux/uaccess.h> > #include <linux/pkeys.h> > +#include <linux/minmax.h> > > #include <asm/elf.h> > #include <asm/tlb.h> > @@ -1135,6 +1136,22 @@ static inline void clear_soft_dirty(struct vm_area_struct *vma, > } > #endif > > +static inline bool is_pte_uffd_wp(pte_t pte) > +{ > + if ((pte_present(pte) && pte_uffd_wp(pte)) || > + (pte_swp_uffd_wp_any(pte))) > + return true; > + return false; > +} > + > +static inline bool is_pmd_uffd_wp(pmd_t pmd) > +{ > + if ((pmd_present(pmd) && pmd_uffd_wp(pmd)) || > + (is_swap_pmd(pmd) && pmd_swp_uffd_wp(pmd))) > + return true; > + return false; > +} > + > #if defined(CONFIG_MEM_SOFT_DIRTY) && defined(CONFIG_TRANSPARENT_HUGEPAGE) > static inline void clear_soft_dirty_pmd(struct vm_area_struct *vma, > unsigned long addr, pmd_t *pmdp) > @@ -1763,11 +1780,284 @@ static int pagemap_release(struct inode *inode, struct file *file) > return 0; > } > > +#define PAGEMAP_BITS_ALL (PAGE_IS_WRITTEN | PAGE_IS_FILE | \ > + PAGE_IS_PRESENT | PAGE_IS_SWAPPED) > +#define PAGEMAP_NON_WRITTEN_BITS (PAGE_IS_FILE | PAGE_IS_PRESENT | PAGE_IS_SWAPPED) > +#define IS_WP_ENGAGE_OP(a) (a->flags & PAGEMAP_WP_ENGAGE) > +#define IS_GET_OP(a) (a->vec) > +#define HAS_NO_SPACE(p) (p->max_pages && (p->found_pages == p->max_pages)) > + > +#define PAGEMAP_SCAN_BITMAP(wt, file, present, swap) \ > + (wt | file << 1 | present << 2 | swap << 3) > +#define IS_WT_REQUIRED(a) \ > + ((a->required_mask & PAGE_IS_WRITTEN) || \ > + (a->anyof_mask & PAGE_IS_WRITTEN)) All these macros are specific to pagemap_scan_ioctl() and should be namespaced accordingly, e.g. PM_SCAN_BITS_ALL, PM_SCAN_BITMAP etc. Also, IS_<opname>_OP() will be more readable as PM_SCAN_OP_IS_<opname> and I'd suggest to open code IS_WP_ENGAGE_OP() and IS_GET_OP() and make HAS_NO_SPACE() and IS_WT_REQUIRED() static inlines rather than macros. And I'd also make IS_GET_OP() more explicit by defining a PAGEMAP_WP_GET or similar flag rather than using arg->vec. > + > +struct pagemap_scan_private { > + struct page_region *vec; > + struct page_region prev; > + unsigned long vec_len, vec_index; > + unsigned int max_pages, found_pages, flags; > + unsigned long required_mask, anyof_mask, excluded_mask, return_mask; > +}; > + > +static int pagemap_scan_test_walk(unsigned long start, unsigned long end, struct mm_walk *walk) Please keep the lines under 80 characters limit. > +{ > + struct pagemap_scan_private *p = walk->private; > + struct vm_area_struct *vma = walk->vma; > + > + if (IS_WT_REQUIRED(p) && !userfaultfd_wp(vma) && !userfaultfd_wp_async(vma)) > + return -EPERM; > + if (vma->vm_flags & VM_PFNMAP) > + return 1; > + return 0; > +} > + > +static inline int pagemap_scan_output(bool wt, bool file, bool pres, bool swap, > + struct pagemap_scan_private *p, unsigned long addr, > + unsigned int len) > +{ > + unsigned long bitmap, cur = PAGEMAP_SCAN_BITMAP(wt, file, pres, swap); > + bool cpy = true; > + struct page_region *prev = &p->prev; > + > + if (HAS_NO_SPACE(p)) > + return -ENOSPC; > + > + if (p->max_pages && p->found_pages + len >= p->max_pages) > + len = p->max_pages - p->found_pages; > + if (!len) > + return -EINVAL; > + > + if (p->required_mask) > + cpy = ((p->required_mask & cur) == p->required_mask); > + if (cpy && p->anyof_mask) > + cpy = (p->anyof_mask & cur); > + if (cpy && p->excluded_mask) > + cpy = !(p->excluded_mask & cur); > + bitmap = cur & p->return_mask; > + if (cpy && bitmap) { > + if ((prev->len) && (prev->bitmap == bitmap) && > + (prev->start + prev->len * PAGE_SIZE == addr)) { > + prev->len += len; > + p->found_pages += len; > + } else if (p->vec_index < p->vec_len) { > + if (prev->len) { > + memcpy(&p->vec[p->vec_index], prev, sizeof(struct page_region)); > + p->vec_index++; > + } > + prev->start = addr; > + prev->len = len; > + prev->bitmap = bitmap; > + p->found_pages += len; > + } else { > + return -ENOSPC; > + } > + } > + return 0; Please don't save on empty lines. Empty lines between logical pieces improve readability. > +} > + > +static inline int export_prev_to_out(struct pagemap_scan_private *p, struct page_region __user *vec, > + unsigned long *vec_index) > +{ > + struct page_region *prev = &p->prev; > + > + if (prev->len) { > + if (copy_to_user(&vec[*vec_index], prev, sizeof(struct page_region))) > + return -EFAULT; > + p->vec_index++; > + (*vec_index)++; > + prev->len = 0; > + } > + return 0; > +} > + > +static inline int pagemap_scan_pmd_entry(pmd_t *pmd, unsigned long start, > + unsigned long end, struct mm_walk *walk) > +{ > + struct pagemap_scan_private *p = walk->private; > + struct vm_area_struct *vma = walk->vma; > + unsigned long addr = end; > + spinlock_t *ptl; > + int ret = 0; > + pte_t *pte; > + > +#ifdef CONFIG_TRANSPARENT_HUGEPAGE > + ptl = pmd_trans_huge_lock(pmd, vma); > + if (ptl) { > + bool pmd_wt; > + > + pmd_wt = !is_pmd_uffd_wp(*pmd); > + /* > + * Break huge page into small pages if operation needs to be performed is > + * on a portion of the huge page. > + */ > + if (pmd_wt && IS_WP_ENGAGE_OP(p) && (end - start < HPAGE_SIZE)) { > + spin_unlock(ptl); > + split_huge_pmd(vma, pmd, start); > + goto process_smaller_pages; > + } > + if (IS_GET_OP(p)) > + ret = pagemap_scan_output(pmd_wt, vma->vm_file, pmd_present(*pmd), > + is_swap_pmd(*pmd), p, start, > + (end - start)/PAGE_SIZE); > + spin_unlock(ptl); > + if (!ret) { > + if (pmd_wt && IS_WP_ENGAGE_OP(p)) > + uffd_wp_range(walk->mm, vma, start, HPAGE_SIZE, true); > + } > + return ret; > + } > +process_smaller_pages: > + if (pmd_trans_unstable(pmd)) > + return 0; > +#endif /* CONFIG_TRANSPARENT_HUGEPAGE */ > + > + pte = pte_offset_map_lock(vma->vm_mm, pmd, start, &ptl); > + if (IS_GET_OP(p)) { > + for (addr = start; addr < end; pte++, addr += PAGE_SIZE) { > + ret = pagemap_scan_output(!is_pte_uffd_wp(*pte), vma->vm_file, > + pte_present(*pte), is_swap_pte(*pte), p, addr, 1); > + if (ret) > + break; > + } > + } > + pte_unmap_unlock(pte - 1, ptl); > + if ((!ret || ret == -ENOSPC) && IS_WP_ENGAGE_OP(p) && (addr - start)) > + uffd_wp_range(walk->mm, vma, start, addr - start, true); > + > + cond_resched(); > + return ret; > +} > + > +static int pagemap_scan_pte_hole(unsigned long addr, unsigned long end, int depth, > + struct mm_walk *walk) > +{ > + struct pagemap_scan_private *p = walk->private; > + struct vm_area_struct *vma = walk->vma; > + int ret = 0; > + > + if (vma) > + ret = pagemap_scan_output(false, vma->vm_file, false, false, p, addr, > + (end - addr)/PAGE_SIZE); > + return ret; > +} > + > +/* No hugetlb support is present. */ > +static const struct mm_walk_ops pagemap_scan_ops = { > + .test_walk = pagemap_scan_test_walk, > + .pmd_entry = pagemap_scan_pmd_entry, > + .pte_hole = pagemap_scan_pte_hole, > +}; > + > +static long do_pagemap_cmd(struct mm_struct *mm, struct pagemap_scan_arg *arg) > +{ > + unsigned long empty_slots, vec_index = 0; > + unsigned long __user start, end; > + unsigned long __start, __end; > + struct page_region __user *vec; > + struct pagemap_scan_private p; > + int ret = 0; > + > + start = (unsigned long)untagged_addr(arg->start); > + vec = (struct page_region *)(unsigned long)untagged_addr(arg->vec); > + > + /* Validate memory ranges */ > + if ((!IS_ALIGNED(start, PAGE_SIZE)) || (!access_ok((void __user *)start, arg->len))) > + return -EINVAL; > + if (IS_GET_OP(arg) && ((arg->vec_len == 0) || > + (!access_ok((void __user *)vec, arg->vec_len * sizeof(struct page_region))))) > + return -EINVAL; > + > + /* Detect illegal flags and masks */ > + if ((arg->flags & ~PAGEMAP_WP_ENGAGE) || (arg->required_mask & ~PAGEMAP_BITS_ALL) || > + (arg->anyof_mask & ~PAGEMAP_BITS_ALL) || (arg->excluded_mask & ~PAGEMAP_BITS_ALL) || > + (arg->return_mask & ~PAGEMAP_BITS_ALL)) > + return -EINVAL; > + if (IS_GET_OP(arg) && ((!arg->required_mask && !arg->anyof_mask && !arg->excluded_mask) || > + !arg->return_mask)) > + return -EINVAL; > + /* The non-WT flags cannot be obtained if PAGEMAP_WP_ENGAGE is also specified. */ > + if (IS_WP_ENGAGE_OP(arg) && ((arg->required_mask & PAGEMAP_NON_WRITTEN_BITS) || > + (arg->anyof_mask & PAGEMAP_NON_WRITTEN_BITS))) > + return -EINVAL; I'd split argument validation into a separate function and split the OR'ed conditions into separate if statements, e.g bool pm_scan_args_valid(struct pagemap_scan_arg *arg) { if (IS_GET_OP(arg)) { if (!arg->return_mask) return false; if (!arg->required_mask && !arg->anyof_mask && !arg->excluded_mask) return false; } /* ... */ return true; } > + > + end = start + arg->len; > + p.max_pages = arg->max_pages; > + p.found_pages = 0; > + p.flags = arg->flags; > + p.required_mask = arg->required_mask; > + p.anyof_mask = arg->anyof_mask; > + p.excluded_mask = arg->excluded_mask; > + p.return_mask = arg->return_mask; > + p.prev.len = 0; > + p.vec_len = (PAGEMAP_WALK_SIZE >> PAGE_SHIFT); > + > + if (IS_GET_OP(arg)) { > + p.vec = kmalloc_array(p.vec_len, sizeof(struct page_region), GFP_KERNEL); > + if (!p.vec) > + return -ENOMEM; > + } else { > + p.vec = NULL; > + } > + __start = __end = start; > + while (!ret && __end < end) { > + p.vec_index = 0; > + empty_slots = arg->vec_len - vec_index; > + if (p.vec_len > empty_slots) > + p.vec_len = empty_slots; > + > + __end = (__start + PAGEMAP_WALK_SIZE) & PAGEMAP_WALK_MASK; > + if (__end > end) > + __end = end; > + > + mmap_read_lock(mm); > + ret = walk_page_range(mm, __start, __end, &pagemap_scan_ops, &p); > + mmap_read_unlock(mm); > + if (!(!ret || ret == -ENOSPC)) > + goto free_data; > + > + __start = __end; > + if (IS_GET_OP(arg) && p.vec_index) { > + if (copy_to_user(&vec[vec_index], p.vec, > + p.vec_index * sizeof(struct page_region))) { > + ret = -EFAULT; > + goto free_data; > + } > + vec_index += p.vec_index; > + } > + } > + ret = export_prev_to_out(&p, vec, &vec_index); > + if (!ret) > + ret = vec_index; > +free_data: > + if (IS_GET_OP(arg)) > + kfree(p.vec); > + > + return ret; > +} > + > +static long pagemap_scan_ioctl(struct file *file, unsigned int cmd, unsigned long arg) > +{ > + struct pagemap_scan_arg __user *uarg = (struct pagemap_scan_arg __user *)arg; > + struct mm_struct *mm = file->private_data; > + struct pagemap_scan_arg argument; > + > + if (cmd == PAGEMAP_SCAN) { > + if (copy_from_user(&argument, uarg, sizeof(struct pagemap_scan_arg))) > + return -EFAULT; > + return do_pagemap_cmd(mm, &argument); > + } > + return -EINVAL; > +} > + > const struct file_operations proc_pagemap_operations = { > .llseek = mem_lseek, /* borrow this */ > .read = pagemap_read, > .open = pagemap_open, > .release = pagemap_release, > + .unlocked_ioctl = pagemap_scan_ioctl, > + .compat_ioctl = pagemap_scan_ioctl, > }; > #endif /* CONFIG_PROC_PAGE_MONITOR */ > > diff --git a/include/uapi/linux/fs.h b/include/uapi/linux/fs.h > index b7b56871029c..1ae9a8684b48 100644 > --- a/include/uapi/linux/fs.h > +++ b/include/uapi/linux/fs.h > @@ -305,4 +305,54 @@ typedef int __bitwise __kernel_rwf_t; > #define RWF_SUPPORTED (RWF_HIPRI | RWF_DSYNC | RWF_SYNC | RWF_NOWAIT |\ > RWF_APPEND) > > +/* Pagemap ioctl */ > +#define PAGEMAP_SCAN _IOWR('f', 16, struct pagemap_scan_arg) > + > +/* Bits are set in the bitmap of the page_region and masks in pagemap_scan_args */ > +#define PAGE_IS_WRITTEN (1 << 0) > +#define PAGE_IS_FILE (1 << 1) > +#define PAGE_IS_PRESENT (1 << 2) > +#define PAGE_IS_SWAPPED (1 << 3) > + > +/* > + * struct page_region - Page region with bitmap flags > + * @start: Start of the region > + * @len: Length of the region > + * bitmap: Bits sets for the region > + */ > +struct page_region { > + __u64 start; > + __u64 len; > + __u64 bitmap; > +}; > + > +/* > + * struct pagemap_scan_arg - Pagemap ioctl argument > + * @start: Starting address of the region > + * @len: Length of the region (All the pages in this length are included) > + * @vec: Address of page_region struct array for output > + * @vec_len: Length of the page_region struct array > + * @max_pages: Optional max return pages > + * @flags: Flags for the IOCTL > + * @required_mask: Required mask - All of these bits have to be set in the PTE > + * @anyof_mask: Any mask - Any of these bits are set in the PTE > + * @excluded_mask: Exclude mask - None of these bits are set in the PTE > + * @return_mask: Bits that are to be reported in page_region > + */ > +struct pagemap_scan_arg { > + __u64 start; > + __u64 len; > + __u64 vec; > + __u64 vec_len; > + __u32 max_pages; > + __u32 flags; > + __u64 required_mask; > + __u64 anyof_mask; > + __u64 excluded_mask; > + __u64 return_mask; > +}; > + > +/* Special flags */ > +#define PAGEMAP_WP_ENGAGE (1 << 0) > + > #endif /* _UAPI_LINUX_FS_H */ > -- > 2.30.2 >
On 2/16/23 2:12 AM, Peter Xu wrote: > On Wed, Feb 15, 2023 at 03:03:09PM +0500, Muhammad Usama Anjum wrote: >> On 2/15/23 1:59 AM, Peter Xu wrote: >> [..] >>>>>> static inline bool is_pte_written(pte_t pte) >>>>>> { >>>>>> if ((pte_present(pte) && pte_uffd_wp(pte)) || >>>>>> (pte_swp_uffd_wp_any(pte))) >>>>>> return false; >>>>>> return (pte_present(pte) || is_swap_pte(pte)); >>>>>> } >>>>> >>>>> Could you explain why you don't want to return dirty for !present? A page >>>>> can be written then swapped out. Don't you want to know that happened >>>>> (from dirty tracking POV)? >>>>> >>>>> The code looks weird to me too.. We only have three types of ptes: (1) >>>>> present, (2) swap, (3) none. >>>>> >>>>> Then, "(pte_present() || is_swap_pte())" is the same as !pte_none(). Is >>>>> that what you're really looking for? >>>> Yes, this is what I've been trying to do. I'll use !pte_none() to make it >>>> simpler. >>> >>> Ah I think I see what you wanted to do now.. But I'm afraid it won't work >>> for all cases. >>> >>> So IIUC the problem is anon pte can be empty, but since uffd-wp bit doesn't >>> persist on anon (but none) ptes, then we got it lost and we cannot identify >>> it from pages being written. Your solution will solve problem for >>> anonymous, but I think it'll break file memories. >>> >>> Example: >>> >>> Consider one shmem page that got mapped, write protected (using UFFDIO_WP >>> ioctl), written again (removing uffd-wp bit automatically), then zapped. >>> The pte will be pte_none() but it's actually written, afaiu. >>> >>> Maybe it's time we should introduce UFFD_FEATURE_WP_ZEROPAGE, so we'll need >>> to install pte markers for anonymous too (then it will work similarly like >>> shmem/hugetlbfs, that we'll report writting to zero pages), then you'll >>> need to have the new UFFD_FEATURE_WP_ASYNC depend on it. With that I think >>> you can keep using the old check and it should start to work. >>> >>> Please let me know if my understanding is correct above. >> Thank you for identifying it. Your understanding seems on point. I'll have >> research things up about PTE Markers. I'm looking at your patches about it >> [1]. Can you refer me to "mm alignment sessions" discussion in form of >> presentation or if any transcript is available? > > No worry now, after a second thought I think zero page is better than pte > markers, and I've got a patch that works for it here by injecting zero > pages for anonymous: > > https://lore.kernel.org/all/20230215210257.224243-1-peterx@redhat.com/ > > I think we'd also better to enforce your new WP_ASYNC feature bit to depend > on this one, so fail the UFFDIO_API if WP_ASYNC && !WP_ZEROPAGE. > > Could you please try by rebasing your work upon this one? Hope it'll work > for you already. Note again that you'll need to go back to the old > is_pte|pmd_written() to make things work always, I think. Thank you so much for sending the ZEROPAGE patch. I've rebased my patches on top of it and my all tests for anon memory are passing. Now we don't need to touch the page before engaging wp. This is what we wanted to achieve. So !wp flag can be easily translated to soft-dirty flag (is_pte_soft_dirty = is_pte_wp). I've only a few file mem and shmem tests. I'll write more tests. > > [...] > >> I truly understand how you feel about export_prev_to_out(). It is really >> difficult to understand. Even I had to made a hard try to come up with the >> current code to avoid consuming a lot of kernel's memory while giving user >> the compact output. I can surely map both of these with a dirty looking >> macro. But I'm unable to find a decent macro to replace these. I think I'll >> put a comment some where to explain whats going-on. > > So maybe I still missed something? I'll read the new version when it comes. Lets reconvene in next patches if you feel like they can be improved. > > Thanks, >
On Thu, 2 Feb 2023 at 12:30, Muhammad Usama Anjum <usama.anjum@collabora.com> wrote: [...] > - The masks are specified in required_mask, anyof_mask, excluded_ mask > and return_mask. [...] May I suggest a slightly modified interface for the flags? As I understand, the return_mask is what is applied to page flags to aggregate the list. This is a separate thing, and I think it doesn't need changes except maybe an improvement in the documentation and visual distinction. For the page-selection mechanism, currently required_mask and excluded_mask have conflicting responsibilities. I suggest to rework that to: 1. negated_flags: page flags which are to be negated before applying the page selection using following masks; 2. required_flags: flags which all have to be set in the (negation-applied) page flags; 3. anyof_flags: flags of which at least one has to be set in the (negation-applied) page flags; IOW, the resulting algorithm would be: tested_flags = page_flags ^ negated_flags; if (~tested_flags & required_flags) skip page; if (!(tested_flags & anyof_flags)) skip_page; aggregate_on(page_flags & return_flags); Best Regards Michał Mirosław
On 2/2/23 1:29 PM, Muhammad Usama Anjum wrote: > This IOCTL, PAGEMAP_SCAN on pagemap file can be used to get and/or clear > the info about page table entries. The following operations are supported > in this ioctl: > - Get the information if the pages have been written-to (PAGE_IS_WRITTEN), > file mapped (PAGE_IS_FILE), present (PAGE_IS_PRESENT) or swapped > (PAGE_IS_SWAPPED). > - Write-protect the pages (PAGEMAP_WP_ENGAGE) to start finding which > pages have been written-to. > - Find pages which have been written-to and write protect the pages > (atomic PAGE_IS_WRITTEN + PAGEMAP_WP_ENGAGE) > > To get information about which pages have been written-to and/or write > protect the pages, following must be performed first in order: > - The userfaultfd file descriptor is created with userfaultfd syscall. > - The UFFD_FEATURE_WP_ASYNC feature is set by UFFDIO_API IOCTL. > - The memory range is registered with UFFDIO_REGISTER_MODE_WP mode > through UFFDIO_REGISTER IOCTL. > Then the any part of the registered memory or the whole memory region > can be write protected using the UFFDIO_WRITEPROTECT IOCTL or > PAGEMAP_SCAN IOCTL. > > struct pagemap_scan_args is used as the argument of the IOCTL. In this > struct: > - The range is specified through start and len. > - The output buffer of struct page_region array and size is specified as > vec and vec_len. > - The optional maximum requested pages are specified in the max_pages. > - The flags can be specified in the flags field. The PAGEMAP_WP_ENGAGE > is the only added flag at this time. > - The masks are specified in required_mask, anyof_mask, excluded_ mask > and return_mask. > > This IOCTL can be extended to get information about more PTE bits. This > IOCTL doesn't support hugetlbs at the moment. No information about > hugetlb can be obtained. This patch has evolved from a basic patch from > Gabriel Krisman Bertazi. I was not involved before, so I am not commenting on the API and code to avoid making unhelpful noise. Having said that, some things in the code seem quite dirty and make understanding the code hard to read. > Signed-off-by: Muhammad Usama Anjum <usama.anjum@collabora.com> > --- > Changes in v10: > - move changes in tools/include/uapi/linux/fs.h to separate patch > - update commit message > > Change in v8: > - Correct is_pte_uffd_wp() > - Improve readability and error checks > - Remove some un-needed code > > Changes in v7: > - Rebase on top of latest next > - Fix some corner cases > - Base soft-dirty on the uffd wp async > - Update the terminologies > - Optimize the memory usage inside the ioctl > > Changes in v6: > - Rename variables and update comments > - Make IOCTL independent of soft_dirty config > - Change masks and bitmap type to _u64 > - Improve code quality > > Changes in v5: > - Remove tlb flushing even for clear operation > > Changes in v4: > - Update the interface and implementation > > Changes in v3: > - Tighten the user-kernel interface by using explicit types and add more > error checking > > Changes in v2: > - Convert the interface from syscall to ioctl > - Remove pidfd support as it doesn't make sense in ioctl > --- > fs/proc/task_mmu.c | 290 ++++++++++++++++++++++++++++++++++++++++ > include/uapi/linux/fs.h | 50 +++++++ > 2 files changed, 340 insertions(+) > > diff --git a/fs/proc/task_mmu.c b/fs/proc/task_mmu.c > index e35a0398db63..c6bde19d63d9 100644 > --- a/fs/proc/task_mmu.c > +++ b/fs/proc/task_mmu.c > @@ -19,6 +19,7 @@ > #include <linux/shmem_fs.h> > #include <linux/uaccess.h> > #include <linux/pkeys.h> > +#include <linux/minmax.h> > > #include <asm/elf.h> > #include <asm/tlb.h> > @@ -1135,6 +1136,22 @@ static inline void clear_soft_dirty(struct vm_area_struct *vma, > } > #endif > > +static inline bool is_pte_uffd_wp(pte_t pte) > +{ > + if ((pte_present(pte) && pte_uffd_wp(pte)) || > + (pte_swp_uffd_wp_any(pte))) > + return true; > + return false; > +} > + > +static inline bool is_pmd_uffd_wp(pmd_t pmd) > +{ > + if ((pmd_present(pmd) && pmd_uffd_wp(pmd)) || > + (is_swap_pmd(pmd) && pmd_swp_uffd_wp(pmd))) > + return true; > + return false; > +} > + > #if defined(CONFIG_MEM_SOFT_DIRTY) && defined(CONFIG_TRANSPARENT_HUGEPAGE) > static inline void clear_soft_dirty_pmd(struct vm_area_struct *vma, > unsigned long addr, pmd_t *pmdp) > @@ -1763,11 +1780,284 @@ static int pagemap_release(struct inode *inode, struct file *file) > return 0; > } > > +#define PAGEMAP_BITS_ALL (PAGE_IS_WRITTEN | PAGE_IS_FILE | \ > + PAGE_IS_PRESENT | PAGE_IS_SWAPPED) > +#define PAGEMAP_NON_WRITTEN_BITS (PAGE_IS_FILE | PAGE_IS_PRESENT | PAGE_IS_SWAPPED) > +#define IS_WP_ENGAGE_OP(a) (a->flags & PAGEMAP_WP_ENGAGE) > +#define IS_GET_OP(a) (a->vec) > +#define HAS_NO_SPACE(p) (p->max_pages && (p->found_pages == p->max_pages)) I think that in general it is better to have an inline function instead of macros when possible, as it is clearer and checks types. Anyhow, IMHO most of these macros are better be open-coded. > + > +#define PAGEMAP_SCAN_BITMAP(wt, file, present, swap) \ > + (wt | file << 1 | present << 2 | swap << 3) > +#define IS_WT_REQUIRED(a) \ > + ((a->required_mask & PAGE_IS_WRITTEN) || \ > + (a->anyof_mask & PAGE_IS_WRITTEN)) > + > +struct pagemap_scan_private { > + struct page_region *vec; > + struct page_region prev; > + unsigned long vec_len, vec_index; > + unsigned int max_pages, found_pages, flags; > + unsigned long required_mask, anyof_mask, excluded_mask, return_mask; > +}; > + > +static int pagemap_scan_test_walk(unsigned long start, unsigned long end, struct mm_walk *walk) > +{ > + struct pagemap_scan_private *p = walk->private; > + struct vm_area_struct *vma = walk->vma; > + > + if (IS_WT_REQUIRED(p) && !userfaultfd_wp(vma) && !userfaultfd_wp_async(vma)) > + return -EPERM; > + if (vma->vm_flags & VM_PFNMAP) > + return 1; > + return 0; > +} > + > +static inline int pagemap_scan_output(bool wt, bool file, bool pres, bool swap, > + struct pagemap_scan_private *p, unsigned long addr, > + unsigned int len) > +{ > + unsigned long bitmap, cur = PAGEMAP_SCAN_BITMAP(wt, file, pres, swap); > + bool cpy = true; > + struct page_region *prev = &p->prev; > + > + if (HAS_NO_SPACE(p)) > + return -ENOSPC; > + > + if (p->max_pages && p->found_pages + len >= p->max_pages) > + len = p->max_pages - p->found_pages; > + if (!len) > + return -EINVAL; > + > + if (p->required_mask) > + cpy = ((p->required_mask & cur) == p->required_mask); > + if (cpy && p->anyof_mask) > + cpy = (p->anyof_mask & cur); > + if (cpy && p->excluded_mask) > + cpy = !(p->excluded_mask & cur); > + bitmap = cur & p->return_mask; > + if (cpy && bitmap) { > + if ((prev->len) && (prev->bitmap == bitmap) && > + (prev->start + prev->len * PAGE_SIZE == addr)) { > + prev->len += len; The use of "len" both for bytes and pages is very confusing. Consider changing the name to n_pages or something similar. > + p->found_pages += len; > + } else if (p->vec_index < p->vec_len) { > + if (prev->len) { > + memcpy(&p->vec[p->vec_index], prev, sizeof(struct page_region)); > + p->vec_index++; > + } > + prev->start = addr; > + prev->len = len; > + prev->bitmap = bitmap; > + p->found_pages += len; > + } else { > + return -ENOSPC; > + } > + } > + return 0; > +} > + > +static inline int export_prev_to_out(struct pagemap_scan_private *p, struct page_region __user *vec, > + unsigned long *vec_index) > +{ > + struct page_region *prev = &p->prev; > + > + if (prev->len) { > + if (copy_to_user(&vec[*vec_index], prev, sizeof(struct page_region))) > + return -EFAULT; > + p->vec_index++; > + (*vec_index)++; > + prev->len = 0; > + } > + return 0; > +} > + > +static inline int pagemap_scan_pmd_entry(pmd_t *pmd, unsigned long start, > + unsigned long end, struct mm_walk *walk) > +{ > + struct pagemap_scan_private *p = walk->private; > + struct vm_area_struct *vma = walk->vma; > + unsigned long addr = end; > + spinlock_t *ptl; > + int ret = 0; > + pte_t *pte; > + > +#ifdef CONFIG_TRANSPARENT_HUGEPAGE > + ptl = pmd_trans_huge_lock(pmd, vma); > + if (ptl) { > + bool pmd_wt; > + > + pmd_wt = !is_pmd_uffd_wp(*pmd); > + /* > + * Break huge page into small pages if operation needs to be performed is > + * on a portion of the huge page. > + */ > + if (pmd_wt && IS_WP_ENGAGE_OP(p) && (end - start < HPAGE_SIZE)) { > + spin_unlock(ptl); > + split_huge_pmd(vma, pmd, start); > + goto process_smaller_pages; I think that such goto's are really confusing and should be avoided. And using 'else' (could have easily prevented the need for goto). It is not the best solution though, since I think it would have been better to invert the conditions. > + } > + if (IS_GET_OP(p)) > + ret = pagemap_scan_output(pmd_wt, vma->vm_file, pmd_present(*pmd), > + is_swap_pmd(*pmd), p, start, > + (end - start)/PAGE_SIZE); > + spin_unlock(ptl); > + if (!ret) { > + if (pmd_wt && IS_WP_ENGAGE_OP(p)) > + uffd_wp_range(walk->mm, vma, start, HPAGE_SIZE, true); > + } > + return ret; > + } > +process_smaller_pages: > + if (pmd_trans_unstable(pmd)) > + return 0; > +#endif /* CONFIG_TRANSPARENT_HUGEPAGE */ > + > + pte = pte_offset_map_lock(vma->vm_mm, pmd, start, &ptl); > + if (IS_GET_OP(p)) { > + for (addr = start; addr < end; pte++, addr += PAGE_SIZE) { > + ret = pagemap_scan_output(!is_pte_uffd_wp(*pte), vma->vm_file, > + pte_present(*pte), is_swap_pte(*pte), p, addr, 1); > + if (ret) > + break; > + } > + } > + pte_unmap_unlock(pte - 1, ptl); We might have not entered the loop and pte-1 would be wrong. > + if ((!ret || ret == -ENOSPC) && IS_WP_ENGAGE_OP(p) && (addr - start)) What does 'addr - start' mean? If you want to say they are not equal, why not say so? > + uffd_wp_range(walk->mm, vma, start, addr - start, true); > + > + cond_resched(); > + return ret; > +} > + > +static int pagemap_scan_pte_hole(unsigned long addr, unsigned long end, int depth, > + struct mm_walk *walk) > +{ > + struct pagemap_scan_private *p = walk->private; > + struct vm_area_struct *vma = walk->vma; > + int ret = 0; > + > + if (vma) > + ret = pagemap_scan_output(false, vma->vm_file, false, false, p, addr, > + (end - addr)/PAGE_SIZE); > + return ret; > +} > + > +/* No hugetlb support is present. */ > +static const struct mm_walk_ops pagemap_scan_ops = { > + .test_walk = pagemap_scan_test_walk, > + .pmd_entry = pagemap_scan_pmd_entry, > + .pte_hole = pagemap_scan_pte_hole, > +}; > + > +static long do_pagemap_cmd(struct mm_struct *mm, struct pagemap_scan_arg *arg) > +{ > + unsigned long empty_slots, vec_index = 0; > + unsigned long __user start, end; The whole point of __user (attribute) is to be assigned to pointers. > + unsigned long __start, __end; I think such names do not convey sufficient information. > + struct page_region __user *vec; > + struct pagemap_scan_private p; > + int ret = 0; > + > + start = (unsigned long)untagged_addr(arg->start); > + vec = (struct page_region *)(unsigned long)untagged_addr(arg->vec); > + > + /* Validate memory ranges */ > + if ((!IS_ALIGNED(start, PAGE_SIZE)) || (!access_ok((void __user *)start, arg->len))) > + return -EINVAL; > + if (IS_GET_OP(arg) && ((arg->vec_len == 0) || > + (!access_ok((void __user *)vec, arg->vec_len * sizeof(struct page_region))))) > + return -EINVAL; > + > + /* Detect illegal flags and masks */ > + if ((arg->flags & ~PAGEMAP_WP_ENGAGE) || (arg->required_mask & ~PAGEMAP_BITS_ALL) || > + (arg->anyof_mask & ~PAGEMAP_BITS_ALL) || (arg->excluded_mask & ~PAGEMAP_BITS_ALL) || > + (arg->return_mask & ~PAGEMAP_BITS_ALL)) Using bitwise or to check (arg->required_mask | arg->anyof_mask | arg->excluded_mask | arg->return_mask) & ~PAGE_MAP_BITS_ALL Would have been much cleaner, IMHO. > + return -EINVAL; > + if (IS_GET_OP(arg) && ((!arg->required_mask && !arg->anyof_mask && !arg->excluded_mask) || > + !arg->return_mask)) > + return -EINVAL; > + /* The non-WT flags cannot be obtained if PAGEMAP_WP_ENGAGE is also specified. */ > + if (IS_WP_ENGAGE_OP(arg) && ((arg->required_mask & PAGEMAP_NON_WRITTEN_BITS) || > + (arg->anyof_mask & PAGEMAP_NON_WRITTEN_BITS))) > + return -EINVAL; > + > + end = start + arg->len; > + p.max_pages = arg->max_pages; > + p.found_pages = 0; > + p.flags = arg->flags; > + p.required_mask = arg->required_mask; > + p.anyof_mask = arg->anyof_mask; > + p.excluded_mask = arg->excluded_mask; > + p.return_mask = arg->return_mask; > + p.prev.len = 0; > + p.vec_len = (PAGEMAP_WALK_SIZE >> PAGE_SHIFT); > + > + if (IS_GET_OP(arg)) { > + p.vec = kmalloc_array(p.vec_len, sizeof(struct page_region), GFP_KERNEL); > + if (!p.vec) > + return -ENOMEM; > + } else { > + p.vec = NULL; I find it cleaner to initialize 'p.vec = NULL' unconditionally before IS_GET_OP() check. > + } > + __start = __end = start; > + while (!ret && __end < end) { > + p.vec_index = 0; > + empty_slots = arg->vec_len - vec_index; > + if (p.vec_len > empty_slots) > + p.vec_len = empty_slots; > + > + __end = (__start + PAGEMAP_WALK_SIZE) & PAGEMAP_WALK_MASK; > + if (__end > end) > + __end = end; Easier to understand using min(). > + > + mmap_read_lock(mm); > + ret = walk_page_range(mm, __start, __end, &pagemap_scan_ops, &p); > + mmap_read_unlock(mm); > + if (!(!ret || ret == -ENOSPC)) Double negations complicate things unnecessarily. And if you already "break" on ret, why do you check the condition in the while loop? > + goto free_data; > + > + __start = __end; > + if (IS_GET_OP(arg) && p.vec_index) { > + if (copy_to_user(&vec[vec_index], p.vec, > + p.vec_index * sizeof(struct page_region))) { > + ret = -EFAULT; > + goto free_data; > + } > + vec_index += p.vec_index; > + } > + } > + ret = export_prev_to_out(&p, vec, &vec_index); > + if (!ret) > + ret = vec_index; > +free_data: > + if (IS_GET_OP(arg)) > + kfree(p.vec); Just call it unconditionally. > + > + return ret; > +} > + > +static long pagemap_scan_ioctl(struct file *file, unsigned int cmd, unsigned long arg) > +{ > + struct pagemap_scan_arg __user *uarg = (struct pagemap_scan_arg __user *)arg; > + struct mm_struct *mm = file->private_data; > + struct pagemap_scan_arg argument; > + > + if (cmd == PAGEMAP_SCAN) { > + if (copy_from_user(&argument, uarg, sizeof(struct pagemap_scan_arg))) > + return -EFAULT; > + return do_pagemap_cmd(mm, &argument); > + } > + return -EINVAL; > +} > + > const struct file_operations proc_pagemap_operations = { > .llseek = mem_lseek, /* borrow this */ > .read = pagemap_read, > .open = pagemap_open, > .release = pagemap_release, > + .unlocked_ioctl = pagemap_scan_ioctl, > + .compat_ioctl = pagemap_scan_ioctl, > }; > #endif /* CONFIG_PROC_PAGE_MONITOR */ > > diff --git a/include/uapi/linux/fs.h b/include/uapi/linux/fs.h > index b7b56871029c..1ae9a8684b48 100644 > --- a/include/uapi/linux/fs.h > +++ b/include/uapi/linux/fs.h > @@ -305,4 +305,54 @@ typedef int __bitwise __kernel_rwf_t; > #define RWF_SUPPORTED (RWF_HIPRI | RWF_DSYNC | RWF_SYNC | RWF_NOWAIT |\ > RWF_APPEND) > > +/* Pagemap ioctl */ > +#define PAGEMAP_SCAN _IOWR('f', 16, struct pagemap_scan_arg) > + > +/* Bits are set in the bitmap of the page_region and masks in pagemap_scan_args */ > +#define PAGE_IS_WRITTEN (1 << 0) > +#define PAGE_IS_FILE (1 << 1) > +#define PAGE_IS_PRESENT (1 << 2) > +#define PAGE_IS_SWAPPED (1 << 3) These names are way too generic and are likely to be misused for the wrong purpose. The "_IS_" part seems confusing as well. So I think the naming needs to be fixed and some new type (using typedef) or enum should be introduced to hold these flags. I understand it is part of uapi and it is less common there, but it is not unheard of and does make things clearer. > + > +/* > + * struct page_region - Page region with bitmap flags > + * @start: Start of the region > + * @len: Length of the region > + * bitmap: Bits sets for the region > + */ > +struct page_region { > + __u64 start; > + __u64 len; I presume in bytes. Would be useful to mention. > + __u64 bitmap; > +}; > + > +/* > + * struct pagemap_scan_arg - Pagemap ioctl argument > + * @start: Starting address of the region > + * @len: Length of the region (All the pages in this length are included) > + * @vec: Address of page_region struct array for output > + * @vec_len: Length of the page_region struct array > + * @max_pages: Optional max return pages > + * @flags: Flags for the IOCTL > + * @required_mask: Required mask - All of these bits have to be set in the PTE > + * @anyof_mask: Any mask - Any of these bits are set in the PTE > + * @excluded_mask: Exclude mask - None of these bits are set in the PTE > + * @return_mask: Bits that are to be reported in page_region > + */ > +struct pagemap_scan_arg { > + __u64 start; > + __u64 len; > + __u64 vec; > + __u64 vec_len; > + __u32 max_pages; > + __u32 flags; > + __u64 required_mask; > + __u64 anyof_mask; > + __u64 excluded_mask; > + __u64 return_mask; > +}; > + > +/* Special flags */ > +#define PAGEMAP_WP_ENGAGE (1 << 0) > + > #endif /* _UAPI_LINUX_FS_H */
On 2/17/23 3:10 PM, Mike Rapoport wrote: > On Thu, Feb 02, 2023 at 04:29:12PM +0500, Muhammad Usama Anjum wrote: >> This IOCTL, PAGEMAP_SCAN on pagemap file can be used to get and/or clear >> the info about page table entries. The following operations are supported >> in this ioctl: >> - Get the information if the pages have been written-to (PAGE_IS_WRITTEN), >> file mapped (PAGE_IS_FILE), present (PAGE_IS_PRESENT) or swapped >> (PAGE_IS_SWAPPED). >> - Write-protect the pages (PAGEMAP_WP_ENGAGE) to start finding which >> pages have been written-to. >> - Find pages which have been written-to and write protect the pages >> (atomic PAGE_IS_WRITTEN + PAGEMAP_WP_ENGAGE) >> >> To get information about which pages have been written-to and/or write >> protect the pages, following must be performed first in order: >> - The userfaultfd file descriptor is created with userfaultfd syscall. >> - The UFFD_FEATURE_WP_ASYNC feature is set by UFFDIO_API IOCTL. >> - The memory range is registered with UFFDIO_REGISTER_MODE_WP mode >> through UFFDIO_REGISTER IOCTL. >> Then the any part of the registered memory or the whole memory region >> can be write protected using the UFFDIO_WRITEPROTECT IOCTL or >> PAGEMAP_SCAN IOCTL. >> >> struct pagemap_scan_args is used as the argument of the IOCTL. In this >> struct: >> - The range is specified through start and len. >> - The output buffer of struct page_region array and size is specified as >> vec and vec_len. >> - The optional maximum requested pages are specified in the max_pages. >> - The flags can be specified in the flags field. The PAGEMAP_WP_ENGAGE >> is the only added flag at this time. >> - The masks are specified in required_mask, anyof_mask, excluded_ mask >> and return_mask. >> >> This IOCTL can be extended to get information about more PTE bits. This >> IOCTL doesn't support hugetlbs at the moment. No information about >> hugetlb can be obtained. This patch has evolved from a basic patch from >> Gabriel Krisman Bertazi. >> >> Signed-off-by: Muhammad Usama Anjum <usama.anjum@collabora.com> >> --- >> Changes in v10: >> - move changes in tools/include/uapi/linux/fs.h to separate patch >> - update commit message >> >> Change in v8: >> - Correct is_pte_uffd_wp() >> - Improve readability and error checks >> - Remove some un-needed code >> >> Changes in v7: >> - Rebase on top of latest next >> - Fix some corner cases >> - Base soft-dirty on the uffd wp async >> - Update the terminologies >> - Optimize the memory usage inside the ioctl >> >> Changes in v6: >> - Rename variables and update comments >> - Make IOCTL independent of soft_dirty config >> - Change masks and bitmap type to _u64 >> - Improve code quality >> >> Changes in v5: >> - Remove tlb flushing even for clear operation >> >> Changes in v4: >> - Update the interface and implementation >> >> Changes in v3: >> - Tighten the user-kernel interface by using explicit types and add more >> error checking >> >> Changes in v2: >> - Convert the interface from syscall to ioctl >> - Remove pidfd support as it doesn't make sense in ioctl >> --- >> fs/proc/task_mmu.c | 290 ++++++++++++++++++++++++++++++++++++++++ >> include/uapi/linux/fs.h | 50 +++++++ >> 2 files changed, 340 insertions(+) >> >> diff --git a/fs/proc/task_mmu.c b/fs/proc/task_mmu.c >> index e35a0398db63..c6bde19d63d9 100644 >> --- a/fs/proc/task_mmu.c >> +++ b/fs/proc/task_mmu.c >> @@ -19,6 +19,7 @@ >> #include <linux/shmem_fs.h> >> #include <linux/uaccess.h> >> #include <linux/pkeys.h> >> +#include <linux/minmax.h> >> >> #include <asm/elf.h> >> #include <asm/tlb.h> >> @@ -1135,6 +1136,22 @@ static inline void clear_soft_dirty(struct vm_area_struct *vma, >> } >> #endif >> >> +static inline bool is_pte_uffd_wp(pte_t pte) >> +{ >> + if ((pte_present(pte) && pte_uffd_wp(pte)) || >> + (pte_swp_uffd_wp_any(pte))) >> + return true; >> + return false; >> +} >> + >> +static inline bool is_pmd_uffd_wp(pmd_t pmd) >> +{ >> + if ((pmd_present(pmd) && pmd_uffd_wp(pmd)) || >> + (is_swap_pmd(pmd) && pmd_swp_uffd_wp(pmd))) >> + return true; >> + return false; >> +} >> + >> #if defined(CONFIG_MEM_SOFT_DIRTY) && defined(CONFIG_TRANSPARENT_HUGEPAGE) >> static inline void clear_soft_dirty_pmd(struct vm_area_struct *vma, >> unsigned long addr, pmd_t *pmdp) >> @@ -1763,11 +1780,284 @@ static int pagemap_release(struct inode *inode, struct file *file) >> return 0; >> } >> >> +#define PAGEMAP_BITS_ALL (PAGE_IS_WRITTEN | PAGE_IS_FILE | \ >> + PAGE_IS_PRESENT | PAGE_IS_SWAPPED) >> +#define PAGEMAP_NON_WRITTEN_BITS (PAGE_IS_FILE | PAGE_IS_PRESENT | PAGE_IS_SWAPPED) >> +#define IS_WP_ENGAGE_OP(a) (a->flags & PAGEMAP_WP_ENGAGE) >> +#define IS_GET_OP(a) (a->vec) >> +#define HAS_NO_SPACE(p) (p->max_pages && (p->found_pages == p->max_pages)) >> + >> +#define PAGEMAP_SCAN_BITMAP(wt, file, present, swap) \ >> + (wt | file << 1 | present << 2 | swap << 3) >> +#define IS_WT_REQUIRED(a) \ >> + ((a->required_mask & PAGE_IS_WRITTEN) || \ >> + (a->anyof_mask & PAGE_IS_WRITTEN)) > > All these macros are specific to pagemap_scan_ioctl() and should be > namespaced accordingly, e.g. PM_SCAN_BITS_ALL, PM_SCAN_BITMAP etc. > > Also, IS_<opname>_OP() will be more readable as PM_SCAN_OP_IS_<opname> and > I'd suggest to open code IS_WP_ENGAGE_OP() and IS_GET_OP() and make > HAS_NO_SPACE() and IS_WT_REQUIRED() static inlines rather than macros. Will do in next version. > > And I'd also make IS_GET_OP() more explicit by defining a PAGEMAP_WP_GET or > similar flag rather than using arg->vec. I had in the first revisions. But explicit GET_OP was removed in the previous iterations after some feedback. Peter has also suggested this. I'll add the GET_OP flag again. > >> + >> +struct pagemap_scan_private { >> + struct page_region *vec; >> + struct page_region prev; >> + unsigned long vec_len, vec_index; >> + unsigned int max_pages, found_pages, flags; >> + unsigned long required_mask, anyof_mask, excluded_mask, return_mask; >> +}; >> + >> +static int pagemap_scan_test_walk(unsigned long start, unsigned long end, struct mm_walk *walk) > > Please keep the lines under 80 characters limit. > >> +{ >> + struct pagemap_scan_private *p = walk->private; >> + struct vm_area_struct *vma = walk->vma; >> + >> + if (IS_WT_REQUIRED(p) && !userfaultfd_wp(vma) && !userfaultfd_wp_async(vma)) >> + return -EPERM; >> + if (vma->vm_flags & VM_PFNMAP) >> + return 1; >> + return 0; >> +} >> + >> +static inline int pagemap_scan_output(bool wt, bool file, bool pres, bool swap, >> + struct pagemap_scan_private *p, unsigned long addr, >> + unsigned int len) >> +{ >> + unsigned long bitmap, cur = PAGEMAP_SCAN_BITMAP(wt, file, pres, swap); >> + bool cpy = true; >> + struct page_region *prev = &p->prev; >> + >> + if (HAS_NO_SPACE(p)) >> + return -ENOSPC; >> + >> + if (p->max_pages && p->found_pages + len >= p->max_pages) >> + len = p->max_pages - p->found_pages; >> + if (!len) >> + return -EINVAL; >> + >> + if (p->required_mask) >> + cpy = ((p->required_mask & cur) == p->required_mask); >> + if (cpy && p->anyof_mask) >> + cpy = (p->anyof_mask & cur); >> + if (cpy && p->excluded_mask) >> + cpy = !(p->excluded_mask & cur); >> + bitmap = cur & p->return_mask; >> + if (cpy && bitmap) { >> + if ((prev->len) && (prev->bitmap == bitmap) && >> + (prev->start + prev->len * PAGE_SIZE == addr)) { >> + prev->len += len; >> + p->found_pages += len; >> + } else if (p->vec_index < p->vec_len) { >> + if (prev->len) { >> + memcpy(&p->vec[p->vec_index], prev, sizeof(struct page_region)); >> + p->vec_index++; >> + } >> + prev->start = addr; >> + prev->len = len; >> + prev->bitmap = bitmap; >> + p->found_pages += len; >> + } else { >> + return -ENOSPC; >> + } >> + } >> + return 0; > > Please don't save on empty lines. Empty lines between logical pieces > improve readability. Sorry, I'll add them. > >> +} >> + >> +static inline int export_prev_to_out(struct pagemap_scan_private *p, struct page_region __user *vec, >> + unsigned long *vec_index) >> +{ >> + struct page_region *prev = &p->prev; >> + >> + if (prev->len) { >> + if (copy_to_user(&vec[*vec_index], prev, sizeof(struct page_region))) >> + return -EFAULT; >> + p->vec_index++; >> + (*vec_index)++; >> + prev->len = 0; >> + } >> + return 0; >> +} >> + >> +static inline int pagemap_scan_pmd_entry(pmd_t *pmd, unsigned long start, >> + unsigned long end, struct mm_walk *walk) >> +{ >> + struct pagemap_scan_private *p = walk->private; >> + struct vm_area_struct *vma = walk->vma; >> + unsigned long addr = end; >> + spinlock_t *ptl; >> + int ret = 0; >> + pte_t *pte; >> + >> +#ifdef CONFIG_TRANSPARENT_HUGEPAGE >> + ptl = pmd_trans_huge_lock(pmd, vma); >> + if (ptl) { >> + bool pmd_wt; >> + >> + pmd_wt = !is_pmd_uffd_wp(*pmd); >> + /* >> + * Break huge page into small pages if operation needs to be performed is >> + * on a portion of the huge page. >> + */ >> + if (pmd_wt && IS_WP_ENGAGE_OP(p) && (end - start < HPAGE_SIZE)) { >> + spin_unlock(ptl); >> + split_huge_pmd(vma, pmd, start); >> + goto process_smaller_pages; >> + } >> + if (IS_GET_OP(p)) >> + ret = pagemap_scan_output(pmd_wt, vma->vm_file, pmd_present(*pmd), >> + is_swap_pmd(*pmd), p, start, >> + (end - start)/PAGE_SIZE); >> + spin_unlock(ptl); >> + if (!ret) { >> + if (pmd_wt && IS_WP_ENGAGE_OP(p)) >> + uffd_wp_range(walk->mm, vma, start, HPAGE_SIZE, true); >> + } >> + return ret; >> + } >> +process_smaller_pages: >> + if (pmd_trans_unstable(pmd)) >> + return 0; >> +#endif /* CONFIG_TRANSPARENT_HUGEPAGE */ >> + >> + pte = pte_offset_map_lock(vma->vm_mm, pmd, start, &ptl); >> + if (IS_GET_OP(p)) { >> + for (addr = start; addr < end; pte++, addr += PAGE_SIZE) { >> + ret = pagemap_scan_output(!is_pte_uffd_wp(*pte), vma->vm_file, >> + pte_present(*pte), is_swap_pte(*pte), p, addr, 1); >> + if (ret) >> + break; >> + } >> + } >> + pte_unmap_unlock(pte - 1, ptl); >> + if ((!ret || ret == -ENOSPC) && IS_WP_ENGAGE_OP(p) && (addr - start)) >> + uffd_wp_range(walk->mm, vma, start, addr - start, true); >> + >> + cond_resched(); >> + return ret; >> +} >> + >> +static int pagemap_scan_pte_hole(unsigned long addr, unsigned long end, int depth, >> + struct mm_walk *walk) >> +{ >> + struct pagemap_scan_private *p = walk->private; >> + struct vm_area_struct *vma = walk->vma; >> + int ret = 0; >> + >> + if (vma) >> + ret = pagemap_scan_output(false, vma->vm_file, false, false, p, addr, >> + (end - addr)/PAGE_SIZE); >> + return ret; >> +} >> + >> +/* No hugetlb support is present. */ >> +static const struct mm_walk_ops pagemap_scan_ops = { >> + .test_walk = pagemap_scan_test_walk, >> + .pmd_entry = pagemap_scan_pmd_entry, >> + .pte_hole = pagemap_scan_pte_hole, >> +}; >> + >> +static long do_pagemap_cmd(struct mm_struct *mm, struct pagemap_scan_arg *arg) >> +{ >> + unsigned long empty_slots, vec_index = 0; >> + unsigned long __user start, end; >> + unsigned long __start, __end; >> + struct page_region __user *vec; >> + struct pagemap_scan_private p; >> + int ret = 0; >> + >> + start = (unsigned long)untagged_addr(arg->start); >> + vec = (struct page_region *)(unsigned long)untagged_addr(arg->vec); >> + >> + /* Validate memory ranges */ >> + if ((!IS_ALIGNED(start, PAGE_SIZE)) || (!access_ok((void __user *)start, arg->len))) >> + return -EINVAL; >> + if (IS_GET_OP(arg) && ((arg->vec_len == 0) || >> + (!access_ok((void __user *)vec, arg->vec_len * sizeof(struct page_region))))) >> + return -EINVAL; >> + >> + /* Detect illegal flags and masks */ >> + if ((arg->flags & ~PAGEMAP_WP_ENGAGE) || (arg->required_mask & ~PAGEMAP_BITS_ALL) || >> + (arg->anyof_mask & ~PAGEMAP_BITS_ALL) || (arg->excluded_mask & ~PAGEMAP_BITS_ALL) || >> + (arg->return_mask & ~PAGEMAP_BITS_ALL)) >> + return -EINVAL; >> + if (IS_GET_OP(arg) && ((!arg->required_mask && !arg->anyof_mask && !arg->excluded_mask) || >> + !arg->return_mask)) >> + return -EINVAL; >> + /* The non-WT flags cannot be obtained if PAGEMAP_WP_ENGAGE is also specified. */ >> + if (IS_WP_ENGAGE_OP(arg) && ((arg->required_mask & PAGEMAP_NON_WRITTEN_BITS) || >> + (arg->anyof_mask & PAGEMAP_NON_WRITTEN_BITS))) >> + return -EINVAL; > > I'd split argument validation into a separate function and split the OR'ed > conditions into separate if statements, e.g > > bool pm_scan_args_valid(struct pagemap_scan_arg *arg) > { > if (IS_GET_OP(arg)) { > if (!arg->return_mask) > return false; > if (!arg->required_mask && !arg->anyof_mask && !arg->excluded_mask) > return false; > } > > /* ... */ > > return true; > } This seems a very good way. Thank you so much! > >> + >> + end = start + arg->len; >> + p.max_pages = arg->max_pages; >> + p.found_pages = 0; >> + p.flags = arg->flags; >> + p.required_mask = arg->required_mask; >> + p.anyof_mask = arg->anyof_mask; >> + p.excluded_mask = arg->excluded_mask; >> + p.return_mask = arg->return_mask; >> + p.prev.len = 0; >> + p.vec_len = (PAGEMAP_WALK_SIZE >> PAGE_SHIFT); >> + >> + if (IS_GET_OP(arg)) { >> + p.vec = kmalloc_array(p.vec_len, sizeof(struct page_region), GFP_KERNEL); >> + if (!p.vec) >> + return -ENOMEM; >> + } else { >> + p.vec = NULL; >> + } >> + __start = __end = start; >> + while (!ret && __end < end) { >> + p.vec_index = 0; >> + empty_slots = arg->vec_len - vec_index; >> + if (p.vec_len > empty_slots) >> + p.vec_len = empty_slots; >> + >> + __end = (__start + PAGEMAP_WALK_SIZE) & PAGEMAP_WALK_MASK; >> + if (__end > end) >> + __end = end; >> + >> + mmap_read_lock(mm); >> + ret = walk_page_range(mm, __start, __end, &pagemap_scan_ops, &p); >> + mmap_read_unlock(mm); >> + if (!(!ret || ret == -ENOSPC)) >> + goto free_data; >> + >> + __start = __end; >> + if (IS_GET_OP(arg) && p.vec_index) { >> + if (copy_to_user(&vec[vec_index], p.vec, >> + p.vec_index * sizeof(struct page_region))) { >> + ret = -EFAULT; >> + goto free_data; >> + } >> + vec_index += p.vec_index; >> + } >> + } >> + ret = export_prev_to_out(&p, vec, &vec_index); >> + if (!ret) >> + ret = vec_index; >> +free_data: >> + if (IS_GET_OP(arg)) >> + kfree(p.vec); >> + >> + return ret; >> +} >> + >> +static long pagemap_scan_ioctl(struct file *file, unsigned int cmd, unsigned long arg) >> +{ >> + struct pagemap_scan_arg __user *uarg = (struct pagemap_scan_arg __user *)arg; >> + struct mm_struct *mm = file->private_data; >> + struct pagemap_scan_arg argument; >> + >> + if (cmd == PAGEMAP_SCAN) { >> + if (copy_from_user(&argument, uarg, sizeof(struct pagemap_scan_arg))) >> + return -EFAULT; >> + return do_pagemap_cmd(mm, &argument); >> + } >> + return -EINVAL; >> +} >> + >> const struct file_operations proc_pagemap_operations = { >> .llseek = mem_lseek, /* borrow this */ >> .read = pagemap_read, >> .open = pagemap_open, >> .release = pagemap_release, >> + .unlocked_ioctl = pagemap_scan_ioctl, >> + .compat_ioctl = pagemap_scan_ioctl, >> }; >> #endif /* CONFIG_PROC_PAGE_MONITOR */ >> >> diff --git a/include/uapi/linux/fs.h b/include/uapi/linux/fs.h >> index b7b56871029c..1ae9a8684b48 100644 >> --- a/include/uapi/linux/fs.h >> +++ b/include/uapi/linux/fs.h >> @@ -305,4 +305,54 @@ typedef int __bitwise __kernel_rwf_t; >> #define RWF_SUPPORTED (RWF_HIPRI | RWF_DSYNC | RWF_SYNC | RWF_NOWAIT |\ >> RWF_APPEND) >> >> +/* Pagemap ioctl */ >> +#define PAGEMAP_SCAN _IOWR('f', 16, struct pagemap_scan_arg) >> + >> +/* Bits are set in the bitmap of the page_region and masks in pagemap_scan_args */ >> +#define PAGE_IS_WRITTEN (1 << 0) >> +#define PAGE_IS_FILE (1 << 1) >> +#define PAGE_IS_PRESENT (1 << 2) >> +#define PAGE_IS_SWAPPED (1 << 3) >> + >> +/* >> + * struct page_region - Page region with bitmap flags >> + * @start: Start of the region >> + * @len: Length of the region >> + * bitmap: Bits sets for the region >> + */ >> +struct page_region { >> + __u64 start; >> + __u64 len; >> + __u64 bitmap; >> +}; >> + >> +/* >> + * struct pagemap_scan_arg - Pagemap ioctl argument >> + * @start: Starting address of the region >> + * @len: Length of the region (All the pages in this length are included) >> + * @vec: Address of page_region struct array for output >> + * @vec_len: Length of the page_region struct array >> + * @max_pages: Optional max return pages >> + * @flags: Flags for the IOCTL >> + * @required_mask: Required mask - All of these bits have to be set in the PTE >> + * @anyof_mask: Any mask - Any of these bits are set in the PTE >> + * @excluded_mask: Exclude mask - None of these bits are set in the PTE >> + * @return_mask: Bits that are to be reported in page_region >> + */ >> +struct pagemap_scan_arg { >> + __u64 start; >> + __u64 len; >> + __u64 vec; >> + __u64 vec_len; >> + __u32 max_pages; >> + __u32 flags; >> + __u64 required_mask; >> + __u64 anyof_mask; >> + __u64 excluded_mask; >> + __u64 return_mask; >> +}; >> + >> +/* Special flags */ >> +#define PAGEMAP_WP_ENGAGE (1 << 0) >> + >> #endif /* _UAPI_LINUX_FS_H */ >> -- >> 2.30.2 >> >
On 2/20/23 3:38 PM, Muhammad Usama Anjum wrote: >>> +#define PAGEMAP_BITS_ALL (PAGE_IS_WRITTEN | PAGE_IS_FILE | \ >>> + PAGE_IS_PRESENT | PAGE_IS_SWAPPED) >>> +#define PAGEMAP_NON_WRITTEN_BITS (PAGE_IS_FILE | PAGE_IS_PRESENT | PAGE_IS_SWAPPED) >>> +#define IS_WP_ENGAGE_OP(a) (a->flags & PAGEMAP_WP_ENGAGE) >>> +#define IS_GET_OP(a) (a->vec) >>> +#define HAS_NO_SPACE(p) (p->max_pages && (p->found_pages == p->max_pages)) >>> + >>> +#define PAGEMAP_SCAN_BITMAP(wt, file, present, swap) \ >>> + (wt | file << 1 | present << 2 | swap << 3) >>> +#define IS_WT_REQUIRED(a) \ >>> + ((a->required_mask & PAGE_IS_WRITTEN) || \ >>> + (a->anyof_mask & PAGE_IS_WRITTEN)) >> All these macros are specific to pagemap_scan_ioctl() and should be >> namespaced accordingly, e.g. PM_SCAN_BITS_ALL, PM_SCAN_BITMAP etc. >> >> Also, IS_<opname>_OP() will be more readable as PM_SCAN_OP_IS_<opname> and >> I'd suggest to open code IS_WP_ENGAGE_OP() and IS_GET_OP() and make >> HAS_NO_SPACE() and IS_WT_REQUIRED() static inlines rather than macros. > Will do in next version. > IS_WP_ENGAGE_OP() and IS_GET_OP() which can be renamed to PM_SCAN_OP_IS_WP() and PM_SCAN_OP_IS_GET() seem better to me instead of open code as they seem more readable to me. I can open code if you insist.
On Mon, Feb 20, 2023 at 04:38:10PM +0500, Muhammad Usama Anjum wrote: > On 2/20/23 3:38 PM, Muhammad Usama Anjum wrote: > >>> +#define PAGEMAP_BITS_ALL (PAGE_IS_WRITTEN | PAGE_IS_FILE | \ > >>> + PAGE_IS_PRESENT | PAGE_IS_SWAPPED) > >>> +#define PAGEMAP_NON_WRITTEN_BITS (PAGE_IS_FILE | PAGE_IS_PRESENT | PAGE_IS_SWAPPED) > >>> +#define IS_WP_ENGAGE_OP(a) (a->flags & PAGEMAP_WP_ENGAGE) > >>> +#define IS_GET_OP(a) (a->vec) > >>> +#define HAS_NO_SPACE(p) (p->max_pages && (p->found_pages == p->max_pages)) > >>> + > >>> +#define PAGEMAP_SCAN_BITMAP(wt, file, present, swap) \ > >>> + (wt | file << 1 | present << 2 | swap << 3) > >>> +#define IS_WT_REQUIRED(a) \ > >>> + ((a->required_mask & PAGE_IS_WRITTEN) || \ > >>> + (a->anyof_mask & PAGE_IS_WRITTEN)) > >> All these macros are specific to pagemap_scan_ioctl() and should be > >> namespaced accordingly, e.g. PM_SCAN_BITS_ALL, PM_SCAN_BITMAP etc. > >> > >> Also, IS_<opname>_OP() will be more readable as PM_SCAN_OP_IS_<opname> and > >> I'd suggest to open code IS_WP_ENGAGE_OP() and IS_GET_OP() and make > >> HAS_NO_SPACE() and IS_WT_REQUIRED() static inlines rather than macros. > > Will do in next version. > > > > IS_WP_ENGAGE_OP() and IS_GET_OP() which can be renamed to > PM_SCAN_OP_IS_WP() and PM_SCAN_OP_IS_GET() seem better to me instead of > open code as they seem more readable to me. I can open code if you insist. I'd suggest to see how the rework of pagemap_scan_pmd_entry() paves out. An open-coded '&' is surely clearer than a macro/function, but if it's buried in a long sequence of conditions, it may be not such clear win. > -- > BR, > Muhammad Usama Anjum
Hello Nadav, Thank you so much for reviewing! On 2/19/23 6:52 PM, Nadav Amit wrote: > > On 2/2/23 1:29 PM, Muhammad Usama Anjum wrote: >> This IOCTL, PAGEMAP_SCAN on pagemap file can be used to get and/or clear >> the info about page table entries. The following operations are supported >> in this ioctl: >> - Get the information if the pages have been written-to (PAGE_IS_WRITTEN), >> file mapped (PAGE_IS_FILE), present (PAGE_IS_PRESENT) or swapped >> (PAGE_IS_SWAPPED). >> - Write-protect the pages (PAGEMAP_WP_ENGAGE) to start finding which >> pages have been written-to. >> - Find pages which have been written-to and write protect the pages >> (atomic PAGE_IS_WRITTEN + PAGEMAP_WP_ENGAGE) >> >> To get information about which pages have been written-to and/or write >> protect the pages, following must be performed first in order: >> - The userfaultfd file descriptor is created with userfaultfd syscall. >> - The UFFD_FEATURE_WP_ASYNC feature is set by UFFDIO_API IOCTL. >> - The memory range is registered with UFFDIO_REGISTER_MODE_WP mode >> through UFFDIO_REGISTER IOCTL. >> Then the any part of the registered memory or the whole memory region >> can be write protected using the UFFDIO_WRITEPROTECT IOCTL or >> PAGEMAP_SCAN IOCTL. >> >> struct pagemap_scan_args is used as the argument of the IOCTL. In this >> struct: >> - The range is specified through start and len. >> - The output buffer of struct page_region array and size is specified as >> vec and vec_len. >> - The optional maximum requested pages are specified in the max_pages. >> - The flags can be specified in the flags field. The PAGEMAP_WP_ENGAGE >> is the only added flag at this time. >> - The masks are specified in required_mask, anyof_mask, excluded_ mask >> and return_mask. >> >> This IOCTL can be extended to get information about more PTE bits. This >> IOCTL doesn't support hugetlbs at the moment. No information about >> hugetlb can be obtained. This patch has evolved from a basic patch from >> Gabriel Krisman Bertazi. > > I was not involved before, so I am not commenting on the API and code to > avoid making unhelpful noise. > > Having said that, some things in the code seem quite dirty and make > understanding the code hard to read. There is a new proposal about the flags in the interface. I'll include you there. > >> Signed-off-by: Muhammad Usama Anjum <usama.anjum@collabora.com> >> --- >> Changes in v10: >> - move changes in tools/include/uapi/linux/fs.h to separate patch >> - update commit message >> >> Change in v8: >> - Correct is_pte_uffd_wp() >> - Improve readability and error checks >> - Remove some un-needed code >> >> Changes in v7: >> - Rebase on top of latest next >> - Fix some corner cases >> - Base soft-dirty on the uffd wp async >> - Update the terminologies >> - Optimize the memory usage inside the ioctl >> >> Changes in v6: >> - Rename variables and update comments >> - Make IOCTL independent of soft_dirty config >> - Change masks and bitmap type to _u64 >> - Improve code quality >> >> Changes in v5: >> - Remove tlb flushing even for clear operation >> >> Changes in v4: >> - Update the interface and implementation >> >> Changes in v3: >> - Tighten the user-kernel interface by using explicit types and add more >> error checking >> >> Changes in v2: >> - Convert the interface from syscall to ioctl >> - Remove pidfd support as it doesn't make sense in ioctl >> --- >> fs/proc/task_mmu.c | 290 ++++++++++++++++++++++++++++++++++++++++ >> include/uapi/linux/fs.h | 50 +++++++ >> 2 files changed, 340 insertions(+) >> >> diff --git a/fs/proc/task_mmu.c b/fs/proc/task_mmu.c >> index e35a0398db63..c6bde19d63d9 100644 >> --- a/fs/proc/task_mmu.c >> +++ b/fs/proc/task_mmu.c >> @@ -19,6 +19,7 @@ >> #include <linux/shmem_fs.h> >> #include <linux/uaccess.h> >> #include <linux/pkeys.h> >> +#include <linux/minmax.h> >> #include <asm/elf.h> >> #include <asm/tlb.h> >> @@ -1135,6 +1136,22 @@ static inline void clear_soft_dirty(struct >> vm_area_struct *vma, >> } >> #endif >> +static inline bool is_pte_uffd_wp(pte_t pte) >> +{ >> + if ((pte_present(pte) && pte_uffd_wp(pte)) || >> + (pte_swp_uffd_wp_any(pte))) >> + return true; >> + return false; >> +} >> + >> +static inline bool is_pmd_uffd_wp(pmd_t pmd) >> +{ >> + if ((pmd_present(pmd) && pmd_uffd_wp(pmd)) || >> + (is_swap_pmd(pmd) && pmd_swp_uffd_wp(pmd))) >> + return true; >> + return false; >> +} >> + >> #if defined(CONFIG_MEM_SOFT_DIRTY) && defined(CONFIG_TRANSPARENT_HUGEPAGE) >> static inline void clear_soft_dirty_pmd(struct vm_area_struct *vma, >> unsigned long addr, pmd_t *pmdp) >> @@ -1763,11 +1780,284 @@ static int pagemap_release(struct inode *inode, >> struct file *file) >> return 0; >> } >> +#define PAGEMAP_BITS_ALL (PAGE_IS_WRITTEN | PAGE_IS_FILE | \ >> + PAGE_IS_PRESENT | PAGE_IS_SWAPPED) >> +#define PAGEMAP_NON_WRITTEN_BITS (PAGE_IS_FILE | PAGE_IS_PRESENT | >> PAGE_IS_SWAPPED) >> +#define IS_WP_ENGAGE_OP(a) (a->flags & PAGEMAP_WP_ENGAGE) >> +#define IS_GET_OP(a) (a->vec) >> +#define HAS_NO_SPACE(p) (p->max_pages && (p->found_pages == >> p->max_pages)) > > I think that in general it is better to have an inline function instead of > macros when possible, as it is clearer and checks types. Anyhow, IMHO most > of these macros are better be open-coded. I'll update most of these in next version. > >> + >> +#define PAGEMAP_SCAN_BITMAP(wt, file, present, swap) \ >> + (wt | file << 1 | present << 2 | swap << 3) >> +#define IS_WT_REQUIRED(a) \ >> + ((a->required_mask & PAGE_IS_WRITTEN) || \ >> + (a->anyof_mask & PAGE_IS_WRITTEN)) >> + >> +struct pagemap_scan_private { >> + struct page_region *vec; >> + struct page_region prev; >> + unsigned long vec_len, vec_index; >> + unsigned int max_pages, found_pages, flags; >> + unsigned long required_mask, anyof_mask, excluded_mask, return_mask; >> +}; >> + >> +static int pagemap_scan_test_walk(unsigned long start, unsigned long >> end, struct mm_walk *walk) >> +{ >> + struct pagemap_scan_private *p = walk->private; >> + struct vm_area_struct *vma = walk->vma; >> + >> + if (IS_WT_REQUIRED(p) && !userfaultfd_wp(vma) && >> !userfaultfd_wp_async(vma)) >> + return -EPERM; >> + if (vma->vm_flags & VM_PFNMAP) >> + return 1; >> + return 0; >> +} >> + >> +static inline int pagemap_scan_output(bool wt, bool file, bool pres, >> bool swap, >> + struct pagemap_scan_private *p, unsigned long addr, >> + unsigned int len) >> +{ >> + unsigned long bitmap, cur = PAGEMAP_SCAN_BITMAP(wt, file, pres, swap); >> + bool cpy = true; >> + struct page_region *prev = &p->prev; >> + >> + if (HAS_NO_SPACE(p)) >> + return -ENOSPC; >> + >> + if (p->max_pages && p->found_pages + len >= p->max_pages) >> + len = p->max_pages - p->found_pages; >> + if (!len) >> + return -EINVAL; >> + >> + if (p->required_mask) >> + cpy = ((p->required_mask & cur) == p->required_mask); >> + if (cpy && p->anyof_mask) >> + cpy = (p->anyof_mask & cur); >> + if (cpy && p->excluded_mask) >> + cpy = !(p->excluded_mask & cur); >> + bitmap = cur & p->return_mask; >> + if (cpy && bitmap) { >> + if ((prev->len) && (prev->bitmap == bitmap) && >> + (prev->start + prev->len * PAGE_SIZE == addr)) { >> + prev->len += len; > The use of "len" both for bytes and pages is very confusing. Consider > changing the name to n_pages or something similar. Will update in next version. >> + p->found_pages += len; >> + } else if (p->vec_index < p->vec_len) { >> + if (prev->len) { >> + memcpy(&p->vec[p->vec_index], prev, sizeof(struct >> page_region)); >> + p->vec_index++; >> + } >> + prev->start = addr; >> + prev->len = len; >> + prev->bitmap = bitmap; >> + p->found_pages += len; >> + } else { >> + return -ENOSPC; >> + } >> + } >> + return 0; >> +} >> + >> +static inline int export_prev_to_out(struct pagemap_scan_private *p, >> struct page_region __user *vec, >> + unsigned long *vec_index) >> +{ >> + struct page_region *prev = &p->prev; >> + >> + if (prev->len) { >> + if (copy_to_user(&vec[*vec_index], prev, sizeof(struct >> page_region))) >> + return -EFAULT; >> + p->vec_index++; >> + (*vec_index)++; >> + prev->len = 0; >> + } >> + return 0; >> +} >> + >> +static inline int pagemap_scan_pmd_entry(pmd_t *pmd, unsigned long start, >> + unsigned long end, struct mm_walk *walk) >> +{ >> + struct pagemap_scan_private *p = walk->private; >> + struct vm_area_struct *vma = walk->vma; >> + unsigned long addr = end; >> + spinlock_t *ptl; >> + int ret = 0; >> + pte_t *pte; >> + >> +#ifdef CONFIG_TRANSPARENT_HUGEPAGE >> + ptl = pmd_trans_huge_lock(pmd, vma); >> + if (ptl) { >> + bool pmd_wt; >> + >> + pmd_wt = !is_pmd_uffd_wp(*pmd); >> + /* >> + * Break huge page into small pages if operation needs to be >> performed is >> + * on a portion of the huge page. >> + */ >> + if (pmd_wt && IS_WP_ENGAGE_OP(p) && (end - start < HPAGE_SIZE)) { >> + spin_unlock(ptl); >> + split_huge_pmd(vma, pmd, start); >> + goto process_smaller_pages; > I think that such goto's are really confusing and should be avoided. And > using 'else' (could have easily prevented the need for goto). It is not the > best solution though, since I think it would have been better to invert the > conditions. Yeah, else can be used here. But then we'll have to add a tab to all the code after adding else. We have already so many tabs and very less space to right code. Not sure which is better. >> + } >> + if (IS_GET_OP(p)) >> + ret = pagemap_scan_output(pmd_wt, vma->vm_file, >> pmd_present(*pmd), >> + is_swap_pmd(*pmd), p, start, >> + (end - start)/PAGE_SIZE); >> + spin_unlock(ptl); >> + if (!ret) { >> + if (pmd_wt && IS_WP_ENGAGE_OP(p)) >> + uffd_wp_range(walk->mm, vma, start, HPAGE_SIZE, true); >> + } >> + return ret; >> + } >> +process_smaller_pages: >> + if (pmd_trans_unstable(pmd)) >> + return 0; >> +#endif /* CONFIG_TRANSPARENT_HUGEPAGE */ >> + >> + pte = pte_offset_map_lock(vma->vm_mm, pmd, start, &ptl); >> + if (IS_GET_OP(p)) { >> + for (addr = start; addr < end; pte++, addr += PAGE_SIZE) { >> + ret = pagemap_scan_output(!is_pte_uffd_wp(*pte), vma->vm_file, >> + pte_present(*pte), is_swap_pte(*pte), p, addr, >> 1); >> + if (ret) >> + break; >> + } >> + } >> + pte_unmap_unlock(pte - 1, ptl); > We might have not entered the loop and pte-1 would be wrong. >> + if ((!ret || ret == -ENOSPC) && IS_WP_ENGAGE_OP(p) && (addr - start)) > What does 'addr - start' mean? If you want to say they are not equal, why > not say so? This has been revamped in the next version. >> + uffd_wp_range(walk->mm, vma, start, addr - start, true); >> + >> + cond_resched(); >> + return ret; >> +} >> + >> +static int pagemap_scan_pte_hole(unsigned long addr, unsigned long end, >> int depth, >> + struct mm_walk *walk) >> +{ >> + struct pagemap_scan_private *p = walk->private; >> + struct vm_area_struct *vma = walk->vma; >> + int ret = 0; >> + >> + if (vma) >> + ret = pagemap_scan_output(false, vma->vm_file, false, false, p, >> addr, >> + (end - addr)/PAGE_SIZE); >> + return ret; >> +} >> + >> +/* No hugetlb support is present. */ >> +static const struct mm_walk_ops pagemap_scan_ops = { >> + .test_walk = pagemap_scan_test_walk, >> + .pmd_entry = pagemap_scan_pmd_entry, >> + .pte_hole = pagemap_scan_pte_hole, >> +}; >> + >> +static long do_pagemap_cmd(struct mm_struct *mm, struct pagemap_scan_arg >> *arg) >> +{ >> + unsigned long empty_slots, vec_index = 0; >> + unsigned long __user start, end; > > The whole point of __user (attribute) is to be assigned to pointers. I'll remove it. > >> + unsigned long __start, __end; > > I think such names do not convey sufficient information. I'll update it. > >> + struct page_region __user *vec; >> + struct pagemap_scan_private p; >> + int ret = 0; >> + >> + start = (unsigned long)untagged_addr(arg->start); >> + vec = (struct page_region *)(unsigned long)untagged_addr(arg->vec); >> + >> + /* Validate memory ranges */ >> + if ((!IS_ALIGNED(start, PAGE_SIZE)) || (!access_ok((void __user >> *)start, arg->len))) >> + return -EINVAL; >> + if (IS_GET_OP(arg) && ((arg->vec_len == 0) || >> + (!access_ok((void __user *)vec, arg->vec_len * sizeof(struct >> page_region))))) >> + return -EINVAL; >> + >> + /* Detect illegal flags and masks */ >> + if ((arg->flags & ~PAGEMAP_WP_ENGAGE) || (arg->required_mask & >> ~PAGEMAP_BITS_ALL) || >> + (arg->anyof_mask & ~PAGEMAP_BITS_ALL) || (arg->excluded_mask & >> ~PAGEMAP_BITS_ALL) || >> + (arg->return_mask & ~PAGEMAP_BITS_ALL)) > > Using bitwise or to check > > (arg->required_mask | arg->anyof_mask | > arg->excluded_mask | arg->return_mask) & ~PAGE_MAP_BITS_ALL > > Would have been much cleaner, IMHO. I'll update it. > >> + return -EINVAL; >> + if (IS_GET_OP(arg) && ((!arg->required_mask && !arg->anyof_mask && >> !arg->excluded_mask) || >> + !arg->return_mask)) >> + return -EINVAL; >> + /* The non-WT flags cannot be obtained if PAGEMAP_WP_ENGAGE is also >> specified. */ >> + if (IS_WP_ENGAGE_OP(arg) && ((arg->required_mask & >> PAGEMAP_NON_WRITTEN_BITS) || >> + (arg->anyof_mask & PAGEMAP_NON_WRITTEN_BITS))) >> + return -EINVAL; >> + >> + end = start + arg->len; >> + p.max_pages = arg->max_pages; >> + p.found_pages = 0; >> + p.flags = arg->flags; >> + p.required_mask = arg->required_mask; >> + p.anyof_mask = arg->anyof_mask; >> + p.excluded_mask = arg->excluded_mask; >> + p.return_mask = arg->return_mask; >> + p.prev.len = 0; >> + p.vec_len = (PAGEMAP_WALK_SIZE >> PAGE_SHIFT); >> + >> + if (IS_GET_OP(arg)) { >> + p.vec = kmalloc_array(p.vec_len, sizeof(struct page_region), >> GFP_KERNEL); >> + if (!p.vec) >> + return -ENOMEM; >> + } else { >> + p.vec = NULL; > I find it cleaner to initialize 'p.vec = NULL' unconditionally before > IS_GET_OP() check. It'll get updated. >> + } >> + __start = __end = start; >> + while (!ret && __end < end) { >> + p.vec_index = 0; >> + empty_slots = arg->vec_len - vec_index; >> + if (p.vec_len > empty_slots) >> + p.vec_len = empty_slots; >> + >> + __end = (__start + PAGEMAP_WALK_SIZE) & PAGEMAP_WALK_MASK; >> + if (__end > end) >> + __end = end; > Easier to understand using min(). Will update. >> + >> + mmap_read_lock(mm); >> + ret = walk_page_range(mm, __start, __end, &pagemap_scan_ops, &p); >> + mmap_read_unlock(mm); >> + if (!(!ret || ret == -ENOSPC)) > > Double negations complicate things unnecessarily. > > And if you already "break" on ret, why do you check the condition in the > while loop? Ohh, good catch. > >> + goto free_data; >> + >> + __start = __end; >> + if (IS_GET_OP(arg) && p.vec_index) { >> + if (copy_to_user(&vec[vec_index], p.vec, >> + p.vec_index * sizeof(struct page_region))) { >> + ret = -EFAULT; >> + goto free_data; >> + } >> + vec_index += p.vec_index; >> + } >> + } >> + ret = export_prev_to_out(&p, vec, &vec_index); >> + if (!ret) >> + ret = vec_index; >> +free_data: >> + if (IS_GET_OP(arg)) >> + kfree(p.vec); > Just call it unconditionally. I didn't know it. I'll do it. >> + >> + return ret; >> +} >> + >> +static long pagemap_scan_ioctl(struct file *file, unsigned int cmd, >> unsigned long arg) >> +{ >> + struct pagemap_scan_arg __user *uarg = (struct pagemap_scan_arg >> __user *)arg; >> + struct mm_struct *mm = file->private_data; >> + struct pagemap_scan_arg argument; >> + >> + if (cmd == PAGEMAP_SCAN) { >> + if (copy_from_user(&argument, uarg, sizeof(struct >> pagemap_scan_arg))) >> + return -EFAULT; >> + return do_pagemap_cmd(mm, &argument); >> + } >> + return -EINVAL; >> +} >> + >> const struct file_operations proc_pagemap_operations = { >> .llseek = mem_lseek, /* borrow this */ >> .read = pagemap_read, >> .open = pagemap_open, >> .release = pagemap_release, >> + .unlocked_ioctl = pagemap_scan_ioctl, >> + .compat_ioctl = pagemap_scan_ioctl, >> }; >> #endif /* CONFIG_PROC_PAGE_MONITOR */ >> diff --git a/include/uapi/linux/fs.h b/include/uapi/linux/fs.h >> index b7b56871029c..1ae9a8684b48 100644 >> --- a/include/uapi/linux/fs.h >> +++ b/include/uapi/linux/fs.h >> @@ -305,4 +305,54 @@ typedef int __bitwise __kernel_rwf_t; >> #define RWF_SUPPORTED (RWF_HIPRI | RWF_DSYNC | RWF_SYNC | RWF_NOWAIT |\ >> RWF_APPEND) >> +/* Pagemap ioctl */ >> +#define PAGEMAP_SCAN _IOWR('f', 16, struct pagemap_scan_arg) >> + >> +/* Bits are set in the bitmap of the page_region and masks in >> pagemap_scan_args */ >> +#define PAGE_IS_WRITTEN (1 << 0) >> +#define PAGE_IS_FILE (1 << 1) >> +#define PAGE_IS_PRESENT (1 << 2) >> +#define PAGE_IS_SWAPPED (1 << 3) > > These names are way too generic and are likely to be misused for the wrong > purpose. The "_IS_" part seems confusing as well. So I think the naming > needs to be fixed and some new type (using typedef) or enum should be > introduced to hold these flags. I understand it is part of uapi and it is > less common there, but it is not unheard of and does make things clearer. Do you think PM_SCAN_PAGE_IS_* work here? > > >> + >> +/* >> + * struct page_region - Page region with bitmap flags >> + * @start: Start of the region >> + * @len: Length of the region >> + * bitmap: Bits sets for the region >> + */ >> +struct page_region { >> + __u64 start; >> + __u64 len; > > I presume in bytes. Would be useful to mention. Length of region in pages. > >> + __u64 bitmap; >> +}; >> + >> +/* >> + * struct pagemap_scan_arg - Pagemap ioctl argument >> + * @start: Starting address of the region >> + * @len: Length of the region (All the pages in this length are >> included) >> + * @vec: Address of page_region struct array for output >> + * @vec_len: Length of the page_region struct array >> + * @max_pages: Optional max return pages >> + * @flags: Flags for the IOCTL >> + * @required_mask: Required mask - All of these bits have to be set >> in the PTE >> + * @anyof_mask: Any mask - Any of these bits are set in the PTE >> + * @excluded_mask: Exclude mask - None of these bits are set in the PTE >> + * @return_mask: Bits that are to be reported in page_region >> + */ >> +struct pagemap_scan_arg { >> + __u64 start; >> + __u64 len; >> + __u64 vec; >> + __u64 vec_len; >> + __u32 max_pages; >> + __u32 flags; >> + __u64 required_mask; >> + __u64 anyof_mask; >> + __u64 excluded_mask; >> + __u64 return_mask; >> +}; >> + >> +/* Special flags */ >> +#define PAGEMAP_WP_ENGAGE (1 << 0) >> + >> #endif /* _UAPI_LINUX_FS_H */
Hi, On Thu, Feb 02, 2023 at 04:29:12PM +0500, Muhammad Usama Anjum wrote: > This IOCTL, PAGEMAP_SCAN on pagemap file can be used to get and/or clear > the info about page table entries. The following operations are supported > in this ioctl: > - Get the information if the pages have been written-to (PAGE_IS_WRITTEN), > file mapped (PAGE_IS_FILE), present (PAGE_IS_PRESENT) or swapped > (PAGE_IS_SWAPPED). > - Write-protect the pages (PAGEMAP_WP_ENGAGE) to start finding which > pages have been written-to. > - Find pages which have been written-to and write protect the pages > (atomic PAGE_IS_WRITTEN + PAGEMAP_WP_ENGAGE) > > +/* > + * struct pagemap_scan_arg - Pagemap ioctl argument > + * @start: Starting address of the region > + * @len: Length of the region (All the pages in this length are included) > + * @vec: Address of page_region struct array for output > + * @vec_len: Length of the page_region struct array > + * @max_pages: Optional max return pages > + * @flags: Flags for the IOCTL > + * @required_mask: Required mask - All of these bits have to be set in the PTE > + * @anyof_mask: Any mask - Any of these bits are set in the PTE > + * @excluded_mask: Exclude mask - None of these bits are set in the PTE > + * @return_mask: Bits that are to be reported in page_region > + */ > +struct pagemap_scan_arg { > + __u64 start; > + __u64 len; > + __u64 vec; > + __u64 vec_len; > + __u32 max_pages; > + __u32 flags; > + __u64 required_mask; > + __u64 anyof_mask; > + __u64 excluded_mask; > + __u64 return_mask; > +}; After Nadav's comment I've realized I missed the API part :) A few quick notes for now: * The arg struct is fixed, so it would be impossible to extend the API later. Following the clone3() example, I'd add 'size' field to the pagemam_scan_arg so that it would be possible to add new fields afterwards. * Please make flags __u64, just in case * Put size and flags at the beginning of the struct, e.g. strucr pagemap_scan_arg { size_t size; __u64 flags; /* all the rest */ }; > + > +/* Special flags */ > +#define PAGEMAP_WP_ENGAGE (1 << 0) > + > #endif /* _UAPI_LINUX_FS_H */ > -- > 2.30.2 >
On 2/20/23 6:26 PM, Mike Rapoport wrote: > Hi, > > On Thu, Feb 02, 2023 at 04:29:12PM +0500, Muhammad Usama Anjum wrote: >> This IOCTL, PAGEMAP_SCAN on pagemap file can be used to get and/or clear >> the info about page table entries. The following operations are supported >> in this ioctl: >> - Get the information if the pages have been written-to (PAGE_IS_WRITTEN), >> file mapped (PAGE_IS_FILE), present (PAGE_IS_PRESENT) or swapped >> (PAGE_IS_SWAPPED). >> - Write-protect the pages (PAGEMAP_WP_ENGAGE) to start finding which >> pages have been written-to. >> - Find pages which have been written-to and write protect the pages >> (atomic PAGE_IS_WRITTEN + PAGEMAP_WP_ENGAGE) >> >> +/* >> + * struct pagemap_scan_arg - Pagemap ioctl argument >> + * @start: Starting address of the region >> + * @len: Length of the region (All the pages in this length are included) >> + * @vec: Address of page_region struct array for output >> + * @vec_len: Length of the page_region struct array >> + * @max_pages: Optional max return pages >> + * @flags: Flags for the IOCTL >> + * @required_mask: Required mask - All of these bits have to be set in the PTE >> + * @anyof_mask: Any mask - Any of these bits are set in the PTE >> + * @excluded_mask: Exclude mask - None of these bits are set in the PTE >> + * @return_mask: Bits that are to be reported in page_region >> + */ >> +struct pagemap_scan_arg { >> + __u64 start; >> + __u64 len; >> + __u64 vec; >> + __u64 vec_len; >> + __u32 max_pages; >> + __u32 flags; >> + __u64 required_mask; >> + __u64 anyof_mask; >> + __u64 excluded_mask; >> + __u64 return_mask; >> +}; > > After Nadav's comment I've realized I missed the API part :) > > A few quick notes for now: > * The arg struct is fixed, so it would be impossible to extend the API > later. Following the clone3() example, I'd add 'size' field to the > pagemam_scan_arg so that it would be possible to add new fields afterwards. > * Please make flags __u64, just in case > * Put size and flags at the beginning of the struct, e.g. > > strucr pagemap_scan_arg { > size_t size; > __u64 flags; > /* all the rest */ > }; Updated. Thank you so much! > >> + >> +/* Special flags */ >> +#define PAGEMAP_WP_ENGAGE (1 << 0) >> + >> #endif /* _UAPI_LINUX_FS_H */ >> -- >> 2.30.2 >> >
Hi Michał, Thank you so much for comment! On 2/17/23 8:18 PM, Michał Mirosław wrote: > On Thu, 2 Feb 2023 at 12:30, Muhammad Usama Anjum > <usama.anjum@collabora.com> wrote: > [...] >> - The masks are specified in required_mask, anyof_mask, excluded_ mask >> and return_mask. > [...] The interface was suggested by Andrei back on the review of v3 [1]: > I mean we should be able to specify for what pages we need to get info > for. An ioctl argument can have these four fields: > * required bits (rmask & mask == mask) - all bits from this mask have to be set. > * any of these bits (amask & mask != 0) - any of these bits is set. > * exclude masks (emask & mask == 0) = none of these bits are set. > * return mask - bits that have to be reported to user. > > May I suggest a slightly modified interface for the flags? I've added everyone who may be interested in making interface better. > > As I understand, the return_mask is what is applied to page flags to > aggregate the list. > This is a separate thing, and I think it doesn't need changes except > maybe an improvement > in the documentation and visual distinction. > > For the page-selection mechanism, currently required_mask and > excluded_mask have conflicting They are opposite of each other: All the set bits in required_mask must be set for the page to be selected. All the set bits in excluded_mask must _not_ be set for the page to be selected. > responsibilities. I suggest to rework that to: > 1. negated_flags: page flags which are to be negated before applying > the page selection using following masks; Sorry I'm unable to understand the negation (which is XOR?). Lets look at the truth table: Page Flag negated_flags 0 0 0 0 1 1 1 0 1 1 1 0 If a page flag is 0 and negated_flag is 1, the result would be 1 which has changed the page flag. It isn't making sense to me. Why the page flag bit is being fliped? When Anrdei had proposed these masks, they seemed like a fancy way of filtering inside kernel and it was straight forward to understand. These masks would help his use cases for CRIU. So I'd included it. Please can you elaborate what is the purpose of negation? > 2. required_flags: flags which all have to be set in the > (negation-applied) page flags; > 3. anyof_flags: flags of which at least one has to be set in the > (negation-applied) page flags; > > IOW, the resulting algorithm would be: > > tested_flags = page_flags ^ negated_flags; > if (~tested_flags & required_flags) > skip page; > if (!(tested_flags & anyof_flags)) > skip_page; > > aggregate_on(page_flags & return_flags); > > Best Regards > Michał Mirosław [1] https://lore.kernel.org/all/YyiDg79flhWoMDZB@gmail.com
On Tue, 21 Feb 2023 at 11:28, Muhammad Usama Anjum <usama.anjum@collabora.com> wrote: > > Hi Michał, > > Thank you so much for comment! > > On 2/17/23 8:18 PM, Michał Mirosław wrote: [...] > > For the page-selection mechanism, currently required_mask and > > excluded_mask have conflicting > They are opposite of each other: > All the set bits in required_mask must be set for the page to be selected. > All the set bits in excluded_mask must _not_ be set for the page to be > selected. > > > responsibilities. I suggest to rework that to: > > 1. negated_flags: page flags which are to be negated before applying > > the page selection using following masks; > Sorry I'm unable to understand the negation (which is XOR?). Lets look at > the truth table: > Page Flag negated_flags > 0 0 0 > 0 1 1 > 1 0 1 > 1 1 0 > > If a page flag is 0 and negated_flag is 1, the result would be 1 which has > changed the page flag. It isn't making sense to me. Why the page flag bit > is being fliped? > > When Anrdei had proposed these masks, they seemed like a fancy way of > filtering inside kernel and it was straight forward to understand. These > masks would help his use cases for CRIU. So I'd included it. Please can you > elaborate what is the purpose of negation? The XOR is a way to invert the tested value of a flag (from positive to negative and the other way) without having the API with invalid values (with required_flags and excluded_flags you need to define a rule about what happens if a flag is present in both of the masks - either prioritise one mask over the other or reject the call). (Note: the XOR is applied only to the value of the flags for the purpose of testing page-selection criteria.) So: 1. if a flag is not set in negated_flags, but set in required_flags, then it means "this flag must be one" - equivalent to it being set in required_flag (in your current version of the API). 2. if a flag is set in negated_flags and also in required_flags, then it means "this flag must be zero" - equivalent to it being set in excluded_flags. The same thing goes for anyof_flags: if a flag is set in anyof_flags, then for it to be considered matched: 1. it must have a value of 1 if it is not set in negated_flags 2. it must have a value of 0 if it is set in negated_flags BTW, I think I assumed that both conditions (all flags in required_flags and at least one in anyof_flags is present) need to be true for the page to be selected - is this your intention? The example code has a bug though, in that if anyof_flags is zero it will never match. Let me fix the selection part: // calc. a mask of flags that have expected ("active") values tested_flags = page_flags ^ negated_flags; // are all required flags in "active" state? [== all zero when negated] if (~tested_flags & required_mask) skip page; // is any extra flag "active"? if (anyof_flags && !(tested_flags & anyof_flags)) skip page; Best Regards Michał Mirosław
On 2/21/23 5:42 PM, Michał Mirosław wrote: > On Tue, 21 Feb 2023 at 11:28, Muhammad Usama Anjum > <usama.anjum@collabora.com> wrote: >> >> Hi Michał, >> >> Thank you so much for comment! >> >> On 2/17/23 8:18 PM, Michał Mirosław wrote: > [...] >>> For the page-selection mechanism, currently required_mask and >>> excluded_mask have conflicting >> They are opposite of each other: >> All the set bits in required_mask must be set for the page to be selected. >> All the set bits in excluded_mask must _not_ be set for the page to be >> selected. >> >>> responsibilities. I suggest to rework that to: >>> 1. negated_flags: page flags which are to be negated before applying >>> the page selection using following masks; >> Sorry I'm unable to understand the negation (which is XOR?). Lets look at >> the truth table: >> Page Flag negated_flags >> 0 0 0 >> 0 1 1 >> 1 0 1 >> 1 1 0 >> >> If a page flag is 0 and negated_flag is 1, the result would be 1 which has >> changed the page flag. It isn't making sense to me. Why the page flag bit >> is being fliped? >> >> When Anrdei had proposed these masks, they seemed like a fancy way of >> filtering inside kernel and it was straight forward to understand. These >> masks would help his use cases for CRIU. So I'd included it. Please can you >> elaborate what is the purpose of negation? > > The XOR is a way to invert the tested value of a flag (from positive > to negative and the other way) without having the API with invalid > values (with required_flags and excluded_flags you need to define a > rule about what happens if a flag is present in both of the masks - > either prioritise one mask over the other or reject the call). At minimum, one mask (required, any or excluded) must be specified. For a page to get selected, the page flags must fulfill the criterion of all the specified masks. If a flag is present in both required_mask and excluded_mask, the required_mask would select a page. But exculded_mask would drop the page. So page page would be dropped. It is responsibility of the user to correctly specify the flags. matched = true; if (p->required_mask) matched = ((p->required_mask & bitmap) == p->required_mask); if (matched && p->anyof_mask) matched = (p->anyof_mask & bitmap); if (matched && p->excluded_mask) matched = !(p->excluded_mask & bitmap); if (matched && bitmap) { // page selected } Do you accept/like this behavior of masks after explaintation? > (Note: the XOR is applied only to the value of the flags for the > purpose of testing page-selection criteria.) > > So: > 1. if a flag is not set in negated_flags, but set in required_flags, > then it means "this flag must be one" - equivalent to it being set in > required_flag (in your current version of the API). > 2. if a flag is set in negated_flags and also in required_flags, then > it means "this flag must be zero" - equivalent to it being set in > excluded_flags. Lets translate words into table: pageflags required_flags negated_flags matched 1 1 0 yes 0 1 1 yes > > The same thing goes for anyof_flags: if a flag is set in anyof_flags, > then for it to be considered matched: > 1. it must have a value of 1 if it is not set in negated_flags > 2. it must have a value of 0 if it is set in negated_flags pageflags anyof_flags negated_flags matched 1 1 0 yes 0 1 1 yes > > BTW, I think I assumed that both conditions (all flags in > required_flags and at least one in anyof_flags is present) need to be > true for the page to be selected - is this your intention? All the masks are optional. If all or any of the 3 masks are specified, the page flags must pass these masks to get selected. > The example > code has a bug though, in that if anyof_flags is zero it will never > match. Let me fix the selection part: > > // calc. a mask of flags that have expected ("active") values > tested_flags = page_flags ^ negated_flags; > // are all required flags in "active" state? [== all zero when negated] > if (~tested_flags & required_mask) > skip page; > // is any extra flag "active"? > if (anyof_flags && !(tested_flags & anyof_flags)) > skip page; > After taking a while to understand this and compare with already present flag system, `negated flags` is comparatively difficult to understand while already present flags seem easier. > > Best Regards > Michał Mirosław
On Wed, 22 Feb 2023 at 11:11, Muhammad Usama Anjum <usama.anjum@collabora.com> wrote: > On 2/21/23 5:42 PM, Michał Mirosław wrote: > > On Tue, 21 Feb 2023 at 11:28, Muhammad Usama Anjum > > <usama.anjum@collabora.com> wrote: > >> > >> Hi Michał, > >> > >> Thank you so much for comment! > >> > >> On 2/17/23 8:18 PM, Michał Mirosław wrote: > > [...] > >>> For the page-selection mechanism, currently required_mask and > >>> excluded_mask have conflicting > >> They are opposite of each other: > >> All the set bits in required_mask must be set for the page to be selected. > >> All the set bits in excluded_mask must _not_ be set for the page to be > >> selected. > >> > >>> responsibilities. I suggest to rework that to: > >>> 1. negated_flags: page flags which are to be negated before applying > >>> the page selection using following masks; > >> Sorry I'm unable to understand the negation (which is XOR?). Lets look at > >> the truth table: > >> Page Flag negated_flags > >> 0 0 0 > >> 0 1 1 > >> 1 0 1 > >> 1 1 0 > >> > >> If a page flag is 0 and negated_flag is 1, the result would be 1 which has > >> changed the page flag. It isn't making sense to me. Why the page flag bit > >> is being fliped? > >> > >> When Anrdei had proposed these masks, they seemed like a fancy way of > >> filtering inside kernel and it was straight forward to understand. These > >> masks would help his use cases for CRIU. So I'd included it. Please can you > >> elaborate what is the purpose of negation? > > > > The XOR is a way to invert the tested value of a flag (from positive > > to negative and the other way) without having the API with invalid > > values (with required_flags and excluded_flags you need to define a > > rule about what happens if a flag is present in both of the masks - > > either prioritise one mask over the other or reject the call). > At minimum, one mask (required, any or excluded) must be specified. For a > page to get selected, the page flags must fulfill the criterion of all the > specified masks. [Please see the comment below.] [...] > Lets translate words into table: [Yes, those tables captured the intent correctly.] > > BTW, I think I assumed that both conditions (all flags in > > required_flags and at least one in anyof_flags is present) need to be > > true for the page to be selected - is this your intention? > All the masks are optional. If all or any of the 3 masks are specified, the > page flags must pass these masks to get selected. This explanation contradicts in part the introductory paragraph, but this version seems more useful as you can pass all masks zero to have all pages selected. > > The example > > code has a bug though, in that if anyof_flags is zero it will never > > match. Let me fix the selection part: > > > > // calc. a mask of flags that have expected ("active") values > > tested_flags = page_flags ^ negated_flags; > > // are all required flags in "active" state? [== all zero when negated] > > if (~tested_flags & required_mask) > > skip page; > > // is any extra flag "active"? > > if (anyof_flags && !(tested_flags & anyof_flags)) > > skip page; > > > After taking a while to understand this and compare with already present > flag system, `negated flags` is comparatively difficult to understand while > already present flags seem easier. Maybe replacing negated_flags in the API with matched_values = ~negated_flags would make this better? We compare having to understand XOR vs having to understand ordering of required_flags and excluded_flags. IOW my proposal is to replace branches in the masks interpretation (if in one set then matches but if in another set then doesn't; if flags match ... ) with plain calculation (flag is matching when equals ~negated_flags; if flags match the masks ...). Best Regards Michał Mirosław
On 2/22/23 3:44 PM, Michał Mirosław wrote: > On Wed, 22 Feb 2023 at 11:11, Muhammad Usama Anjum > <usama.anjum@collabora.com> wrote: >> On 2/21/23 5:42 PM, Michał Mirosław wrote: >>> On Tue, 21 Feb 2023 at 11:28, Muhammad Usama Anjum >>> <usama.anjum@collabora.com> wrote: >>>> >>>> Hi Michał, >>>> >>>> Thank you so much for comment! >>>> >>>> On 2/17/23 8:18 PM, Michał Mirosław wrote: >>> [...] >>>>> For the page-selection mechanism, currently required_mask and >>>>> excluded_mask have conflicting >>>> They are opposite of each other: >>>> All the set bits in required_mask must be set for the page to be selected. >>>> All the set bits in excluded_mask must _not_ be set for the page to be >>>> selected. >>>> >>>>> responsibilities. I suggest to rework that to: >>>>> 1. negated_flags: page flags which are to be negated before applying >>>>> the page selection using following masks; >>>> Sorry I'm unable to understand the negation (which is XOR?). Lets look at >>>> the truth table: >>>> Page Flag negated_flags >>>> 0 0 0 >>>> 0 1 1 >>>> 1 0 1 >>>> 1 1 0 >>>> >>>> If a page flag is 0 and negated_flag is 1, the result would be 1 which has >>>> changed the page flag. It isn't making sense to me. Why the page flag bit >>>> is being fliped? >>>> >>>> When Anrdei had proposed these masks, they seemed like a fancy way of >>>> filtering inside kernel and it was straight forward to understand. These >>>> masks would help his use cases for CRIU. So I'd included it. Please can you >>>> elaborate what is the purpose of negation? >>> >>> The XOR is a way to invert the tested value of a flag (from positive >>> to negative and the other way) without having the API with invalid >>> values (with required_flags and excluded_flags you need to define a >>> rule about what happens if a flag is present in both of the masks - >>> either prioritise one mask over the other or reject the call). >> At minimum, one mask (required, any or excluded) must be specified. For a >> page to get selected, the page flags must fulfill the criterion of all the >> specified masks. > > [Please see the comment below.] > > [...] >> Lets translate words into table: > [Yes, those tables captured the intent correctly.] > >>> BTW, I think I assumed that both conditions (all flags in >>> required_flags and at least one in anyof_flags is present) need to be >>> true for the page to be selected - is this your intention? >> All the masks are optional. If all or any of the 3 masks are specified, the >> page flags must pass these masks to get selected. > > This explanation contradicts in part the introductory paragraph, but > this version seems more useful as you can pass all masks zero to have > all pages selected. Sorry, I wrote it wrongly. (All the masks are not optional.) Let me rephrase. All or at least any 1 of the 3 masks (required, any, exclude) must be specified. The return_mask must always be specified. Error is returned if all 3 masks (required, anyof, exclude) are zero or return_mask is zero. > >>> The example >>> code has a bug though, in that if anyof_flags is zero it will never >>> match. Let me fix the selection part: >>> >>> // calc. a mask of flags that have expected ("active") values >>> tested_flags = page_flags ^ negated_flags; >>> // are all required flags in "active" state? [== all zero when negated] >>> if (~tested_flags & required_mask) >>> skip page; >>> // is any extra flag "active"? >>> if (anyof_flags && !(tested_flags & anyof_flags)) >>> skip page; >>> >> After taking a while to understand this and compare with already present >> flag system, `negated flags` is comparatively difficult to understand while >> already present flags seem easier. > > Maybe replacing negated_flags in the API with matched_values = > ~negated_flags would make this better? > > We compare having to understand XOR vs having to understand ordering > of required_flags and excluded_flags. There is no ordering in current masks scheme. No mask is preferable. For a page to get selected, all the definitions of the masks must be fulfilled. You have come up with good example that what if required_mask = exclude_mask. In this case, no page will fulfill the criterion and hence no page would be selected. It is user's fault that he isn't understanding the definitions of these masks correctly. Now thinking about it, I can add a error check which would return error if a bit in required and excluded masks matches. Would you like it? Lets put this check in place. (Previously I'd left it for user's wisdom not to do this. If he'll specify same masks in them, he'll get no addresses out of the syscall.) > IOW my proposal is to replace branches in the masks interpretation (if > in one set then matches but if in another set then doesn't; if flags > match ... ) with plain calculation (flag is matching when equals > ~negated_flags; if flags match the masks ...). > > Best Regards > Michał Mirosław
On Wed, 22 Feb 2023 at 12:06, Muhammad Usama Anjum <usama.anjum@collabora.com> wrote: > > On 2/22/23 3:44 PM, Michał Mirosław wrote: > > On Wed, 22 Feb 2023 at 11:11, Muhammad Usama Anjum > > <usama.anjum@collabora.com> wrote: > >> On 2/21/23 5:42 PM, Michał Mirosław wrote: > >>> On Tue, 21 Feb 2023 at 11:28, Muhammad Usama Anjum > >>> <usama.anjum@collabora.com> wrote: > >>>> > >>>> Hi Michał, > >>>> > >>>> Thank you so much for comment! > >>>> > >>>> On 2/17/23 8:18 PM, Michał Mirosław wrote: > >>> [...] > >>>>> For the page-selection mechanism, currently required_mask and > >>>>> excluded_mask have conflicting > >>>> They are opposite of each other: > >>>> All the set bits in required_mask must be set for the page to be selected. > >>>> All the set bits in excluded_mask must _not_ be set for the page to be > >>>> selected. > >>>> > >>>>> responsibilities. I suggest to rework that to: > >>>>> 1. negated_flags: page flags which are to be negated before applying > >>>>> the page selection using following masks; > >>>> Sorry I'm unable to understand the negation (which is XOR?). Lets look at > >>>> the truth table: > >>>> Page Flag negated_flags > >>>> 0 0 0 > >>>> 0 1 1 > >>>> 1 0 1 > >>>> 1 1 0 > >>>> > >>>> If a page flag is 0 and negated_flag is 1, the result would be 1 which has > >>>> changed the page flag. It isn't making sense to me. Why the page flag bit > >>>> is being fliped? > >>>> > >>>> When Anrdei had proposed these masks, they seemed like a fancy way of > >>>> filtering inside kernel and it was straight forward to understand. These > >>>> masks would help his use cases for CRIU. So I'd included it. Please can you > >>>> elaborate what is the purpose of negation? > >>> > >>> The XOR is a way to invert the tested value of a flag (from positive > >>> to negative and the other way) without having the API with invalid > >>> values (with required_flags and excluded_flags you need to define a > >>> rule about what happens if a flag is present in both of the masks - > >>> either prioritise one mask over the other or reject the call). > >> At minimum, one mask (required, any or excluded) must be specified. For a > >> page to get selected, the page flags must fulfill the criterion of all the > >> specified masks. > > > > [Please see the comment below.] > > > > [...] > >> Lets translate words into table: > > [Yes, those tables captured the intent correctly.] > > > >>> BTW, I think I assumed that both conditions (all flags in > >>> required_flags and at least one in anyof_flags is present) need to be > >>> true for the page to be selected - is this your intention? > >> All the masks are optional. If all or any of the 3 masks are specified, the > >> page flags must pass these masks to get selected. > > > > This explanation contradicts in part the introductory paragraph, but > > this version seems more useful as you can pass all masks zero to have > > all pages selected. > Sorry, I wrote it wrongly. (All the masks are not optional.) Let me > rephrase. All or at least any 1 of the 3 masks (required, any, exclude) > must be specified. The return_mask must always be specified. Error is > returned if all 3 masks (required, anyof, exclude) are zero or return_mask > is zero. Why do you need those restrictions? I'd guess it is valid to request a list of all pages with zero return_mask - this will return a compact list of used ranges of the virtual address space. > >> After taking a while to understand this and compare with already present > >> flag system, `negated flags` is comparatively difficult to understand while > >> already present flags seem easier. > > > > Maybe replacing negated_flags in the API with matched_values = > > ~negated_flags would make this better? > > > > We compare having to understand XOR vs having to understand ordering > > of required_flags and excluded_flags. > There is no ordering in current masks scheme. No mask is preferable. For a > page to get selected, all the definitions of the masks must be fulfilled. > You have come up with good example that what if required_mask = > exclude_mask. In this case, no page will fulfill the criterion and hence no > page would be selected. It is user's fault that he isn't understanding the > definitions of these masks correctly. > > Now thinking about it, I can add a error check which would return error if > a bit in required and excluded masks matches. Would you like it? Lets put > this check in place. > (Previously I'd left it for user's wisdom not to do this. If he'll specify > same masks in them, he'll get no addresses out of the syscall.) This error case is (one of) the problems I propose avoiding. You also need much more text to describe the requred/excluded flags interactions and edge cases than saying that a flag must have a value equal to corresponding bit in ~negated_flags to be matched by requried/anyof masks. > > IOW my proposal is to replace branches in the masks interpretation (if > > in one set then matches but if in another set then doesn't; if flags > > match ... ) with plain calculation (flag is matching when equals > > ~negated_flags; if flags match the masks ...). Best Regards Michał Mirosław
> On Feb 20, 2023, at 5:24 AM, Muhammad Usama Anjum <usama.anjum@collabora.com> wrote: > >>> +static inline int pagemap_scan_pmd_entry(pmd_t *pmd, unsigned long start, >>> + unsigned long end, struct mm_walk *walk) >>> +{ >>> + struct pagemap_scan_private *p = walk->private; >>> + struct vm_area_struct *vma = walk->vma; >>> + unsigned long addr = end; >>> + spinlock_t *ptl; >>> + int ret = 0; >>> + pte_t *pte; >>> + >>> +#ifdef CONFIG_TRANSPARENT_HUGEPAGE >>> + ptl = pmd_trans_huge_lock(pmd, vma); >>> + if (ptl) { >>> + bool pmd_wt; >>> + >>> + pmd_wt = !is_pmd_uffd_wp(*pmd); >>> + /* >>> + * Break huge page into small pages if operation needs to be >>> performed is >>> + * on a portion of the huge page. >>> + */ >>> + if (pmd_wt && IS_WP_ENGAGE_OP(p) && (end - start < HPAGE_SIZE)) { >>> + spin_unlock(ptl); >>> + split_huge_pmd(vma, pmd, start); >>> + goto process_smaller_pages; >> I think that such goto's are really confusing and should be avoided. And >> using 'else' (could have easily prevented the need for goto). It is not the >> best solution though, since I think it would have been better to invert the >> conditions. > Yeah, else can be used here. But then we'll have to add a tab to all the > code after adding else. We have already so many tabs and very less space to > right code. Not sure which is better. goto’s are usually not the right solution. You can extract things into a different function if you have to. I’m not sure why IS_GET_OP(p) might be false and what’s the meaning of taking the lock and dropping it in such a case. I think that the code can be simplified and additional condition nesting can be avoided. >>> --- a/include/uapi/linux/fs.h >>> +++ b/include/uapi/linux/fs.h >>> @@ -305,4 +305,54 @@ typedef int __bitwise __kernel_rwf_t; >>> #define RWF_SUPPORTED (RWF_HIPRI | RWF_DSYNC | RWF_SYNC | RWF_NOWAIT |\ >>> RWF_APPEND) >>> +/* Pagemap ioctl */ >>> +#define PAGEMAP_SCAN _IOWR('f', 16, struct pagemap_scan_arg) >>> + >>> +/* Bits are set in the bitmap of the page_region and masks in >>> pagemap_scan_args */ >>> +#define PAGE_IS_WRITTEN (1 << 0) >>> +#define PAGE_IS_FILE (1 << 1) >>> +#define PAGE_IS_PRESENT (1 << 2) >>> +#define PAGE_IS_SWAPPED (1 << 3) >> >> These names are way too generic and are likely to be misused for the wrong >> purpose. The "_IS_" part seems confusing as well. So I think the naming >> needs to be fixed and some new type (using typedef) or enum should be >> introduced to hold these flags. I understand it is part of uapi and it is >> less common there, but it is not unheard of and does make things clearer. > Do you think PM_SCAN_PAGE_IS_* work here? Can we lose the IS somehow? > >> >> >>> + >>> +/* >>> + * struct page_region - Page region with bitmap flags >>> + * @start: Start of the region >>> + * @len: Length of the region >>> + * bitmap: Bits sets for the region >>> + */ >>> +struct page_region { >>> + __u64 start; >>> + __u64 len; >> >> I presume in bytes. Would be useful to mention. > Length of region in pages. Very unintuitive to me I must say. If the start is an address, I would expect the len to be in bytes.
On 2/22/23 4:48 PM, Michał Mirosław wrote: > On Wed, 22 Feb 2023 at 12:06, Muhammad Usama Anjum > <usama.anjum@collabora.com> wrote: >> >> On 2/22/23 3:44 PM, Michał Mirosław wrote: >>> On Wed, 22 Feb 2023 at 11:11, Muhammad Usama Anjum >>> <usama.anjum@collabora.com> wrote: >>>> On 2/21/23 5:42 PM, Michał Mirosław wrote: >>>>> On Tue, 21 Feb 2023 at 11:28, Muhammad Usama Anjum >>>>> <usama.anjum@collabora.com> wrote: >>>>>> >>>>>> Hi Michał, >>>>>> >>>>>> Thank you so much for comment! >>>>>> >>>>>> On 2/17/23 8:18 PM, Michał Mirosław wrote: >>>>> [...] >>>>>>> For the page-selection mechanism, currently required_mask and >>>>>>> excluded_mask have conflicting >>>>>> They are opposite of each other: >>>>>> All the set bits in required_mask must be set for the page to be selected. >>>>>> All the set bits in excluded_mask must _not_ be set for the page to be >>>>>> selected. >>>>>> >>>>>>> responsibilities. I suggest to rework that to: >>>>>>> 1. negated_flags: page flags which are to be negated before applying >>>>>>> the page selection using following masks; >>>>>> Sorry I'm unable to understand the negation (which is XOR?). Lets look at >>>>>> the truth table: >>>>>> Page Flag negated_flags >>>>>> 0 0 0 >>>>>> 0 1 1 >>>>>> 1 0 1 >>>>>> 1 1 0 >>>>>> >>>>>> If a page flag is 0 and negated_flag is 1, the result would be 1 which has >>>>>> changed the page flag. It isn't making sense to me. Why the page flag bit >>>>>> is being fliped? >>>>>> >>>>>> When Anrdei had proposed these masks, they seemed like a fancy way of >>>>>> filtering inside kernel and it was straight forward to understand. These >>>>>> masks would help his use cases for CRIU. So I'd included it. Please can you >>>>>> elaborate what is the purpose of negation? >>>>> >>>>> The XOR is a way to invert the tested value of a flag (from positive >>>>> to negative and the other way) without having the API with invalid >>>>> values (with required_flags and excluded_flags you need to define a >>>>> rule about what happens if a flag is present in both of the masks - >>>>> either prioritise one mask over the other or reject the call). >>>> At minimum, one mask (required, any or excluded) must be specified. For a >>>> page to get selected, the page flags must fulfill the criterion of all the >>>> specified masks. >>> >>> [Please see the comment below.] >>> >>> [...] >>>> Lets translate words into table: >>> [Yes, those tables captured the intent correctly.] >>> >>>>> BTW, I think I assumed that both conditions (all flags in >>>>> required_flags and at least one in anyof_flags is present) need to be >>>>> true for the page to be selected - is this your intention? >>>> All the masks are optional. If all or any of the 3 masks are specified, the >>>> page flags must pass these masks to get selected. >>> >>> This explanation contradicts in part the introductory paragraph, but >>> this version seems more useful as you can pass all masks zero to have >>> all pages selected. >> Sorry, I wrote it wrongly. (All the masks are not optional.) Let me >> rephrase. All or at least any 1 of the 3 masks (required, any, exclude) >> must be specified. The return_mask must always be specified. Error is >> returned if all 3 masks (required, anyof, exclude) are zero or return_mask >> is zero. > > Why do you need those restrictions? I'd guess it is valid to request a > list of all pages with zero return_mask - this will return a compact > list of used ranges of the virtual address space. At the time, we are supporting 4 flags (PAGE_IS_WRITTEN, PAGE_IS_FILE, PAGE_IS_PRESENT and PAGE_IS_SWAPPED). The idea is that user mention his flags of interest in the return_mask. If he wants only 1 flag, he'll specify it. Definitely if user wants only 1 flag, initially it doesn't make any sense to mention in the return mask. But we want uniformity. If user want, 2 or more flags in returned, return_mask becomes compulsory. So to keep things simple and generic for any number of flags of interest returned, the return_mask must be specified even if the flag of interest is only 1. > >>>> After taking a while to understand this and compare with already present >>>> flag system, `negated flags` is comparatively difficult to understand while >>>> already present flags seem easier. >>> >>> Maybe replacing negated_flags in the API with matched_values = >>> ~negated_flags would make this better? >>> >>> We compare having to understand XOR vs having to understand ordering >>> of required_flags and excluded_flags. >> There is no ordering in current masks scheme. No mask is preferable. For a >> page to get selected, all the definitions of the masks must be fulfilled. >> You have come up with good example that what if required_mask = >> exclude_mask. In this case, no page will fulfill the criterion and hence no >> page would be selected. It is user's fault that he isn't understanding the >> definitions of these masks correctly. >> >> Now thinking about it, I can add a error check which would return error if >> a bit in required and excluded masks matches. Would you like it? Lets put >> this check in place. >> (Previously I'd left it for user's wisdom not to do this. If he'll specify >> same masks in them, he'll get no addresses out of the syscall.) > > This error case is (one of) the problems I propose avoiding. You also > need much more text to describe the requred/excluded flags > interactions and edge cases than saying that a flag must have a value > equal to corresponding bit in ~negated_flags to be matched by > requried/anyof masks. I've found excluded_mask very intuitive as compared to negated_mask which is so difficult to understand that I don't know how to use it correctly. Lets take an example, I want pages which are PAGE_IS_WRITTEN and are not PAGE_IS_FILE. In addition, the pages must be PAGE_IS_PRESENT or PAGE_IS_SWAPPED. This can be specified as: required_mask = PAGE_IS_WRITTEN excluded_mask = PAGE_IS_FILE anyof_mask = PAGE_IS_PRESETNT | PAGE_IS_SWAP (a) assume page_flags = 0b1111 skip page as 0b1111 & 0b0010 = true (b) assume page_flags = 0b1001 select page as 0b1001 & 0b0010 = false It seemed intuitive. Right? How would you achieve same thing with negated_mask? required_mask = PAGE_IS_WRITTEN negated_mask = PAGE_IS_FILE anyof_mask = PAGE_IS_PRESETNT | PAGE_IS_SWAP (1) assume page_flags = 0b1111 tested_flags = 0b1111 ^ 0b0010 = 0b1101 (2) assume page_flags = 0b1001 tested_flags = 0b1001 ^ 0b0010 = 0b1011 In (1), we wanted to skip pages which have PAGE_IS_FILE set. But negated_mask has just masked it and page is still getting tested if it should be selected and it would get selected. It is wrong. In (2), the PAGE_IS_FILE bit of page_flags was 0 and got updated to 1 or PAGE_IS_FILE in tested_flags. > >>> IOW my proposal is to replace branches in the masks interpretation (if >>> in one set then matches but if in another set then doesn't; if flags >>> match ... ) with plain calculation (flag is matching when equals >>> ~negated_flags; if flags match the masks ...). > > Best Regards > Michał Mirosław
Hi Nadav, Mike, Michał, Can you please share your thoughts at [A] below? On 2/23/23 12:10 AM, Nadav Amit wrote: > > >> On Feb 20, 2023, at 5:24 AM, Muhammad Usama Anjum <usama.anjum@collabora.com> wrote: >> >>>> +static inline int pagemap_scan_pmd_entry(pmd_t *pmd, unsigned long start, >>>> + unsigned long end, struct mm_walk *walk) >>>> +{ >>>> + struct pagemap_scan_private *p = walk->private; >>>> + struct vm_area_struct *vma = walk->vma; >>>> + unsigned long addr = end; >>>> + spinlock_t *ptl; >>>> + int ret = 0; >>>> + pte_t *pte; >>>> + >>>> +#ifdef CONFIG_TRANSPARENT_HUGEPAGE >>>> + ptl = pmd_trans_huge_lock(pmd, vma); >>>> + if (ptl) { >>>> + bool pmd_wt; >>>> + >>>> + pmd_wt = !is_pmd_uffd_wp(*pmd); >>>> + /* >>>> + * Break huge page into small pages if operation needs to be >>>> performed is >>>> + * on a portion of the huge page. >>>> + */ >>>> + if (pmd_wt && IS_WP_ENGAGE_OP(p) && (end - start < HPAGE_SIZE)) { >>>> + spin_unlock(ptl); >>>> + split_huge_pmd(vma, pmd, start); >>>> + goto process_smaller_pages; >>> I think that such goto's are really confusing and should be avoided. And >>> using 'else' (could have easily prevented the need for goto). It is not the >>> best solution though, since I think it would have been better to invert the >>> conditions. >> Yeah, else can be used here. But then we'll have to add a tab to all the >> code after adding else. We have already so many tabs and very less space to >> right code. Not sure which is better. > > goto’s are usually not the right solution. You can extract things into a different > function if you have to. > > I’m not sure why IS_GET_OP(p) might be false and what’s the meaning of taking the > lock and dropping it in such a case. I think that the code can be simplified and > additional condition nesting can be avoided. Lock is taken and we check if pmd has UFFD_WP set or not. In the next version, the GET check has been removed as we have dropped WP_ENGAGE + !GET operation. So get is always specified and condition isn't needed. Please comment on next version if you want anything more optimized. > >>>> --- a/include/uapi/linux/fs.h >>>> +++ b/include/uapi/linux/fs.h >>>> @@ -305,4 +305,54 @@ typedef int __bitwise __kernel_rwf_t; >>>> #define RWF_SUPPORTED (RWF_HIPRI | RWF_DSYNC | RWF_SYNC | RWF_NOWAIT |\ >>>> RWF_APPEND) >>>> +/* Pagemap ioctl */ >>>> +#define PAGEMAP_SCAN _IOWR('f', 16, struct pagemap_scan_arg) >>>> + >>>> +/* Bits are set in the bitmap of the page_region and masks in >>>> pagemap_scan_args */ >>>> +#define PAGE_IS_WRITTEN (1 << 0) >>>> +#define PAGE_IS_FILE (1 << 1) >>>> +#define PAGE_IS_PRESENT (1 << 2) >>>> +#define PAGE_IS_SWAPPED (1 << 3) >>> >>> These names are way too generic and are likely to be misused for the wrong >>> purpose. The "_IS_" part seems confusing as well. So I think the naming >>> needs to be fixed and some new type (using typedef) or enum should be >>> introduced to hold these flags. I understand it is part of uapi and it is >>> less common there, but it is not unheard of and does make things clearer. >> Do you think PM_SCAN_PAGE_IS_* work here? > > Can we lose the IS somehow? [A] Do you think these names would work better: PM_SCAN_WRITTEN_PAGE, PM_SCAN_FILE_PAGE, PM_SCAN_SWAP_PAGE, PM_SCAN_PRESENT_PAGE? > >> >>> >>> >>>> + >>>> +/* >>>> + * struct page_region - Page region with bitmap flags >>>> + * @start: Start of the region >>>> + * @len: Length of the region >>>> + * bitmap: Bits sets for the region >>>> + */ >>>> +struct page_region { >>>> + __u64 start; >>>> + __u64 len; >>> >>> I presume in bytes. Would be useful to mention. >> Length of region in pages. > > Very unintuitive to me I must say. If the start is an address, I would expect > the len to be in bytes. The PAGEMAP_SCAN ioctl is working on page granularity level. We tell the user if a page has certain flags are not. Keeping length in bytes doesn't makes sense. >
On Thu, 23 Feb 2023 at 07:44, Muhammad Usama Anjum <usama.anjum@collabora.com> wrote: > > On 2/22/23 4:48 PM, Michał Mirosław wrote: > > On Wed, 22 Feb 2023 at 12:06, Muhammad Usama Anjum > > <usama.anjum@collabora.com> wrote: [...] > >>>>> BTW, I think I assumed that both conditions (all flags in > >>>>> required_flags and at least one in anyof_flags is present) need to be > >>>>> true for the page to be selected - is this your intention? > >>>> All the masks are optional. If all or any of the 3 masks are specified, the > >>>> page flags must pass these masks to get selected. > >>> > >>> This explanation contradicts in part the introductory paragraph, but > >>> this version seems more useful as you can pass all masks zero to have > >>> all pages selected. > >> Sorry, I wrote it wrongly. (All the masks are not optional.) Let me > >> rephrase. All or at least any 1 of the 3 masks (required, any, exclude) > >> must be specified. The return_mask must always be specified. Error is > >> returned if all 3 masks (required, anyof, exclude) are zero or return_mask > >> is zero. > > > > Why do you need those restrictions? I'd guess it is valid to request a > > list of all pages with zero return_mask - this will return a compact > > list of used ranges of the virtual address space. > At the time, we are supporting 4 flags (PAGE_IS_WRITTEN, PAGE_IS_FILE, > PAGE_IS_PRESENT and PAGE_IS_SWAPPED). The idea is that user mention his > flags of interest in the return_mask. If he wants only 1 flag, he'll > specify it. Definitely if user wants only 1 flag, initially it doesn't make > any sense to mention in the return mask. But we want uniformity. If user > want, 2 or more flags in returned, return_mask becomes compulsory. So to > keep things simple and generic for any number of flags of interest > returned, the return_mask must be specified even if the flag of interest is > only 1. I'm not sure why do we want uniformity in the case of 1 flag? If a user specifies a single required flag, I'd expect he doesn't need to look at the flags returned as those will duplicate the information from mere presence of a page. A user might also require a single flag, but want all of them returned. Both requests - return 1 flag and return 0 flags would give meaningful output, so why force one way or the other? Allowing two will also enable users to express the intent: they need either just a list of pages, or they need a list with per-page flags - the need would follow from the code structure or other factors. > >>>> After taking a while to understand this and compare with already present > >>>> flag system, `negated flags` is comparatively difficult to understand while > >>>> already present flags seem easier. > >>> > >>> Maybe replacing negated_flags in the API with matched_values = > >>> ~negated_flags would make this better? > >>> > >>> We compare having to understand XOR vs having to understand ordering > >>> of required_flags and excluded_flags. > >> There is no ordering in current masks scheme. No mask is preferable. For a > >> page to get selected, all the definitions of the masks must be fulfilled. > >> You have come up with good example that what if required_mask = > >> exclude_mask. In this case, no page will fulfill the criterion and hence no > >> page would be selected. It is user's fault that he isn't understanding the > >> definitions of these masks correctly. > >> > >> Now thinking about it, I can add a error check which would return error if > >> a bit in required and excluded masks matches. Would you like it? Lets put > >> this check in place. > >> (Previously I'd left it for user's wisdom not to do this. If he'll specify > >> same masks in them, he'll get no addresses out of the syscall.) > > > > This error case is (one of) the problems I propose avoiding. You also > > need much more text to describe the requred/excluded flags > > interactions and edge cases than saying that a flag must have a value > > equal to corresponding bit in ~negated_flags to be matched by > > requried/anyof masks. > I've found excluded_mask very intuitive as compared to negated_mask which > is so difficult to understand that I don't know how to use it correctly. > Lets take an example, I want pages which are PAGE_IS_WRITTEN and are not > PAGE_IS_FILE. In addition, the pages must be PAGE_IS_PRESENT or > PAGE_IS_SWAPPED. This can be specified as: > > required_mask = PAGE_IS_WRITTEN > excluded_mask = PAGE_IS_FILE > anyof_mask = PAGE_IS_PRESETNT | PAGE_IS_SWAP > > (a) assume page_flags = 0b1111 > skip page as 0b1111 & 0b0010 = true > > (b) assume page_flags = 0b1001 > select page as 0b1001 & 0b0010 = false > > It seemed intuitive. Right? How would you achieve same thing with negated_mask? > > required_mask = PAGE_IS_WRITTEN > negated_mask = PAGE_IS_FILE > anyof_mask = PAGE_IS_PRESETNT | PAGE_IS_SWAP > > (1) assume page_flags = 0b1111 > tested_flags = 0b1111 ^ 0b0010 = 0b1101 > > (2) assume page_flags = 0b1001 > tested_flags = 0b1001 ^ 0b0010 = 0b1011 > > In (1), we wanted to skip pages which have PAGE_IS_FILE set. But > negated_mask has just masked it and page is still getting tested if it > should be selected and it would get selected. It is wrong. > > In (2), the PAGE_IS_FILE bit of page_flags was 0 and got updated to 1 or > PAGE_IS_FILE in tested_flags. I require flags PAGE_IS_WRITTEN=1, PAGE_IS_FILE=0, so: required_mask = PAGE_IS_WRITTEN | PAGE_IS_FILE; negated_flags = PAGE_IS_FILE; // flags I want zero I also require one of PAGE_IS_PRESENT=1 or PAGE_IS_SWAP=1, so: anyof_mask = PAGE_IS_PRESENT | PAGE_IS_SWAP; Another case: I want to analyse a process' working set: required_mask = 0; negated_flags = PAGE_IS_FILE; anyof_mask = PAGE_IS_FILE | PAGE_IS_WRITTEN; -> gathering pages modified [WRITTEN=1] or not backed by a file [FILE=0]. To clarify a bit: negated_flags doesn't mask anything: the field inverts values of the flags (marks some "active low", if you consider electronic signal analogy). Best Regards Michał Mirosław
On 2/23/23 1:41 PM, Michał Mirosław wrote: > On Thu, 23 Feb 2023 at 07:44, Muhammad Usama Anjum > <usama.anjum@collabora.com> wrote: >> >> On 2/22/23 4:48 PM, Michał Mirosław wrote: >>> On Wed, 22 Feb 2023 at 12:06, Muhammad Usama Anjum >>> <usama.anjum@collabora.com> wrote: > [...] >>>>>>> BTW, I think I assumed that both conditions (all flags in >>>>>>> required_flags and at least one in anyof_flags is present) need to be >>>>>>> true for the page to be selected - is this your intention? >>>>>> All the masks are optional. If all or any of the 3 masks are specified, the >>>>>> page flags must pass these masks to get selected. >>>>> >>>>> This explanation contradicts in part the introductory paragraph, but >>>>> this version seems more useful as you can pass all masks zero to have >>>>> all pages selected. >>>> Sorry, I wrote it wrongly. (All the masks are not optional.) Let me >>>> rephrase. All or at least any 1 of the 3 masks (required, any, exclude) >>>> must be specified. The return_mask must always be specified. Error is >>>> returned if all 3 masks (required, anyof, exclude) are zero or return_mask >>>> is zero. >>> >>> Why do you need those restrictions? I'd guess it is valid to request a >>> list of all pages with zero return_mask - this will return a compact >>> list of used ranges of the virtual address space. >> At the time, we are supporting 4 flags (PAGE_IS_WRITTEN, PAGE_IS_FILE, >> PAGE_IS_PRESENT and PAGE_IS_SWAPPED). The idea is that user mention his >> flags of interest in the return_mask. If he wants only 1 flag, he'll >> specify it. Definitely if user wants only 1 flag, initially it doesn't make >> any sense to mention in the return mask. But we want uniformity. If user >> want, 2 or more flags in returned, return_mask becomes compulsory. So to >> keep things simple and generic for any number of flags of interest >> returned, the return_mask must be specified even if the flag of interest is >> only 1. > > I'm not sure why do we want uniformity in the case of 1 flag? If a > user specifies a single required flag, I'd expect he doesn't need to > look at the flags returned as those will duplicate the information > from mere presence of a page. A user might also require a single flag, > but want all of them returned. Both requests - return 1 flag and > return 0 flags would give meaningful output, so why force one way or > the other? Allowing two will also enable users to express the intent: > they need either just a list of pages, or they need a list with > per-page flags - the need would follow from the code structure or > other factors. We can add as much flexibility as much people ask by keeping code simple. But it is going to be dirty to add error check which detects if return_mask = 0 and if there is only 1 flag of interest mentioned by the user. The following mentioned error check is essential to return deterministic output. Do you think this case is worth it to support and we don't want to go with the generality for both 1 or more flag cases? if (return_mask == 0 && hweight_long(required_mask | any_mask) != 1) return error; > >>>>>> After taking a while to understand this and compare with already present >>>>>> flag system, `negated flags` is comparatively difficult to understand while >>>>>> already present flags seem easier. >>>>> >>>>> Maybe replacing negated_flags in the API with matched_values = >>>>> ~negated_flags would make this better? >>>>> >>>>> We compare having to understand XOR vs having to understand ordering >>>>> of required_flags and excluded_flags. >>>> There is no ordering in current masks scheme. No mask is preferable. For a >>>> page to get selected, all the definitions of the masks must be fulfilled. >>>> You have come up with good example that what if required_mask = >>>> exclude_mask. In this case, no page will fulfill the criterion and hence no >>>> page would be selected. It is user's fault that he isn't understanding the >>>> definitions of these masks correctly. >>>> >>>> Now thinking about it, I can add a error check which would return error if >>>> a bit in required and excluded masks matches. Would you like it? Lets put >>>> this check in place. >>>> (Previously I'd left it for user's wisdom not to do this. If he'll specify >>>> same masks in them, he'll get no addresses out of the syscall.) >>> >>> This error case is (one of) the problems I propose avoiding. You also >>> need much more text to describe the requred/excluded flags >>> interactions and edge cases than saying that a flag must have a value >>> equal to corresponding bit in ~negated_flags to be matched by >>> requried/anyof masks. >> I've found excluded_mask very intuitive as compared to negated_mask which >> is so difficult to understand that I don't know how to use it correctly. >> Lets take an example, I want pages which are PAGE_IS_WRITTEN and are not >> PAGE_IS_FILE. In addition, the pages must be PAGE_IS_PRESENT or >> PAGE_IS_SWAPPED. This can be specified as: >> >> required_mask = PAGE_IS_WRITTEN >> excluded_mask = PAGE_IS_FILE >> anyof_mask = PAGE_IS_PRESETNT | PAGE_IS_SWAP >> >> (a) assume page_flags = 0b1111 >> skip page as 0b1111 & 0b0010 = true >> >> (b) assume page_flags = 0b1001 >> select page as 0b1001 & 0b0010 = false >> >> It seemed intuitive. Right? How would you achieve same thing with negated_mask? >> >> required_mask = PAGE_IS_WRITTEN >> negated_mask = PAGE_IS_FILE >> anyof_mask = PAGE_IS_PRESETNT | PAGE_IS_SWAP >> >> (1) assume page_flags = 0b1111 >> tested_flags = 0b1111 ^ 0b0010 = 0b1101 >> >> (2) assume page_flags = 0b1001 >> tested_flags = 0b1001 ^ 0b0010 = 0b1011 >> >> In (1), we wanted to skip pages which have PAGE_IS_FILE set. But >> negated_mask has just masked it and page is still getting tested if it >> should be selected and it would get selected. It is wrong. >> >> In (2), the PAGE_IS_FILE bit of page_flags was 0 and got updated to 1 or >> PAGE_IS_FILE in tested_flags. > > I require flags PAGE_IS_WRITTEN=1, PAGE_IS_FILE=0, so: > > required_mask = PAGE_IS_WRITTEN | PAGE_IS_FILE; > negated_flags = PAGE_IS_FILE; // flags I want zero You want PAGE_IS_FILE to be zero and at the same time you are requiring the PAGE_IS_FILE. It is confusing. Lets go with excluded mask and excluded_mask must never have any bit matching with required_mask. Lets stay with this as it is intuitive and would be easy to use from the user's perspective. Andrei and Danylo had suggested these mask scheme and have use cases for this. Andrei and Danylo can please comment as well. > > I also require one of PAGE_IS_PRESENT=1 or PAGE_IS_SWAP=1, so: > > anyof_mask = PAGE_IS_PRESENT | PAGE_IS_SWAP; > > Another case: I want to analyse a process' working set: > > required_mask = 0; > negated_flags = PAGE_IS_FILE; > anyof_mask = PAGE_IS_FILE | PAGE_IS_WRITTEN; > > -> gathering pages modified [WRITTEN=1] or not backed by a file [FILE=0]. > > To clarify a bit: negated_flags doesn't mask anything: the field > inverts values of the flags (marks some "active low", if you consider > electronic signal analogy). > > Best Regards > Michał Mirosław
On Thu, 23 Feb 2023 at 10:23, Muhammad Usama Anjum <usama.anjum@collabora.com> wrote: > > On 2/23/23 1:41 PM, Michał Mirosław wrote: > > On Thu, 23 Feb 2023 at 07:44, Muhammad Usama Anjum > > <usama.anjum@collabora.com> wrote: > >> > >> On 2/22/23 4:48 PM, Michał Mirosław wrote: > >>> On Wed, 22 Feb 2023 at 12:06, Muhammad Usama Anjum > >>> <usama.anjum@collabora.com> wrote: > > [...] > >>>>>>> BTW, I think I assumed that both conditions (all flags in > >>>>>>> required_flags and at least one in anyof_flags is present) need to be > >>>>>>> true for the page to be selected - is this your intention? > >>>>>> All the masks are optional. If all or any of the 3 masks are specified, the > >>>>>> page flags must pass these masks to get selected. > >>>>> > >>>>> This explanation contradicts in part the introductory paragraph, but > >>>>> this version seems more useful as you can pass all masks zero to have > >>>>> all pages selected. > >>>> Sorry, I wrote it wrongly. (All the masks are not optional.) Let me > >>>> rephrase. All or at least any 1 of the 3 masks (required, any, exclude) > >>>> must be specified. The return_mask must always be specified. Error is > >>>> returned if all 3 masks (required, anyof, exclude) are zero or return_mask > >>>> is zero. > >>> > >>> Why do you need those restrictions? I'd guess it is valid to request a > >>> list of all pages with zero return_mask - this will return a compact > >>> list of used ranges of the virtual address space. > >> At the time, we are supporting 4 flags (PAGE_IS_WRITTEN, PAGE_IS_FILE, > >> PAGE_IS_PRESENT and PAGE_IS_SWAPPED). The idea is that user mention his > >> flags of interest in the return_mask. If he wants only 1 flag, he'll > >> specify it. Definitely if user wants only 1 flag, initially it doesn't make > >> any sense to mention in the return mask. But we want uniformity. If user > >> want, 2 or more flags in returned, return_mask becomes compulsory. So to > >> keep things simple and generic for any number of flags of interest > >> returned, the return_mask must be specified even if the flag of interest is > >> only 1. > > > > I'm not sure why do we want uniformity in the case of 1 flag? If a > > user specifies a single required flag, I'd expect he doesn't need to > > look at the flags returned as those will duplicate the information > > from mere presence of a page. A user might also require a single flag, > > but want all of them returned. Both requests - return 1 flag and > > return 0 flags would give meaningful output, so why force one way or > > the other? Allowing two will also enable users to express the intent: > > they need either just a list of pages, or they need a list with > > per-page flags - the need would follow from the code structure or > > other factors. > We can add as much flexibility as much people ask by keeping code simple. > But it is going to be dirty to add error check which detects if return_mask > = 0 and if there is only 1 flag of interest mentioned by the user. The > following mentioned error check is essential to return deterministic > output. Do you think this case is worth it to support and we don't want to > go with the generality for both 1 or more flag cases? > > if (return_mask == 0 && hweight_long(required_mask | any_mask) != 1) > return error; Why would you want to add this error check? If a user requires multiple flags but cares only about a list of matching pages, then it would be natural to express this intent as return_mask = 0. > >>>>>> After taking a while to understand this and compare with already present > >>>>>> flag system, `negated flags` is comparatively difficult to understand while > >>>>>> already present flags seem easier. > >>>>> > >>>>> Maybe replacing negated_flags in the API with matched_values = > >>>>> ~negated_flags would make this better? > >>>>> > >>>>> We compare having to understand XOR vs having to understand ordering > >>>>> of required_flags and excluded_flags. > >>>> There is no ordering in current masks scheme. No mask is preferable. For a > >>>> page to get selected, all the definitions of the masks must be fulfilled. > >>>> You have come up with good example that what if required_mask = > >>>> exclude_mask. In this case, no page will fulfill the criterion and hence no > >>>> page would be selected. It is user's fault that he isn't understanding the > >>>> definitions of these masks correctly. > >>>> > >>>> Now thinking about it, I can add a error check which would return error if > >>>> a bit in required and excluded masks matches. Would you like it? Lets put > >>>> this check in place. > >>>> (Previously I'd left it for user's wisdom not to do this. If he'll specify > >>>> same masks in them, he'll get no addresses out of the syscall.) > >>> > >>> This error case is (one of) the problems I propose avoiding. You also > >>> need much more text to describe the requred/excluded flags > >>> interactions and edge cases than saying that a flag must have a value > >>> equal to corresponding bit in ~negated_flags to be matched by > >>> requried/anyof masks. > >> I've found excluded_mask very intuitive as compared to negated_mask which > >> is so difficult to understand that I don't know how to use it correctly. > >> Lets take an example, I want pages which are PAGE_IS_WRITTEN and are not > >> PAGE_IS_FILE. In addition, the pages must be PAGE_IS_PRESENT or > >> PAGE_IS_SWAPPED. This can be specified as: > >> > >> required_mask = PAGE_IS_WRITTEN > >> excluded_mask = PAGE_IS_FILE > >> anyof_mask = PAGE_IS_PRESETNT | PAGE_IS_SWAP > >> > >> (a) assume page_flags = 0b1111 > >> skip page as 0b1111 & 0b0010 = true > >> > >> (b) assume page_flags = 0b1001 > >> select page as 0b1001 & 0b0010 = false > >> > >> It seemed intuitive. Right? How would you achieve same thing with negated_mask? > >> > >> required_mask = PAGE_IS_WRITTEN > >> negated_mask = PAGE_IS_FILE > >> anyof_mask = PAGE_IS_PRESETNT | PAGE_IS_SWAP > >> > >> (1) assume page_flags = 0b1111 > >> tested_flags = 0b1111 ^ 0b0010 = 0b1101 > >> > >> (2) assume page_flags = 0b1001 > >> tested_flags = 0b1001 ^ 0b0010 = 0b1011 > >> > >> In (1), we wanted to skip pages which have PAGE_IS_FILE set. But > >> negated_mask has just masked it and page is still getting tested if it > >> should be selected and it would get selected. It is wrong. > >> > >> In (2), the PAGE_IS_FILE bit of page_flags was 0 and got updated to 1 or > >> PAGE_IS_FILE in tested_flags. > > > > I require flags PAGE_IS_WRITTEN=1, PAGE_IS_FILE=0, so: > > > > required_mask = PAGE_IS_WRITTEN | PAGE_IS_FILE; > > negated_flags = PAGE_IS_FILE; // flags I want zero > You want PAGE_IS_FILE to be zero and at the same time you are requiring the > PAGE_IS_FILE. It is confusing. Ok, I believe the misunderstanding comes from the naming. I "require" the flag to be a particular value - hence include it in "required_flags" and specify the required value in ~negated_flags. You "require" the flag to be set (equal 1) and so include it in "required_flags" and you "require" the flag to be clear (equal to 0) so include it in "excluded_flags". Both approaches are correct, but I would not consider one "easier" than the other. The former is more general, though - makes any_of also able to match on flags cleared and removes the possibility of a conflicting case of a flag present in both sets. Maybe considered_flags or matched_flags then would make the field better understandable? Best Regards Michał Mirosław
> On Feb 22, 2023, at 11:10 PM, Muhammad Usama Anjum <usama.anjum@collabora.com> wrote: > > Hi Nadav, Mike, Michał, > > Can you please share your thoughts at [A] below? I promised I won't talk about the API, but was persuaded to reconsider. I have a general question regarding the suitablity of currently proposed high-level API. To explore some alternatives, I'd like to suggest an alternative that may have some advantages. If these have already been considered and dismissed, feel free to ignore. I believe we have two distinct usage scenarios: (1) vectored reads from pagemap, and (2) atomic UFFD WP-read/protect. It's possible that these require separate interfaces Regarding vectored reads, I believe the simplest solution is to maintain the current pagemap entry format for output and extend it if necessary. The input can be a vector of ranges. I'm uncertain about the purpose of fields such as 'anyof_mask' in 'pagemap_scan_arg', so I can't confirm their necessity and whether the input need to be made. more complicated. There is a possibility that fields such as 'anyof_mask' might expose internal APIs, so I hope they’re not required. For the atomic operation of 'PAGE_IS_WRITTEN' + 'PAGEMAP_WP_ENGAGE', a different mechanism might be necessary. This function appears to be UFFD-specific. Instead of the proposed IOCTL, an alternative option is to use 'UFFD_FEATURE_WP_ASYNC' to log the pages that were written, similar to page-modification logging on Intel. Since this feature appears to be specific to UFFD, I believe it would be more appropriate to include the log as part of the UFFD mechanism rather than the pagemap. From my experience with UFFD, proper ordering of events is crucial, although it is not always done well. Therefore, we should aim for improvement, not regression. I believe that utilizing the pagemap-based mechanism for WP'ing might be a step in the wrong direction. I think that it would have been better to emit a 'UFFD_FEATURE_WP_ASYNC' WP-log (and ordered) with UFFD #PF and events. The 'UFFD_FEATURE_WP_ASYNC'-log may not need to wake waiters on the file descriptor unless the log is full. I am sorry that I chime in that late, but I think the complications that the proposed mechanism might raise are not negligible. And anyhow this patch-set still requires quite a bit of work before it can be merged.
On Tue, Feb 21, 2023 at 4:42 AM Michał Mirosław <emmir@google.com> wrote: > > On Tue, 21 Feb 2023 at 11:28, Muhammad Usama Anjum > <usama.anjum@collabora.com> wrote: > > > > Hi Michał, > > > > Thank you so much for comment! > > > > On 2/17/23 8:18 PM, Michał Mirosław wrote: > [...] > > > For the page-selection mechanism, currently required_mask and > > > excluded_mask have conflicting > > They are opposite of each other: > > All the set bits in required_mask must be set for the page to be selected. > > All the set bits in excluded_mask must _not_ be set for the page to be > > selected. > > > > > responsibilities. I suggest to rework that to: > > > 1. negated_flags: page flags which are to be negated before applying > > > the page selection using following masks; > > Sorry I'm unable to understand the negation (which is XOR?). Lets look at > > the truth table: > > Page Flag negated_flags > > 0 0 0 > > 0 1 1 > > 1 0 1 > > 1 1 0 > > > > If a page flag is 0 and negated_flag is 1, the result would be 1 which has > > changed the page flag. It isn't making sense to me. Why the page flag bit > > is being fliped? > > > > When Anrdei had proposed these masks, they seemed like a fancy way of > > filtering inside kernel and it was straight forward to understand. These > > masks would help his use cases for CRIU. So I'd included it. Please can you > > elaborate what is the purpose of negation? > > The XOR is a way to invert the tested value of a flag (from positive > to negative and the other way) without having the API with invalid > values (with required_flags and excluded_flags you need to define a > rule about what happens if a flag is present in both of the masks - > either prioritise one mask over the other or reject the call). > (Note: the XOR is applied only to the value of the flags for the > purpose of testing page-selection criteria.) Michał, Your API isn't much different from the current one, but it requires a bit more brain activity for understanding. The current set of masks can be easy translated to the new one: negated_flags = excluded_flags required_flags_new = excluded_flags | required_flags As for invalid values, I think it is an advantage of the current API. I mean we can easily detect invalid values and return EINVAL. With your API, such mistakes will be undetectable. As for priorities, I don't see this problem here If I don't miss something. We can rewrite the code this way: ``` if (required_mask && ((page_flags & required_mask) != required_mask) skip page; if (anyof_mask && !(page_flags & anyof_mask)) skip page; if (page_flags & excluded_mask) skip page; ``` I think the result is always the same no matter in what order each mask is applied. Thanks, Andrei
On Fri, 24 Feb 2023 at 03:20, Andrei Vagin <avagin@gmail.com> wrote: > > On Tue, Feb 21, 2023 at 4:42 AM Michał Mirosław <emmir@google.com> wrote: > > > > On Tue, 21 Feb 2023 at 11:28, Muhammad Usama Anjum > > <usama.anjum@collabora.com> wrote: > > > > > > Hi Michał, > > > > > > Thank you so much for comment! > > > > > > On 2/17/23 8:18 PM, Michał Mirosław wrote: > > [...] > > > > For the page-selection mechanism, currently required_mask and > > > > excluded_mask have conflicting > > > They are opposite of each other: > > > All the set bits in required_mask must be set for the page to be selected. > > > All the set bits in excluded_mask must _not_ be set for the page to be > > > selected. > > > > > > > responsibilities. I suggest to rework that to: > > > > 1. negated_flags: page flags which are to be negated before applying > > > > the page selection using following masks; > > > Sorry I'm unable to understand the negation (which is XOR?). Lets look at > > > the truth table: > > > Page Flag negated_flags > > > 0 0 0 > > > 0 1 1 > > > 1 0 1 > > > 1 1 0 > > > > > > If a page flag is 0 and negated_flag is 1, the result would be 1 which has > > > changed the page flag. It isn't making sense to me. Why the page flag bit > > > is being fliped? > > > > > > When Anrdei had proposed these masks, they seemed like a fancy way of > > > filtering inside kernel and it was straight forward to understand. These > > > masks would help his use cases for CRIU. So I'd included it. Please can you > > > elaborate what is the purpose of negation? > > > > The XOR is a way to invert the tested value of a flag (from positive > > to negative and the other way) without having the API with invalid > > values (with required_flags and excluded_flags you need to define a > > rule about what happens if a flag is present in both of the masks - > > either prioritise one mask over the other or reject the call). > > (Note: the XOR is applied only to the value of the flags for the > > purpose of testing page-selection criteria.) > > Michał, > > Your API isn't much different from the current one, but it requires > a bit more brain activity for understanding. > > The current set of masks can be easy translated to the new one: > negated_flags = excluded_flags > required_flags_new = excluded_flags | required_flags > > As for invalid values, I think it is an advantage of the current API. > I mean we can easily detect invalid values and return EINVAL. With your > API, such mistakes will be undetectable. > > As for priorities, I don't see this problem here If I don't miss something. > > We can rewrite the code this way: > ``` > if (required_mask && ((page_flags & required_mask) != required_mask) > skip page; > if (anyof_mask && !(page_flags & anyof_mask)) > skip page; > if (page_flags & excluded_mask) > skip page; > ``` > > I think the result is always the same no matter in what order each > mask is applied. Hi, I would not want the discussion to wander into easier/harder territory as that highty depends on experience one has. What I'm arguing about is the consistency of the API. Let me expand a bit on that. We have two ways to look at the page_flags: A. the field represents a *set of elements* (tags, attributes) present on the page; B. the field represents a bitfield (structure; a fixed set of boolean fields having a value of 0 or 1) From A follows the include/exclude way of API design for matching the flags, and from B the matched mask (which flags to check) + value set (what values to require). My argument is that B is consistent with how the flags are used in the kernel: we don't have operations that add or remove flags, but we have operations that set or change their value. Best Regards Michał Mirosław
On Thu, Feb 23, 2023 at 05:11:11PM +0000, Nadav Amit wrote: > From my experience with UFFD, proper ordering of events is crucial, although it > is not always done well. Therefore, we should aim for improvement, not > regression. I believe that utilizing the pagemap-based mechanism for WP'ing > might be a step in the wrong direction. I think that it would have been better > to emit a 'UFFD_FEATURE_WP_ASYNC' WP-log (and ordered) with UFFD #PF and > events. The 'UFFD_FEATURE_WP_ASYNC'-log may not need to wake waiters on the > file descriptor unless the log is full. Yes this is an interesting question to think about.. Keeping the data in the pgtable has one good thing that it doesn't need any complexity on maintaining the log, and no possibility of "log full". If there's possible "log full" then the next question is whether we should let the worker wait the monitor if the monitor is not fast enough to collect those data. It adds some slight dependency on the two threads, I think it can make the tracking harder or impossible in latency sensitive workloads. The other thing is we can also make the log "never gonna full" by making it a bitmap covering any registered ranges, but I don't either know whether it'll be worth it for the effort. Thanks,
> On Feb 27, 2023, at 1:18 PM, Peter Xu <peterx@redhat.com> wrote: > > !! External Email > > On Thu, Feb 23, 2023 at 05:11:11PM +0000, Nadav Amit wrote: >> From my experience with UFFD, proper ordering of events is crucial, although it >> is not always done well. Therefore, we should aim for improvement, not >> regression. I believe that utilizing the pagemap-based mechanism for WP'ing >> might be a step in the wrong direction. I think that it would have been better >> to emit a 'UFFD_FEATURE_WP_ASYNC' WP-log (and ordered) with UFFD #PF and >> events. The 'UFFD_FEATURE_WP_ASYNC'-log may not need to wake waiters on the >> file descriptor unless the log is full. > > Yes this is an interesting question to think about.. > > Keeping the data in the pgtable has one good thing that it doesn't need any > complexity on maintaining the log, and no possibility of "log full". I understand your concern, but I think that eventually it might be simpler to maintain, since the logic of how to process the log is moved to userspace. At the same time, handling inputs from pagemap and uffd handlers and sync’ing them would not be too easy for userspace. But yes, allocation on the heap for userfaultfd_wait_queue-like entries would be needed, and there are some issues of ordering the events (I think all #PF and other events should be ordered regardless) and how not to traverse all async-userfaultfd_wait_queue’s (except those that block if the log is full) when a wakeup is needed. > > If there's possible "log full" then the next question is whether we should > let the worker wait the monitor if the monitor is not fast enough to > collect those data. It adds some slight dependency on the two threads, I > think it can make the tracking harder or impossible in latency sensitive > workloads. Again, I understand your concern. But this model that I propose is not new. It is used with PML (page-modification logging) and KVM, and IIRC there is a similar interface between KVM and QEMU to provide this information. There are endless other examples for similar producer-consumer mechanisms that might lead to stall in extreme cases. > > The other thing is we can also make the log "never gonna full" by making it > a bitmap covering any registered ranges, but I don't either know whether > it'll be worth it for the effort. I do not see a benefit of half-log half-scan. It tries to take the data-structure of one format and combine it with another. Anyhow, I was just giving my 2 cents. Admittedly, I did not follow the threads of previous versions and I did not see userspace components that use the API to say something smart. Personally, I do not find the current API proposal to be very consistent and simple, and it seems to me that it lets pagemap do userfaultfd-related tasks, which might be considered inappropriate and non-intuitive. If I derailed the discussion, I apologize.
On Mon, Feb 27, 2023 at 11:09:12PM +0000, Nadav Amit wrote: > > > > On Feb 27, 2023, at 1:18 PM, Peter Xu <peterx@redhat.com> wrote: > > > > !! External Email > > > > On Thu, Feb 23, 2023 at 05:11:11PM +0000, Nadav Amit wrote: > >> From my experience with UFFD, proper ordering of events is crucial, although it > >> is not always done well. Therefore, we should aim for improvement, not > >> regression. I believe that utilizing the pagemap-based mechanism for WP'ing > >> might be a step in the wrong direction. I think that it would have been better > >> to emit a 'UFFD_FEATURE_WP_ASYNC' WP-log (and ordered) with UFFD #PF and > >> events. The 'UFFD_FEATURE_WP_ASYNC'-log may not need to wake waiters on the > >> file descriptor unless the log is full. > > > > Yes this is an interesting question to think about.. > > > > Keeping the data in the pgtable has one good thing that it doesn't need any > > complexity on maintaining the log, and no possibility of "log full". > > I understand your concern, but I think that eventually it might be simpler > to maintain, since the logic of how to process the log is moved to userspace. > > At the same time, handling inputs from pagemap and uffd handlers and sync’ing > them would not be too easy for userspace. I do not expect a common uffd-wp async user to provide a fault handler at all. In my imagination it's in most cases used standalone from other uffd modes; it means all the faults will still be handled by the kernel. Here we only leverage the accuracy of userfaultfd comparing to soft-dirty, so not really real "user"-faults. > > But yes, allocation on the heap for userfaultfd_wait_queue-like entries would > be needed, and there are some issues of ordering the events (I think all #PF > and other events should be ordered regardless) and how not to traverse all > async-userfaultfd_wait_queue’s (except those that block if the log is full) > when a wakeup is needed. Will there be an ordering requirement for an async mode? Considering it should be async to whatever else, I would think it's not a problem, but maybe I missed something. > > > > > If there's possible "log full" then the next question is whether we should > > let the worker wait the monitor if the monitor is not fast enough to > > collect those data. It adds some slight dependency on the two threads, I > > think it can make the tracking harder or impossible in latency sensitive > > workloads. > > Again, I understand your concern. But this model that I propose is not new. > It is used with PML (page-modification logging) and KVM, and IIRC there is > a similar interface between KVM and QEMU to provide this information. There > are endless other examples for similar producer-consumer mechanisms that > might lead to stall in extreme cases. Yes, I'm not against thinking of using similar structures here. It's just that it's definitely more complicated on the interface, at least we need yet one more interface to setup the rings and define its interfaces. Note that although Muhammud is defining another new interface here too for pagemap, I don't think it's strictly needed for uffd-wp async mode. One can use uffd-wp async mode with PM_UFFD_WP which is with current pagemap interface already. So what Muhammud is proposing here are two things to me: (1) uffd-wp async, plus (2) a new pagemap interface (which will closely work with (1) only if we need atomicity on get-dirty and reprotect). Defining new interface for uffd-wp async mode will be something extra, so IMHO besides the heap allocation on the rings, we need to also justify whether that is needed. That's why I think it's fine to go with what Muhammud proposed, because it's a minimum changeset at least for userfault to support an async mode, and anything else can be done on top if necessary. Going a bit back to the "lead to stall in extreme cases" above, just also want to mention that the VM use case is slightly different - dirty tracking is only heavily used during migration afaict, and it's a short period. Not a lot of people will complain performance degrades during that period because that's just rare. And, even without the ring the perf is really bad during migration anyway... Especially when huge pages are used to back the guest RAM. Here it's slightly different to me: it's about tracking dirty pages during any possible workload, and it can be monitored periodically and frequently. So IMHO stricter than a VM use case where migration is the only period to use it. > > > > > The other thing is we can also make the log "never gonna full" by making it > > a bitmap covering any registered ranges, but I don't either know whether > > it'll be worth it for the effort. > > I do not see a benefit of half-log half-scan. It tries to take the > data-structure of one format and combine it with another. What I'm saying here is not half-log / half-scan, but use a single bitmap to store what page is dirty, just like KVM_GET_DIRTY_LOG. I think it avoids any above "stall" issue. > > Anyhow, I was just giving my 2 cents. Admittedly, I did not follow the > threads of previous versions and I did not see userspace components that > use the API to say something smart. Actually similar here. :) So I'm probably not the best one to describe what is the best to look as API. What I know is I think the new pagemap interface is welcomed by CRIU developers, so it may be something good with/without userfaultfd getting involved already. I see this as "let's add one more bit for uffd-wp" in the new interface only. Quotting some link I got from Muhammud before with CRIU usage: https://lore.kernel.org/all/YyiDg79flhWoMDZB@gmail.com https://lore.kernel.org/all/20221014134802.1361436-1-mdanylo@google.com > Personally, I do not find the current API proposal to be very consistent > and simple, and it seems to me that it lets pagemap do > userfaultfd-related tasks, which might be considered inappropriate and > non-intuitive. Yes, I agree. I just don't know what's the best way to avoid this. The issue here IIUC is Muhammud needs one operation to do what Windows does with getWriteWatch() API. It means we need to mix up GET and PROTECT in a single shot. If we want to use pagemap as GET, then no choice to PROTECT also here to me. I think it'll be the same to soft-dirty if it's used, it means we'll extend soft-dirty modifications from clear_refs to pagemap too which I also don't think it's as clean. > > If I derailed the discussion, I apologize. Not at all. I just wished you joined earlier!
> On Feb 28, 2023, at 7:55 AM, Peter Xu <peterx@redhat.com> wrote: > > !! External Email > > On Mon, Feb 27, 2023 at 11:09:12PM +0000, Nadav Amit wrote: >> >> >>> On Feb 27, 2023, at 1:18 PM, Peter Xu <peterx@redhat.com> wrote: >>> >>> !! External Email >>> >>> On Thu, Feb 23, 2023 at 05:11:11PM +0000, Nadav Amit wrote: >>>> From my experience with UFFD, proper ordering of events is crucial, although it >>>> is not always done well. Therefore, we should aim for improvement, not >>>> regression. I believe that utilizing the pagemap-based mechanism for WP'ing >>>> might be a step in the wrong direction. I think that it would have been better >>>> to emit a 'UFFD_FEATURE_WP_ASYNC' WP-log (and ordered) with UFFD #PF and >>>> events. The 'UFFD_FEATURE_WP_ASYNC'-log may not need to wake waiters on the >>>> file descriptor unless the log is full. >>> >>> Yes this is an interesting question to think about.. >>> >>> Keeping the data in the pgtable has one good thing that it doesn't need any >>> complexity on maintaining the log, and no possibility of "log full". >> >> I understand your concern, but I think that eventually it might be simpler >> to maintain, since the logic of how to process the log is moved to userspace. >> >> At the same time, handling inputs from pagemap and uffd handlers and sync’ing >> them would not be too easy for userspace. > > I do not expect a common uffd-wp async user to provide a fault handler at > all. In my imagination it's in most cases used standalone from other uffd > modes; it means all the faults will still be handled by the kernel. Here > we only leverage the accuracy of userfaultfd comparing to soft-dirty, so > not really real "user"-faults. If that is the only use-case, it might make sense. But I guess most users would most likely use some library (and not syscalls directly). So slightly complicating the API for better generality may be reasonable. > >> >> But yes, allocation on the heap for userfaultfd_wait_queue-like entries would >> be needed, and there are some issues of ordering the events (I think all #PF >> and other events should be ordered regardless) and how not to traverse all >> async-userfaultfd_wait_queue’s (except those that block if the log is full) >> when a wakeup is needed. > > Will there be an ordering requirement for an async mode? Considering it > should be async to whatever else, I would think it's not a problem, but > maybe I missed something. You may be right, but I am not sure. I am still not sure what use-cases are targeted in this patch-set. For CRIU checkpoint use-case (when the app is not running), I guess the current interface makes sense. But if there are use-cases in which this you do care about UFFD-events this can become an issue. But even in some obvious use-cases, this might be the wrong interface for major performance issues. If we think about some incremental copying of modified pages (a-la pre-copy live-migration or to create point-in-time snapshots), it seems to me much more efficient for application to have a log than traversing all the page-tables. >> >>> >>> If there's possible "log full" then the next question is whether we should >>> let the worker wait the monitor if the monitor is not fast enough to >>> collect those data. It adds some slight dependency on the two threads, I >>> think it can make the tracking harder or impossible in latency sensitive >>> workloads. >> >> Again, I understand your concern. But this model that I propose is not new. >> It is used with PML (page-modification logging) and KVM, and IIRC there is >> a similar interface between KVM and QEMU to provide this information. There >> are endless other examples for similar producer-consumer mechanisms that >> might lead to stall in extreme cases. > > Yes, I'm not against thinking of using similar structures here. It's just > that it's definitely more complicated on the interface, at least we need > yet one more interface to setup the rings and define its interfaces. > > Note that although Muhammud is defining another new interface here too for > pagemap, I don't think it's strictly needed for uffd-wp async mode. One > can use uffd-wp async mode with PM_UFFD_WP which is with current pagemap > interface already. > > So what Muhammud is proposing here are two things to me: (1) uffd-wp async, > plus (2) a new pagemap interface (which will closely work with (1) only if > we need atomicity on get-dirty and reprotect). > > Defining new interface for uffd-wp async mode will be something extra, so > IMHO besides the heap allocation on the rings, we need to also justify > whether that is needed. That's why I think it's fine to go with what > Muhammud proposed, because it's a minimum changeset at least for userfault > to support an async mode, and anything else can be done on top if necessary. > > Going a bit back to the "lead to stall in extreme cases" above, just also > want to mention that the VM use case is slightly different - dirty tracking > is only heavily used during migration afaict, and it's a short period. Not > a lot of people will complain performance degrades during that period > because that's just rare. And, even without the ring the perf is really > bad during migration anyway... Especially when huge pages are used to back > the guest RAM. > > Here it's slightly different to me: it's about tracking dirty pages during > any possible workload, and it can be monitored periodically and frequently. > So IMHO stricter than a VM use case where migration is the only period to > use it. I still don’t get the use-cases. "monitored periodically and frequently” is not a use-case. And as I said before, actually, monitoring frequently is more performant with a log than with scanning all the page-tables. > >> >>> >>> The other thing is we can also make the log "never gonna full" by making it >>> a bitmap covering any registered ranges, but I don't either know whether >>> it'll be worth it for the effort. >> >> I do not see a benefit of half-log half-scan. It tries to take the >> data-structure of one format and combine it with another. > > What I'm saying here is not half-log / half-scan, but use a single bitmap > to store what page is dirty, just like KVM_GET_DIRTY_LOG. I think it > avoids any above "stall" issue. Oh, I never went into the KVM details before - stupid me. If that’s what eventually was proven to work for KVM/QEMU, then it really sounds like the pagemap solution that Muhammad proposed. But still not convoluting pagemap with userfaultfd (and especially uffd-wp) can be beneficial. Linus already threw some comments here and there about disliking uffd-wp, and I’m not sure adding uffd-wp specific stuff to pagemap would be welcomed. Anyhow, thanks for all the explanations. Eventually, I understand that using bitmaps can be more efficient than a log if the bits are condensed.
On Tue, Feb 28, 2023 at 05:21:20PM +0000, Nadav Amit wrote: > > > > On Feb 28, 2023, at 7:55 AM, Peter Xu <peterx@redhat.com> wrote: > > > > !! External Email > > > > On Mon, Feb 27, 2023 at 11:09:12PM +0000, Nadav Amit wrote: > >> > >> > >>> On Feb 27, 2023, at 1:18 PM, Peter Xu <peterx@redhat.com> wrote: > >>> > >>> !! External Email > >>> > >>> On Thu, Feb 23, 2023 at 05:11:11PM +0000, Nadav Amit wrote: > >>>> From my experience with UFFD, proper ordering of events is crucial, although it > >>>> is not always done well. Therefore, we should aim for improvement, not > >>>> regression. I believe that utilizing the pagemap-based mechanism for WP'ing > >>>> might be a step in the wrong direction. I think that it would have been better > >>>> to emit a 'UFFD_FEATURE_WP_ASYNC' WP-log (and ordered) with UFFD #PF and > >>>> events. The 'UFFD_FEATURE_WP_ASYNC'-log may not need to wake waiters on the > >>>> file descriptor unless the log is full. > >>> > >>> Yes this is an interesting question to think about.. > >>> > >>> Keeping the data in the pgtable has one good thing that it doesn't need any > >>> complexity on maintaining the log, and no possibility of "log full". > >> > >> I understand your concern, but I think that eventually it might be simpler > >> to maintain, since the logic of how to process the log is moved to userspace. > >> > >> At the same time, handling inputs from pagemap and uffd handlers and sync’ing > >> them would not be too easy for userspace. > > > > I do not expect a common uffd-wp async user to provide a fault handler at > > all. In my imagination it's in most cases used standalone from other uffd > > modes; it means all the faults will still be handled by the kernel. Here > > we only leverage the accuracy of userfaultfd comparing to soft-dirty, so > > not really real "user"-faults. > > If that is the only use-case, it might make sense. But I guess most users would > most likely use some library (and not syscalls directly). So slightly > complicating the API for better generality may be reasonable. > > > > >> > >> But yes, allocation on the heap for userfaultfd_wait_queue-like entries would > >> be needed, and there are some issues of ordering the events (I think all #PF > >> and other events should be ordered regardless) and how not to traverse all > >> async-userfaultfd_wait_queue’s (except those that block if the log is full) > >> when a wakeup is needed. > > > > Will there be an ordering requirement for an async mode? Considering it > > should be async to whatever else, I would think it's not a problem, but > > maybe I missed something. > > You may be right, but I am not sure. I am still not sure what use-cases are > targeted in this patch-set. For CRIU checkpoint use-case (when the app is > not running), I guess the current interface makes sense. But if there are > use-cases in which this you do care about UFFD-events this can become an > issue. > > But even in some obvious use-cases, this might be the wrong interface for > major performance issues. If we think about some incremental copying of > modified pages (a-la pre-copy live-migration or to create point-in-time > snapshots), it seems to me much more efficient for application to have a > log than traversing all the page-tables. IMHO snapshots may not need a log at all - it needs CoW before the write happens. Nor is the case for swapping with userfaults, IIUC. IOW in those cases people don't care which page got dirtied, but care on data not being modified until the app allows it to. But I get the point, and I agree collecting by scanning is slower. > > > >> > >>> > >>> If there's possible "log full" then the next question is whether we should > >>> let the worker wait the monitor if the monitor is not fast enough to > >>> collect those data. It adds some slight dependency on the two threads, I > >>> think it can make the tracking harder or impossible in latency sensitive > >>> workloads. > >> > >> Again, I understand your concern. But this model that I propose is not new. > >> It is used with PML (page-modification logging) and KVM, and IIRC there is > >> a similar interface between KVM and QEMU to provide this information. There > >> are endless other examples for similar producer-consumer mechanisms that > >> might lead to stall in extreme cases. > > > > Yes, I'm not against thinking of using similar structures here. It's just > > that it's definitely more complicated on the interface, at least we need > > yet one more interface to setup the rings and define its interfaces. > > > > Note that although Muhammud is defining another new interface here too for > > pagemap, I don't think it's strictly needed for uffd-wp async mode. One > > can use uffd-wp async mode with PM_UFFD_WP which is with current pagemap > > interface already. > > > > So what Muhammud is proposing here are two things to me: (1) uffd-wp async, > > plus (2) a new pagemap interface (which will closely work with (1) only if > > we need atomicity on get-dirty and reprotect). > > > > Defining new interface for uffd-wp async mode will be something extra, so > > IMHO besides the heap allocation on the rings, we need to also justify > > whether that is needed. That's why I think it's fine to go with what > > Muhammud proposed, because it's a minimum changeset at least for userfault > > to support an async mode, and anything else can be done on top if necessary. > > > > Going a bit back to the "lead to stall in extreme cases" above, just also > > want to mention that the VM use case is slightly different - dirty tracking > > is only heavily used during migration afaict, and it's a short period. Not > > a lot of people will complain performance degrades during that period > > because that's just rare. And, even without the ring the perf is really > > bad during migration anyway... Especially when huge pages are used to back > > the guest RAM. > > > > Here it's slightly different to me: it's about tracking dirty pages during > > any possible workload, and it can be monitored periodically and frequently. > > So IMHO stricter than a VM use case where migration is the only period to > > use it. > > I still don’t get the use-cases. "monitored periodically and frequently” is > not a use-case. And as I said before, actually, monitoring frequently is > more performant with a log than with scanning all the page-tables. Feel free to ignore this part if we're not taking about using a ring structure. My previous comment was mostly for that. Bitmaps won't have this issue. Here I see a bitmap as one way to implement a log, where it's recorded by one bit per page. My comment was that we should be careful on using rings. Side note: actually kvm dirty ring is even trickier; see the soft-full (kvm_dirty_ring.soft_limit) besides the hard-full event to make sure hard-full won't really trigger (or we're prone to lose dirty bits). I don't think we'll have the same issue here so we can trigger hard-full, but it's still unwanted to halt the threads being tracked for dirty pages. I don't know whether there'll be other side effects by the ring, though.. > > > > >> > >>> > >>> The other thing is we can also make the log "never gonna full" by making it > >>> a bitmap covering any registered ranges, but I don't either know whether > >>> it'll be worth it for the effort. > >> > >> I do not see a benefit of half-log half-scan. It tries to take the > >> data-structure of one format and combine it with another. > > > > What I'm saying here is not half-log / half-scan, but use a single bitmap > > to store what page is dirty, just like KVM_GET_DIRTY_LOG. I think it > > avoids any above "stall" issue. > > Oh, I never went into the KVM details before - stupid me. If that’s what > eventually was proven to work for KVM/QEMU, then it really sounds like > the pagemap solution that Muhammad proposed. > > But still not convoluting pagemap with userfaultfd (and especially > uffd-wp) can be beneficial. Linus already threw some comments here and > there about disliking uffd-wp, and I’m not sure adding uffd-wp specific > stuff to pagemap would be welcomed. Yes I also don't know.. As I mentioned I'm not super happy with the interface either, but that's the simplest I can think of so far. IOW, from an "userfaultfd-side reviewer" POV I'm fine if someone wants to leverage the concepts of uffd-wp and its internals using a separate but very light weighted patch just to impl async mode of uffd-wp. But I'm always open to any suggestions too. It's just that when there're multiple options and when we're not confident on either way, I normally prefer the simplest and cleanest (even if less efficient). > Anyhow, thanks for all the explanations. Eventually, I understand that > using bitmaps can be more efficient than a log if the bits are condensed. Note that I think what Muhammad (sorry, Muhammad! I think I spelled your name wrongly before starting from some email..) proposed is not a bitmap, but an array of ranges that can coalesce the result into very condensed form. Pros and cons. Again, I can't comment much on that API, but since there're a bunch of other developers looking at that and they're also potential future users, I'll trust their judgement and just focus more on the other side of things. Thanks,
> On Feb 28, 2023, at 11:31 AM, Peter Xu <peterx@redhat.com> wrote: > >> Anyhow, thanks for all the explanations. Eventually, I understand that >> using bitmaps can be more efficient than a log if the bits are condensed. > > Note that I think what Muhammad (sorry, Muhammad! I think I spelled your > name wrongly before starting from some email..) proposed is not a bitmap, > but an array of ranges that can coalesce the result into very condensed > form. Pros and cons. > > Again, I can't comment much on that API, but since there're a bunch of > other developers looking at that and they're also potential future users, > I'll trust their judgement and just focus more on the other side of things. Thanks Peter for your patience. I would just note that I understood that Muhammad did not propose a condensed bitmap, and that was a hint that handling a condensed bitmap (at least on x86) can be done rather efficiently. I am not sure about other representations. Thanks for your explanations again, Peter.
diff --git a/fs/proc/task_mmu.c b/fs/proc/task_mmu.c index e35a0398db63..c6bde19d63d9 100644 --- a/fs/proc/task_mmu.c +++ b/fs/proc/task_mmu.c @@ -19,6 +19,7 @@ #include <linux/shmem_fs.h> #include <linux/uaccess.h> #include <linux/pkeys.h> +#include <linux/minmax.h> #include <asm/elf.h> #include <asm/tlb.h> @@ -1135,6 +1136,22 @@ static inline void clear_soft_dirty(struct vm_area_struct *vma, } #endif +static inline bool is_pte_uffd_wp(pte_t pte) +{ + if ((pte_present(pte) && pte_uffd_wp(pte)) || + (pte_swp_uffd_wp_any(pte))) + return true; + return false; +} + +static inline bool is_pmd_uffd_wp(pmd_t pmd) +{ + if ((pmd_present(pmd) && pmd_uffd_wp(pmd)) || + (is_swap_pmd(pmd) && pmd_swp_uffd_wp(pmd))) + return true; + return false; +} + #if defined(CONFIG_MEM_SOFT_DIRTY) && defined(CONFIG_TRANSPARENT_HUGEPAGE) static inline void clear_soft_dirty_pmd(struct vm_area_struct *vma, unsigned long addr, pmd_t *pmdp) @@ -1763,11 +1780,284 @@ static int pagemap_release(struct inode *inode, struct file *file) return 0; } +#define PAGEMAP_BITS_ALL (PAGE_IS_WRITTEN | PAGE_IS_FILE | \ + PAGE_IS_PRESENT | PAGE_IS_SWAPPED) +#define PAGEMAP_NON_WRITTEN_BITS (PAGE_IS_FILE | PAGE_IS_PRESENT | PAGE_IS_SWAPPED) +#define IS_WP_ENGAGE_OP(a) (a->flags & PAGEMAP_WP_ENGAGE) +#define IS_GET_OP(a) (a->vec) +#define HAS_NO_SPACE(p) (p->max_pages && (p->found_pages == p->max_pages)) + +#define PAGEMAP_SCAN_BITMAP(wt, file, present, swap) \ + (wt | file << 1 | present << 2 | swap << 3) +#define IS_WT_REQUIRED(a) \ + ((a->required_mask & PAGE_IS_WRITTEN) || \ + (a->anyof_mask & PAGE_IS_WRITTEN)) + +struct pagemap_scan_private { + struct page_region *vec; + struct page_region prev; + unsigned long vec_len, vec_index; + unsigned int max_pages, found_pages, flags; + unsigned long required_mask, anyof_mask, excluded_mask, return_mask; +}; + +static int pagemap_scan_test_walk(unsigned long start, unsigned long end, struct mm_walk *walk) +{ + struct pagemap_scan_private *p = walk->private; + struct vm_area_struct *vma = walk->vma; + + if (IS_WT_REQUIRED(p) && !userfaultfd_wp(vma) && !userfaultfd_wp_async(vma)) + return -EPERM; + if (vma->vm_flags & VM_PFNMAP) + return 1; + return 0; +} + +static inline int pagemap_scan_output(bool wt, bool file, bool pres, bool swap, + struct pagemap_scan_private *p, unsigned long addr, + unsigned int len) +{ + unsigned long bitmap, cur = PAGEMAP_SCAN_BITMAP(wt, file, pres, swap); + bool cpy = true; + struct page_region *prev = &p->prev; + + if (HAS_NO_SPACE(p)) + return -ENOSPC; + + if (p->max_pages && p->found_pages + len >= p->max_pages) + len = p->max_pages - p->found_pages; + if (!len) + return -EINVAL; + + if (p->required_mask) + cpy = ((p->required_mask & cur) == p->required_mask); + if (cpy && p->anyof_mask) + cpy = (p->anyof_mask & cur); + if (cpy && p->excluded_mask) + cpy = !(p->excluded_mask & cur); + bitmap = cur & p->return_mask; + if (cpy && bitmap) { + if ((prev->len) && (prev->bitmap == bitmap) && + (prev->start + prev->len * PAGE_SIZE == addr)) { + prev->len += len; + p->found_pages += len; + } else if (p->vec_index < p->vec_len) { + if (prev->len) { + memcpy(&p->vec[p->vec_index], prev, sizeof(struct page_region)); + p->vec_index++; + } + prev->start = addr; + prev->len = len; + prev->bitmap = bitmap; + p->found_pages += len; + } else { + return -ENOSPC; + } + } + return 0; +} + +static inline int export_prev_to_out(struct pagemap_scan_private *p, struct page_region __user *vec, + unsigned long *vec_index) +{ + struct page_region *prev = &p->prev; + + if (prev->len) { + if (copy_to_user(&vec[*vec_index], prev, sizeof(struct page_region))) + return -EFAULT; + p->vec_index++; + (*vec_index)++; + prev->len = 0; + } + return 0; +} + +static inline int pagemap_scan_pmd_entry(pmd_t *pmd, unsigned long start, + unsigned long end, struct mm_walk *walk) +{ + struct pagemap_scan_private *p = walk->private; + struct vm_area_struct *vma = walk->vma; + unsigned long addr = end; + spinlock_t *ptl; + int ret = 0; + pte_t *pte; + +#ifdef CONFIG_TRANSPARENT_HUGEPAGE + ptl = pmd_trans_huge_lock(pmd, vma); + if (ptl) { + bool pmd_wt; + + pmd_wt = !is_pmd_uffd_wp(*pmd); + /* + * Break huge page into small pages if operation needs to be performed is + * on a portion of the huge page. + */ + if (pmd_wt && IS_WP_ENGAGE_OP(p) && (end - start < HPAGE_SIZE)) { + spin_unlock(ptl); + split_huge_pmd(vma, pmd, start); + goto process_smaller_pages; + } + if (IS_GET_OP(p)) + ret = pagemap_scan_output(pmd_wt, vma->vm_file, pmd_present(*pmd), + is_swap_pmd(*pmd), p, start, + (end - start)/PAGE_SIZE); + spin_unlock(ptl); + if (!ret) { + if (pmd_wt && IS_WP_ENGAGE_OP(p)) + uffd_wp_range(walk->mm, vma, start, HPAGE_SIZE, true); + } + return ret; + } +process_smaller_pages: + if (pmd_trans_unstable(pmd)) + return 0; +#endif /* CONFIG_TRANSPARENT_HUGEPAGE */ + + pte = pte_offset_map_lock(vma->vm_mm, pmd, start, &ptl); + if (IS_GET_OP(p)) { + for (addr = start; addr < end; pte++, addr += PAGE_SIZE) { + ret = pagemap_scan_output(!is_pte_uffd_wp(*pte), vma->vm_file, + pte_present(*pte), is_swap_pte(*pte), p, addr, 1); + if (ret) + break; + } + } + pte_unmap_unlock(pte - 1, ptl); + if ((!ret || ret == -ENOSPC) && IS_WP_ENGAGE_OP(p) && (addr - start)) + uffd_wp_range(walk->mm, vma, start, addr - start, true); + + cond_resched(); + return ret; +} + +static int pagemap_scan_pte_hole(unsigned long addr, unsigned long end, int depth, + struct mm_walk *walk) +{ + struct pagemap_scan_private *p = walk->private; + struct vm_area_struct *vma = walk->vma; + int ret = 0; + + if (vma) + ret = pagemap_scan_output(false, vma->vm_file, false, false, p, addr, + (end - addr)/PAGE_SIZE); + return ret; +} + +/* No hugetlb support is present. */ +static const struct mm_walk_ops pagemap_scan_ops = { + .test_walk = pagemap_scan_test_walk, + .pmd_entry = pagemap_scan_pmd_entry, + .pte_hole = pagemap_scan_pte_hole, +}; + +static long do_pagemap_cmd(struct mm_struct *mm, struct pagemap_scan_arg *arg) +{ + unsigned long empty_slots, vec_index = 0; + unsigned long __user start, end; + unsigned long __start, __end; + struct page_region __user *vec; + struct pagemap_scan_private p; + int ret = 0; + + start = (unsigned long)untagged_addr(arg->start); + vec = (struct page_region *)(unsigned long)untagged_addr(arg->vec); + + /* Validate memory ranges */ + if ((!IS_ALIGNED(start, PAGE_SIZE)) || (!access_ok((void __user *)start, arg->len))) + return -EINVAL; + if (IS_GET_OP(arg) && ((arg->vec_len == 0) || + (!access_ok((void __user *)vec, arg->vec_len * sizeof(struct page_region))))) + return -EINVAL; + + /* Detect illegal flags and masks */ + if ((arg->flags & ~PAGEMAP_WP_ENGAGE) || (arg->required_mask & ~PAGEMAP_BITS_ALL) || + (arg->anyof_mask & ~PAGEMAP_BITS_ALL) || (arg->excluded_mask & ~PAGEMAP_BITS_ALL) || + (arg->return_mask & ~PAGEMAP_BITS_ALL)) + return -EINVAL; + if (IS_GET_OP(arg) && ((!arg->required_mask && !arg->anyof_mask && !arg->excluded_mask) || + !arg->return_mask)) + return -EINVAL; + /* The non-WT flags cannot be obtained if PAGEMAP_WP_ENGAGE is also specified. */ + if (IS_WP_ENGAGE_OP(arg) && ((arg->required_mask & PAGEMAP_NON_WRITTEN_BITS) || + (arg->anyof_mask & PAGEMAP_NON_WRITTEN_BITS))) + return -EINVAL; + + end = start + arg->len; + p.max_pages = arg->max_pages; + p.found_pages = 0; + p.flags = arg->flags; + p.required_mask = arg->required_mask; + p.anyof_mask = arg->anyof_mask; + p.excluded_mask = arg->excluded_mask; + p.return_mask = arg->return_mask; + p.prev.len = 0; + p.vec_len = (PAGEMAP_WALK_SIZE >> PAGE_SHIFT); + + if (IS_GET_OP(arg)) { + p.vec = kmalloc_array(p.vec_len, sizeof(struct page_region), GFP_KERNEL); + if (!p.vec) + return -ENOMEM; + } else { + p.vec = NULL; + } + __start = __end = start; + while (!ret && __end < end) { + p.vec_index = 0; + empty_slots = arg->vec_len - vec_index; + if (p.vec_len > empty_slots) + p.vec_len = empty_slots; + + __end = (__start + PAGEMAP_WALK_SIZE) & PAGEMAP_WALK_MASK; + if (__end > end) + __end = end; + + mmap_read_lock(mm); + ret = walk_page_range(mm, __start, __end, &pagemap_scan_ops, &p); + mmap_read_unlock(mm); + if (!(!ret || ret == -ENOSPC)) + goto free_data; + + __start = __end; + if (IS_GET_OP(arg) && p.vec_index) { + if (copy_to_user(&vec[vec_index], p.vec, + p.vec_index * sizeof(struct page_region))) { + ret = -EFAULT; + goto free_data; + } + vec_index += p.vec_index; + } + } + ret = export_prev_to_out(&p, vec, &vec_index); + if (!ret) + ret = vec_index; +free_data: + if (IS_GET_OP(arg)) + kfree(p.vec); + + return ret; +} + +static long pagemap_scan_ioctl(struct file *file, unsigned int cmd, unsigned long arg) +{ + struct pagemap_scan_arg __user *uarg = (struct pagemap_scan_arg __user *)arg; + struct mm_struct *mm = file->private_data; + struct pagemap_scan_arg argument; + + if (cmd == PAGEMAP_SCAN) { + if (copy_from_user(&argument, uarg, sizeof(struct pagemap_scan_arg))) + return -EFAULT; + return do_pagemap_cmd(mm, &argument); + } + return -EINVAL; +} + const struct file_operations proc_pagemap_operations = { .llseek = mem_lseek, /* borrow this */ .read = pagemap_read, .open = pagemap_open, .release = pagemap_release, + .unlocked_ioctl = pagemap_scan_ioctl, + .compat_ioctl = pagemap_scan_ioctl, }; #endif /* CONFIG_PROC_PAGE_MONITOR */ diff --git a/include/uapi/linux/fs.h b/include/uapi/linux/fs.h index b7b56871029c..1ae9a8684b48 100644 --- a/include/uapi/linux/fs.h +++ b/include/uapi/linux/fs.h @@ -305,4 +305,54 @@ typedef int __bitwise __kernel_rwf_t; #define RWF_SUPPORTED (RWF_HIPRI | RWF_DSYNC | RWF_SYNC | RWF_NOWAIT |\ RWF_APPEND) +/* Pagemap ioctl */ +#define PAGEMAP_SCAN _IOWR('f', 16, struct pagemap_scan_arg) + +/* Bits are set in the bitmap of the page_region and masks in pagemap_scan_args */ +#define PAGE_IS_WRITTEN (1 << 0) +#define PAGE_IS_FILE (1 << 1) +#define PAGE_IS_PRESENT (1 << 2) +#define PAGE_IS_SWAPPED (1 << 3) + +/* + * struct page_region - Page region with bitmap flags + * @start: Start of the region + * @len: Length of the region + * bitmap: Bits sets for the region + */ +struct page_region { + __u64 start; + __u64 len; + __u64 bitmap; +}; + +/* + * struct pagemap_scan_arg - Pagemap ioctl argument + * @start: Starting address of the region + * @len: Length of the region (All the pages in this length are included) + * @vec: Address of page_region struct array for output + * @vec_len: Length of the page_region struct array + * @max_pages: Optional max return pages + * @flags: Flags for the IOCTL + * @required_mask: Required mask - All of these bits have to be set in the PTE + * @anyof_mask: Any mask - Any of these bits are set in the PTE + * @excluded_mask: Exclude mask - None of these bits are set in the PTE + * @return_mask: Bits that are to be reported in page_region + */ +struct pagemap_scan_arg { + __u64 start; + __u64 len; + __u64 vec; + __u64 vec_len; + __u32 max_pages; + __u32 flags; + __u64 required_mask; + __u64 anyof_mask; + __u64 excluded_mask; + __u64 return_mask; +}; + +/* Special flags */ +#define PAGEMAP_WP_ENGAGE (1 << 0) + #endif /* _UAPI_LINUX_FS_H */
This IOCTL, PAGEMAP_SCAN on pagemap file can be used to get and/or clear the info about page table entries. The following operations are supported in this ioctl: - Get the information if the pages have been written-to (PAGE_IS_WRITTEN), file mapped (PAGE_IS_FILE), present (PAGE_IS_PRESENT) or swapped (PAGE_IS_SWAPPED). - Write-protect the pages (PAGEMAP_WP_ENGAGE) to start finding which pages have been written-to. - Find pages which have been written-to and write protect the pages (atomic PAGE_IS_WRITTEN + PAGEMAP_WP_ENGAGE) To get information about which pages have been written-to and/or write protect the pages, following must be performed first in order: - The userfaultfd file descriptor is created with userfaultfd syscall. - The UFFD_FEATURE_WP_ASYNC feature is set by UFFDIO_API IOCTL. - The memory range is registered with UFFDIO_REGISTER_MODE_WP mode through UFFDIO_REGISTER IOCTL. Then the any part of the registered memory or the whole memory region can be write protected using the UFFDIO_WRITEPROTECT IOCTL or PAGEMAP_SCAN IOCTL. struct pagemap_scan_args is used as the argument of the IOCTL. In this struct: - The range is specified through start and len. - The output buffer of struct page_region array and size is specified as vec and vec_len. - The optional maximum requested pages are specified in the max_pages. - The flags can be specified in the flags field. The PAGEMAP_WP_ENGAGE is the only added flag at this time. - The masks are specified in required_mask, anyof_mask, excluded_ mask and return_mask. This IOCTL can be extended to get information about more PTE bits. This IOCTL doesn't support hugetlbs at the moment. No information about hugetlb can be obtained. This patch has evolved from a basic patch from Gabriel Krisman Bertazi. Signed-off-by: Muhammad Usama Anjum <usama.anjum@collabora.com> --- Changes in v10: - move changes in tools/include/uapi/linux/fs.h to separate patch - update commit message Change in v8: - Correct is_pte_uffd_wp() - Improve readability and error checks - Remove some un-needed code Changes in v7: - Rebase on top of latest next - Fix some corner cases - Base soft-dirty on the uffd wp async - Update the terminologies - Optimize the memory usage inside the ioctl Changes in v6: - Rename variables and update comments - Make IOCTL independent of soft_dirty config - Change masks and bitmap type to _u64 - Improve code quality Changes in v5: - Remove tlb flushing even for clear operation Changes in v4: - Update the interface and implementation Changes in v3: - Tighten the user-kernel interface by using explicit types and add more error checking Changes in v2: - Convert the interface from syscall to ioctl - Remove pidfd support as it doesn't make sense in ioctl --- fs/proc/task_mmu.c | 290 ++++++++++++++++++++++++++++++++++++++++ include/uapi/linux/fs.h | 50 +++++++ 2 files changed, 340 insertions(+)