Message ID | 20210501144110.8784-2-peterx@redhat.com (mailing list archive) |
---|---|
State | New, archived |
Series | mm/hugetlb: Fix issues on file sealing and fork |
On 5/1/21 7:41 AM, Peter Xu wrote:
> F_SEAL_FUTURE_WRITE is missing for hugetlb starting from the first day.
> There is a test program for that and it fails constantly.
>
> $ ./memfd_test hugetlbfs
> memfd-hugetlb: CREATE
> memfd-hugetlb: BASIC
> memfd-hugetlb: SEAL-WRITE
> memfd-hugetlb: SEAL-FUTURE-WRITE
> mmap() didn't fail as expected
> Aborted (core dumped)
>
> I think it's probably because no one is really running the hugetlbfs test.
>
> Fix it by checking FUTURE_WRITE also in hugetlbfs_file_mmap() as what we do in
> shmem_mmap(). Generalize a helper for that.
>
> Reported-by: Hugh Dickins <hughd@google.com>
> Signed-off-by: Peter Xu <peterx@redhat.com>
> ---
>  fs/hugetlbfs/inode.c |  5 +++++
>  include/linux/mm.h   | 32 ++++++++++++++++++++++++++++++++
>  mm/shmem.c           | 22 ++++------------------
>  3 files changed, 41 insertions(+), 18 deletions(-)

Thanks Peter and Hugh!

One question below,

>
> diff --git a/fs/hugetlbfs/inode.c b/fs/hugetlbfs/inode.c
> index a2a42335e8fd2..39922c0f2fc8c 100644
> --- a/fs/hugetlbfs/inode.c
> +++ b/fs/hugetlbfs/inode.c
> @@ -131,10 +131,15 @@ static void huge_pagevec_release(struct pagevec *pvec)
>  static int hugetlbfs_file_mmap(struct file *file, struct vm_area_struct *vma)
>  {
>  	struct inode *inode = file_inode(file);
> +	struct hugetlbfs_inode_info *info = HUGETLBFS_I(inode);
>  	loff_t len, vma_len;
>  	int ret;
>  	struct hstate *h = hstate_file(file);
>
> +	ret = seal_check_future_write(info->seals, vma);
> +	if (ret)
> +		return ret;
> +
>  	/*
>  	 * vma address alignment (but not the pgoff alignment) has
>  	 * already been checked by prepare_hugepage_range. If you add

The full comment below the code you added is:

	/*
	 * vma address alignment (but not the pgoff alignment) has
	 * already been checked by prepare_hugepage_range. If you add
	 * any error returns here, do so after setting VM_HUGETLB, so
	 * is_vm_hugetlb_page tests below unmap_region go the right
	 * way when do_mmap unwinds (may be important on powerpc
	 * and ia64).
	 */

This comment was added in commit 68589bc35303 by Hugh, although it
appears David Gibson added the reason for the comment in the commit
message:

"If hugetlbfs_file_mmap() returns a failure to do_mmap_pgoff() - for example,
because the given file offset is not hugepage aligned - then do_mmap_pgoff
will go to the unmap_and_free_vma backout path.

But at this stage the vma hasn't been marked as hugepage, and the backout path
will call unmap_region() on it. That will eventually call down to the
non-hugepage version of unmap_page_range(). On ppc64, at least, that will
cause serious problems if there are any existing hugepage pagetable entries in
the vicinity - for example if there are any other hugepage mappings under the
same PUD. unmap_page_range() will trigger a bad_pud() on the hugepage pud
entries. I suspect this will also cause bad problems on ia64, though I don't
have a machine to test it on."

There are still comments in the unmap code about special handling of
ppc64 PUDs. So, this may still be an issue.

I am trying to dig into the code to determine if this is still an
issue. Just curious if you looked into this? Might be simpler and
safer to just put the seal check after setting the VM_HUGETLB flag?

--
Mike Kravetz
Mike,

On Mon, May 03, 2021 at 11:55:41AM -0700, Mike Kravetz wrote:
> On 5/1/21 7:41 AM, Peter Xu wrote:
> > F_SEAL_FUTURE_WRITE is missing for hugetlb starting from the first day.
> > There is a test program for that and it fails constantly.
> >
> > $ ./memfd_test hugetlbfs
> > memfd-hugetlb: CREATE
> > memfd-hugetlb: BASIC
> > memfd-hugetlb: SEAL-WRITE
> > memfd-hugetlb: SEAL-FUTURE-WRITE
> > mmap() didn't fail as expected
> > Aborted (core dumped)
> >
> > I think it's probably because no one is really running the hugetlbfs test.
> >
> > Fix it by checking FUTURE_WRITE also in hugetlbfs_file_mmap() as what we do in
> > shmem_mmap(). Generalize a helper for that.
> >
> > Reported-by: Hugh Dickins <hughd@google.com>
> > Signed-off-by: Peter Xu <peterx@redhat.com>
> > ---
> >  fs/hugetlbfs/inode.c |  5 +++++
> >  include/linux/mm.h   | 32 ++++++++++++++++++++++++++++++++
> >  mm/shmem.c           | 22 ++++------------------
> >  3 files changed, 41 insertions(+), 18 deletions(-)
>
> Thanks Peter and Hugh!
>
> One question below,
>
> >
> > diff --git a/fs/hugetlbfs/inode.c b/fs/hugetlbfs/inode.c
> > index a2a42335e8fd2..39922c0f2fc8c 100644
> > --- a/fs/hugetlbfs/inode.c
> > +++ b/fs/hugetlbfs/inode.c
> > @@ -131,10 +131,15 @@ static void huge_pagevec_release(struct pagevec *pvec)
> >  static int hugetlbfs_file_mmap(struct file *file, struct vm_area_struct *vma)
> >  {
> >  	struct inode *inode = file_inode(file);
> > +	struct hugetlbfs_inode_info *info = HUGETLBFS_I(inode);
> >  	loff_t len, vma_len;
> >  	int ret;
> >  	struct hstate *h = hstate_file(file);
> >
> > +	ret = seal_check_future_write(info->seals, vma);
> > +	if (ret)
> > +		return ret;
> > +
> >  	/*
> >  	 * vma address alignment (but not the pgoff alignment) has
> >  	 * already been checked by prepare_hugepage_range. If you add
>
> The full comment below the code you added is:
>
> 	/*
> 	 * vma address alignment (but not the pgoff alignment) has
> 	 * already been checked by prepare_hugepage_range. If you add
> 	 * any error returns here, do so after setting VM_HUGETLB, so
> 	 * is_vm_hugetlb_page tests below unmap_region go the right
> 	 * way when do_mmap unwinds (may be important on powerpc
> 	 * and ia64).
> 	 */
>
> This comment was added in commit 68589bc35303 by Hugh, although it
> appears David Gibson added the reason for the comment in the commit
> message:
>
> "If hugetlbfs_file_mmap() returns a failure to do_mmap_pgoff() - for example,
> because the given file offset is not hugepage aligned - then do_mmap_pgoff
> will go to the unmap_and_free_vma backout path.
>
> But at this stage the vma hasn't been marked as hugepage, and the backout path
> will call unmap_region() on it. That will eventually call down to the
> non-hugepage version of unmap_page_range(). On ppc64, at least, that will
> cause serious problems if there are any existing hugepage pagetable entries in
> the vicinity - for example if there are any other hugepage mappings under the
> same PUD. unmap_page_range() will trigger a bad_pud() on the hugepage pud
> entries. I suspect this will also cause bad problems on ia64, though I don't
> have a machine to test it on."
>
> There are still comments in the unmap code about special handling of
> ppc64 PUDs. So, this may still be an issue.
>
> I am trying to dig into the code to determine if this is still an
> issue. Just curious if you looked into this? Might be simpler and
> safer to just put the seal check after setting the VM_HUGETLB flag?

Good catch! I overlooked that, and I definitely didn't look into it yet.
For now I'd better move that check to be after the flag settings in all cases.

I'll also add:

Fixes: ab3948f58ff84 ("mm/memfd: add an F_SEAL_FUTURE_WRITE seal to memfd")

Thanks,
On 5/3/21 2:31 PM, Peter Xu wrote:
> Mike,
>
> On Mon, May 03, 2021 at 11:55:41AM -0700, Mike Kravetz wrote:
>> On 5/1/21 7:41 AM, Peter Xu wrote:
>>> F_SEAL_FUTURE_WRITE is missing for hugetlb starting from the first day.
>>> There is a test program for that and it fails constantly.
>>>
>>> $ ./memfd_test hugetlbfs
>>> memfd-hugetlb: CREATE
>>> memfd-hugetlb: BASIC
>>> memfd-hugetlb: SEAL-WRITE
>>> memfd-hugetlb: SEAL-FUTURE-WRITE
>>> mmap() didn't fail as expected
>>> Aborted (core dumped)
>>>
>>> I think it's probably because no one is really running the hugetlbfs test.
>>>
>>> Fix it by checking FUTURE_WRITE also in hugetlbfs_file_mmap() as what we do in
>>> shmem_mmap(). Generalize a helper for that.
>>>
>>> Reported-by: Hugh Dickins <hughd@google.com>
>>> Signed-off-by: Peter Xu <peterx@redhat.com>
>>> ---
>>>  fs/hugetlbfs/inode.c |  5 +++++
>>>  include/linux/mm.h   | 32 ++++++++++++++++++++++++++++++++
>>>  mm/shmem.c           | 22 ++++------------------
>>>  3 files changed, 41 insertions(+), 18 deletions(-)
>>
>> Thanks Peter and Hugh!
>>
>> One question below,
>>
>>>
>>> diff --git a/fs/hugetlbfs/inode.c b/fs/hugetlbfs/inode.c
>>> index a2a42335e8fd2..39922c0f2fc8c 100644
>>> --- a/fs/hugetlbfs/inode.c
>>> +++ b/fs/hugetlbfs/inode.c
>>> @@ -131,10 +131,15 @@ static void huge_pagevec_release(struct pagevec *pvec)
>>>  static int hugetlbfs_file_mmap(struct file *file, struct vm_area_struct *vma)
>>>  {
>>>  	struct inode *inode = file_inode(file);
>>> +	struct hugetlbfs_inode_info *info = HUGETLBFS_I(inode);
>>>  	loff_t len, vma_len;
>>>  	int ret;
>>>  	struct hstate *h = hstate_file(file);
>>>
>>> +	ret = seal_check_future_write(info->seals, vma);
>>> +	if (ret)
>>> +		return ret;
>>> +
>>>  	/*
>>>  	 * vma address alignment (but not the pgoff alignment) has
>>>  	 * already been checked by prepare_hugepage_range. If you add
>>
>> The full comment below the code you added is:
>>
>> 	/*
>> 	 * vma address alignment (but not the pgoff alignment) has
>> 	 * already been checked by prepare_hugepage_range. If you add
>> 	 * any error returns here, do so after setting VM_HUGETLB, so
>> 	 * is_vm_hugetlb_page tests below unmap_region go the right
>> 	 * way when do_mmap unwinds (may be important on powerpc
>> 	 * and ia64).
>> 	 */
>>
>> This comment was added in commit 68589bc35303 by Hugh, although it
>> appears David Gibson added the reason for the comment in the commit
>> message:
>>
>> "If hugetlbfs_file_mmap() returns a failure to do_mmap_pgoff() - for example,
>> because the given file offset is not hugepage aligned - then do_mmap_pgoff
>> will go to the unmap_and_free_vma backout path.
>>
>> But at this stage the vma hasn't been marked as hugepage, and the backout path
>> will call unmap_region() on it. That will eventually call down to the
>> non-hugepage version of unmap_page_range(). On ppc64, at least, that will
>> cause serious problems if there are any existing hugepage pagetable entries in
>> the vicinity - for example if there are any other hugepage mappings under the
>> same PUD. unmap_page_range() will trigger a bad_pud() on the hugepage pud
>> entries. I suspect this will also cause bad problems on ia64, though I don't
>> have a machine to test it on."
>>
>> There are still comments in the unmap code about special handling of
>> ppc64 PUDs. So, this may still be an issue.
>>
>> I am trying to dig into the code to determine if this is still an
>> issue. Just curious if you looked into this? Might be simpler and
>> safer to just put the seal check after setting the VM_HUGETLB flag?
>
> Good catch! I overlooked that, and I definitely didn't look into it yet.
> For now I'd better move that check to be after the flag settings in all cases.
>
> I'll also add:
>
> Fixes: ab3948f58ff84 ("mm/memfd: add an F_SEAL_FUTURE_WRITE seal to memfd")
>

Thanks!

With those changes, you can add,

Reviewed-by: Mike Kravetz <mike.kravetz@oracle.com>
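The ordering concern raised in this review can be illustrated with a small user-space model. This is only a sketch under stated assumptions: `struct model_vma`, `MODEL_VM_HUGETLB`, `model_failing_check()` and both `model_mmap_*` functions are hypothetical stand-ins and not kernel code. It shows that an error return placed after the flag is set leaves the vma marked as hugetlb for whatever code unwinds it, while an error return placed before does not.

```c
#include <assert.h>
#include <errno.h>
#include <stddef.h>

/* Hypothetical stand-in; the value does not match the kernel's VM_HUGETLB. */
#define MODEL_VM_HUGETLB 0x1UL

/* Hypothetical, reduced stand-in for struct vm_area_struct. */
struct model_vma {
	unsigned long vm_flags;
};

/* Stand-in for a check that fails, such as a seal check returning -EPERM. */
static int model_failing_check(void)
{
	return -EPERM;
}

/* Ordering as posted: the check can fail before the flag is set. */
static int model_mmap_check_first(struct model_vma *vma)
{
	int ret;

	assert(vma != NULL);
	ret = model_failing_check();
	if (ret)
		return ret;	/* vma is not yet marked as hugetlb */

	vma->vm_flags |= MODEL_VM_HUGETLB;
	return 0;
}

/* Ordering suggested in review: mark the vma first, then allow error returns. */
static int model_mmap_flag_first(struct model_vma *vma)
{
	int ret;

	assert(vma != NULL);
	vma->vm_flags |= MODEL_VM_HUGETLB;

	ret = model_failing_check();
	if (ret)
		return ret;	/* unwind code can still see MODEL_VM_HUGETLB */

	return 0;
}
```

Both functions return the same error; the only difference is the state of `vm_flags` that the caller observes afterwards, which is the property the existing comment in hugetlbfs_file_mmap() asks new error returns to preserve.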
diff --git a/fs/hugetlbfs/inode.c b/fs/hugetlbfs/inode.c
index a2a42335e8fd2..39922c0f2fc8c 100644
--- a/fs/hugetlbfs/inode.c
+++ b/fs/hugetlbfs/inode.c
@@ -131,10 +131,15 @@ static void huge_pagevec_release(struct pagevec *pvec)
 static int hugetlbfs_file_mmap(struct file *file, struct vm_area_struct *vma)
 {
 	struct inode *inode = file_inode(file);
+	struct hugetlbfs_inode_info *info = HUGETLBFS_I(inode);
 	loff_t len, vma_len;
 	int ret;
 	struct hstate *h = hstate_file(file);
 
+	ret = seal_check_future_write(info->seals, vma);
+	if (ret)
+		return ret;
+
 	/*
 	 * vma address alignment (but not the pgoff alignment) has
 	 * already been checked by prepare_hugepage_range. If you add
diff --git a/include/linux/mm.h b/include/linux/mm.h
index 84fb1697b20ff..c3fd7d504a60e 100644
--- a/include/linux/mm.h
+++ b/include/linux/mm.h
@@ -3200,5 +3200,37 @@ extern int sysctl_nr_trim_pages;
 
 void mem_dump_obj(void *object);
 
+/**
+ * seal_check_future_write - Check for F_SEAL_FUTURE_WRITE flag and handle it
+ * @seals: the seals to check
+ * @vma: the vma to operate on
+ *
+ * Check whether F_SEAL_FUTURE_WRITE is set; if so, do proper check/handling on
+ * the vma flags. Return 0 if check pass, or <0 for errors.
+ */
+static inline int seal_check_future_write(int seals, struct vm_area_struct *vma)
+{
+	if (seals & F_SEAL_FUTURE_WRITE) {
+		/*
+		 * New PROT_WRITE and MAP_SHARED mmaps are not allowed when
+		 * "future write" seal active.
+		 */
+		if ((vma->vm_flags & VM_SHARED) && (vma->vm_flags & VM_WRITE))
+			return -EPERM;
+
+		/*
+		 * Since an F_SEAL_FUTURE_WRITE sealed memfd can be mapped as
+		 * MAP_SHARED and read-only, take care to not allow mprotect to
+		 * revert protections on such mappings. Do this only for shared
+		 * mappings. For private mappings, don't need to mask
+		 * VM_MAYWRITE as we still want them to be COW-writable.
+		 */
+		if (vma->vm_flags & VM_SHARED)
+			vma->vm_flags &= ~(VM_MAYWRITE);
+	}
+
+	return 0;
+}
+
 #endif /* __KERNEL__ */
 #endif /* _LINUX_MM_H */
diff --git a/mm/shmem.c b/mm/shmem.c
index 26c76b13ad233..e86a230735b60 100644
--- a/mm/shmem.c
+++ b/mm/shmem.c
@@ -2258,25 +2258,11 @@ int shmem_lock(struct file *file, int lock, struct user_struct *user)
 static int shmem_mmap(struct file *file, struct vm_area_struct *vma)
 {
 	struct shmem_inode_info *info = SHMEM_I(file_inode(file));
+	int ret;
 
-	if (info->seals & F_SEAL_FUTURE_WRITE) {
-		/*
-		 * New PROT_WRITE and MAP_SHARED mmaps are not allowed when
-		 * "future write" seal active.
-		 */
-		if ((vma->vm_flags & VM_SHARED) && (vma->vm_flags & VM_WRITE))
-			return -EPERM;
-
-		/*
-		 * Since an F_SEAL_FUTURE_WRITE sealed memfd can be mapped as
-		 * MAP_SHARED and read-only, take care to not allow mprotect to
-		 * revert protections on such mappings. Do this only for shared
-		 * mappings. For private mappings, don't need to mask
-		 * VM_MAYWRITE as we still want them to be COW-writable.
-		 */
-		if (vma->vm_flags & VM_SHARED)
-			vma->vm_flags &= ~(VM_MAYWRITE);
-	}
+	ret = seal_check_future_write(info->seals, vma);
+	if (ret)
+		return ret;
 
 	/* arm64 - allow memory tagging on RAM-based files */
 	vma->vm_flags |= VM_MTE_ALLOWED;
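The flag handling done by seal_check_future_write() in the diff above can be exercised outside the kernel with a small model. This is a sketch, not the kernel helper: `struct model_vma`, the `MODEL_*` constants (whose values are arbitrary and do not match the kernel's) and `model_seal_check_future_write()` are hypothetical names that only mirror the logic of the patch.

```c
#include <assert.h>
#include <errno.h>
#include <stddef.h>

/* Hypothetical stand-ins; values are arbitrary, not the kernel's. */
#define MODEL_VM_WRITE		0x1UL
#define MODEL_VM_SHARED		0x2UL
#define MODEL_VM_MAYWRITE	0x4UL
#define MODEL_SEAL_FUTURE_WRITE	0x1

/* Hypothetical, reduced stand-in for struct vm_area_struct. */
struct model_vma {
	unsigned long vm_flags;
};

/* Mirrors the logic of the helper in the patch on the model types. */
static int model_seal_check_future_write(int seals, struct model_vma *vma)
{
	assert(vma != NULL);

	if (seals & MODEL_SEAL_FUTURE_WRITE) {
		/* A new shared and writable mapping is refused outright. */
		if ((vma->vm_flags & MODEL_VM_SHARED) &&
		    (vma->vm_flags & MODEL_VM_WRITE))
			return -EPERM;

		/*
		 * A shared read-only mapping is allowed, but loses MAYWRITE so
		 * it cannot be made writable later. Private mappings keep it
		 * and stay COW-writable.
		 */
		if (vma->vm_flags & MODEL_VM_SHARED)
			vma->vm_flags &= ~MODEL_VM_MAYWRITE;
	}

	return 0;
}
```

The model makes the three outcomes easy to see: shared and writable is rejected, shared and read-only is accepted with MAYWRITE cleared, and private mappings pass through unchanged.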
F_SEAL_FUTURE_WRITE is missing for hugetlb starting from the first day.
There is a test program for that and it fails constantly.

$ ./memfd_test hugetlbfs
memfd-hugetlb: CREATE
memfd-hugetlb: BASIC
memfd-hugetlb: SEAL-WRITE
memfd-hugetlb: SEAL-FUTURE-WRITE
mmap() didn't fail as expected
Aborted (core dumped)

I think it's probably because no one is really running the hugetlbfs test.

Fix it by checking FUTURE_WRITE also in hugetlbfs_file_mmap() as what we do in
shmem_mmap(). Generalize a helper for that.

Reported-by: Hugh Dickins <hughd@google.com>
Signed-off-by: Peter Xu <peterx@redhat.com>
---
 fs/hugetlbfs/inode.c |  5 +++++
 include/linux/mm.h   | 32 ++++++++++++++++++++++++++++++++
 mm/shmem.c           | 22 ++++------------------
 3 files changed, 41 insertions(+), 18 deletions(-)
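For readers who want to probe the behaviour described in the commit message without the full memfd_test selftest, the following is a minimal user-space sketch. It is not taken from the selftest: the function name `future_write_seal_blocks_mmap()` and the memfd name "seal-probe" are made up for illustration, and the fallback constants are assumed to match the Linux UAPI headers for cases where libc does not expose them. It calls the memfd_create system call directly through syscall(2).

```c
#ifndef _GNU_SOURCE
#define _GNU_SOURCE
#endif
#include <assert.h>
#include <errno.h>
#include <fcntl.h>
#include <sys/mman.h>
#include <sys/syscall.h>
#include <sys/types.h>
#include <unistd.h>

/* Fallbacks for older libc headers; assumed to match the Linux UAPI values. */
#ifndef F_ADD_SEALS
#define F_ADD_SEALS 1033
#endif
#ifndef F_SEAL_FUTURE_WRITE
#define F_SEAL_FUTURE_WRITE 0x0010
#endif
#ifndef MFD_ALLOW_SEALING
#define MFD_ALLOW_SEALING 0x0002U
#endif
#ifndef MFD_HUGETLB
#define MFD_HUGETLB 0x0004U
#endif

/*
 * Hypothetical probe. Returns 1 if a shared writable mmap() is refused with
 * EPERM after F_SEAL_FUTURE_WRITE is added, 0 if the mapping unexpectedly
 * succeeds, and -1 if the probe could not run (for example no memfd support,
 * or mmap() failing for another reason such as an empty huge page pool).
 */
static int future_write_seal_blocks_mmap(unsigned int extra_flags, size_t len)
{
	void *p;
	int fd, result;

	assert(len > 0);

	fd = (int)syscall(SYS_memfd_create, "seal-probe",
			  MFD_ALLOW_SEALING | extra_flags);
	if (fd < 0)
		return -1;

	if (ftruncate(fd, (off_t)len) < 0 ||
	    fcntl(fd, F_ADD_SEALS, F_SEAL_FUTURE_WRITE) < 0) {
		close(fd);
		return -1;
	}

	p = mmap(NULL, len, PROT_READ | PROT_WRITE, MAP_SHARED, fd, 0);
	if (p == MAP_FAILED) {
		result = (errno == EPERM) ? 1 : -1;
	} else {
		munmap(p, len);
		result = 0;
	}

	close(fd);
	return result;
}
```

Calling `future_write_seal_blocks_mmap(0, 4096)` probes a regular shmem-backed memfd. Calling it with `MFD_HUGETLB` and a length that is a multiple of the huge page size (for example `2UL << 20`, assuming 2 MiB huge pages are configured and available) probes the hugetlbfs case that this patch addresses; a return value of 0 there would correspond to the "mmap() didn't fail as expected" failure shown above.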