
[V2] mm: Recheck page table entry with page table lock held

Message ID 20180926031858.9692-1-aneesh.kumar@linux.ibm.com (mailing list archive)
State New, archived

Commit Message

Aneesh Kumar K.V Sept. 26, 2018, 3:18 a.m. UTC
We clear the pte temporarily during read/modify/write update of the pte. If we
take a page fault while the pte is cleared, the application can get SIGBUS. One
such case is with remap_pfn_range without a backing vm_ops->fault callback.
do_fault will return SIGBUS in that case.

cpu 0		 				cpu1
mprotect()
ptep_modify_prot_start()/pte cleared.
.
.						page fault.
.
.
ptep_modify_prot_commit()

Fix this by taking page table lock and rechecking for pte_none.

Signed-off-by: Aneesh Kumar K.V <aneesh.kumar@linux.ibm.com>
---
V1:
* update commit message.

 mm/memory.c | 31 +++++++++++++++++++++++++++----
 1 file changed, 27 insertions(+), 4 deletions(-)
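
The lock-and-recheck idea described above can be tried outside the kernel with a small userspace model. This is only an illustrative sketch, not the patch itself: `model_pte_slot`, `model_modify_prot_start()`, `model_modify_prot_commit()` and `model_fault_recheck()` are hypothetical names, a C11 `atomic_flag` spinlock stands in for the page table lock, and a stored value of 0 stands in for pte_none(). For brevity the model takes the lock inside the start helper, which is an assumption of the sketch.

```c
#include <assert.h>
#include <stdatomic.h>
#include <stdint.h>

/* Hypothetical userspace stand-ins; none of these are kernel APIs. */
enum model_fault_result { MODEL_FAULT_SIGBUS, MODEL_FAULT_NOPAGE };

struct model_pte_slot {
	atomic_flag ptl;	/* plays the role of the page table lock */
	uint64_t pte;		/* 0 models pte_none() */
};

static void model_slot_init(struct model_pte_slot *slot, uint64_t pte)
{
	atomic_flag_clear(&slot->ptl);
	slot->pte = pte;
}

static void model_lock(struct model_pte_slot *slot)
{
	while (atomic_flag_test_and_set_explicit(&slot->ptl,
						 memory_order_acquire))
		;	/* spin until the current holder releases the lock */
}

static void model_unlock(struct model_pte_slot *slot)
{
	atomic_flag_clear_explicit(&slot->ptl, memory_order_release);
}

/*
 * Start of the R/M/W cycle: take the lock, clear the entry and hand the
 * old value back. Returns with the lock still held.
 */
static uint64_t model_modify_prot_start(struct model_pte_slot *slot)
{
	uint64_t old;

	model_lock(slot);
	old = slot->pte;
	slot->pte = 0;
	return old;
}

/* End of the R/M/W cycle: install the new value and drop the lock. */
static void model_modify_prot_commit(struct model_pte_slot *slot,
				     uint64_t new_pte)
{
	assert(slot->pte == 0);	/* must pair with model_modify_prot_start() */
	slot->pte = new_pte;
	model_unlock(slot);
}

/*
 * Fault side: a lockless read saw an empty entry. Take the lock and look
 * again before deciding. Only an entry that is still empty is fatal.
 */
static enum model_fault_result model_fault_recheck(struct model_pte_slot *slot)
{
	enum model_fault_result ret;

	model_lock(slot);
	ret = slot->pte == 0 ? MODEL_FAULT_SIGBUS : MODEL_FAULT_NOPAGE;
	model_unlock(slot);
	return ret;
}
```

In this model, a thread that calls model_fault_recheck() while another thread is between start and commit spins on the lock, then sees the committed value and gets MODEL_FAULT_NOPAGE, so the access is retried instead of being treated as fatal.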

Comments

Kirill A. Shutemov Sept. 26, 2018, 12:40 p.m. UTC | #1
On Wed, Sep 26, 2018 at 08:48:58AM +0530, Aneesh Kumar K.V wrote:
> We clear the pte temporarily during read/modify/write update of the pte. If we
> take a page fault while the pte is cleared, the application can get SIGBUS. One
> such case is with remap_pfn_range without a backing vm_ops->fault callback.
> do_fault will return SIGBUS in that case.
> 
> cpu 0		 				cpu1
> mprotect()
> ptep_modify_prot_start()/pte cleared.
> .
> .						page fault.
> .
> .
> ptep_modify_prot_commit()
> 
> Fix this by taking page table lock and rechecking for pte_none.
> 
> Signed-off-by: Aneesh Kumar K.V <aneesh.kumar@linux.ibm.com>
> ---
> V1:
> * update commit message.

You chose to stick with VM_FAULT_NOPAGE, that's fine.

Acked-by: Kirill A. Shutemov <kirill.shutemov@linux.intel.com>

Should it be in stable?
Figo.zhang Oct. 25, 2019, 3:13 a.m. UTC | #2
Aneesh Kumar K.V <aneesh.kumar@linux.ibm.com> wrote on Wed, Sep 26, 2018 at 11:19 AM:

> We clear the pte temporarily during read/modify/write update of the pte. If we
> take a page fault while the pte is cleared, the application can get SIGBUS. One
> such case is with remap_pfn_range without a backing vm_ops->fault callback.
> do_fault will return SIGBUS in that case.
>
What does "remap_pfn_range without a backing vm_ops->fault callback" mean? Could
you elaborate on the scenario?
Is this the case where a driver calls remap_pfn_range() in its mmap() file operation?
If so, why would the fault end up in do_fault()?

>
> cpu 0                                           cpu1
> mprotect()
> ptep_modify_prot_start()/pte cleared.
> .
> .                                               page fault.
> .
> .
> ptep_modify_prot_commit()


I am confused by this scenario. CPU0 calls
change_pte_range()->ptep_modify_prot_start() to clear the pte contents, and
on the other thread, in handle_pte_fault(), pte_offset_map() can still get the
pte. The pte pointer is not invalid; it is valid, and only its contents are
all zero. So why would it call into do_fault()?

in  handle_pte_fault():
    vmf->pte = pte_offset_map(vmf->pmd, vmf->address);
    if (!vmf->pte) {
            return do_fault(vmf);
    }



>
>

> Fix this by taking page table lock and rechecking for pte_none.
>
> Signed-off-by: Aneesh Kumar K.V <aneesh.kumar@linux.ibm.com>
> ---
> V1:
> * update commit message.
>
>  mm/memory.c | 31 +++++++++++++++++++++++++++----
>  1 file changed, 27 insertions(+), 4 deletions(-)
>
> diff --git a/mm/memory.c b/mm/memory.c
> index c467102a5cbc..c2f933184303 100644
> --- a/mm/memory.c
> +++ b/mm/memory.c
> @@ -3745,10 +3745,33 @@ static vm_fault_t do_fault(struct vm_fault *vmf)
>         struct vm_area_struct *vma = vmf->vma;
>         vm_fault_t ret;
>
> -       /* The VMA was not fully populated on mmap() or missing VM_DONTEXPAND */
> -       if (!vma->vm_ops->fault)
> -               ret = VM_FAULT_SIGBUS;
> -       else if (!(vmf->flags & FAULT_FLAG_WRITE))
> +       /*
> +        * The VMA was not fully populated on mmap() or missing VM_DONTEXPAND
> +        */
> +       if (!vma->vm_ops->fault) {
> +
> +               /*
> +                * pmd entries won't be marked none during a R/M/W cycle.
> +                */
> +               if (unlikely(pmd_none(*vmf->pmd)))
> +                       ret = VM_FAULT_SIGBUS;
> +               else {
> +                       vmf->ptl = pte_lockptr(vmf->vma->vm_mm, vmf->pmd);
> +                       /*
> +                        * Make sure this is not a temporary clearing of pte
> +                        * by holding ptl and checking again. A R/M/W update
> +                        * of pte involves: take ptl, clearing the pte so that
> +                        * we don't have concurrent modification by hardware
> +                        * followed by an update.
> +                        */
> +                       spin_lock(vmf->ptl);
> +                       if (unlikely(pte_none(*vmf->pte)))
> +                               ret = VM_FAULT_SIGBUS;
> +                       else
> +                               ret = VM_FAULT_NOPAGE;
> +                       spin_unlock(vmf->ptl);
> +               }
> +       } else if (!(vmf->flags & FAULT_FLAG_WRITE))
>                 ret = do_read_fault(vmf);
>         else if (!(vma->vm_flags & VM_SHARED))
>                 ret = do_cow_fault(vmf);
> --
> 2.17.1
>
>
Kirill A. Shutemov Oct. 28, 2019, 12:08 p.m. UTC | #3
On Fri, Oct 25, 2019 at 11:13:58AM +0800, Figo.zhang wrote:
> Aneesh Kumar K.V <aneesh.kumar@linux.ibm.com> wrote on Wed, Sep 26, 2018 at 11:19 AM:
> 
> > We clear the pte temporarily during read/modify/write update of the pte. If we
> > take a page fault while the pte is cleared, the application can get SIGBUS. One
> > such case is with remap_pfn_range without a backing vm_ops->fault callback.
> > do_fault will return SIGBUS in that case.
> >
> What does "remap_pfn_range without a backing vm_ops->fault callback" mean? Could
> you elaborate on the scenario?
> Is this the case where a driver calls remap_pfn_range() in its mmap() file operation?
> If so, why would the fault end up in do_fault()?

Because there's no page mapped there during the race.

> >
> > cpu 0                                           cpu1
> > mprotect()
> > ptep_modify_prot_start()/pte cleared.
> > .
> > .                                               page fault.
> > .
> > .
> > ptep_modify_prot_commit()
> 
> 
> I am confused by this scenario. CPU0 calls
> change_pte_range()->ptep_modify_prot_start() to clear the pte contents, and
> on the other thread, in handle_pte_fault(), pte_offset_map() can still get the
> pte. The pte pointer is not invalid; it is valid, and only its contents are
> all zero. So why would it call into do_fault()?
> 
> in  handle_pte_fault():
>     vmf->pte = pte_offset_map(vmf->pmd, vmf->address);
>     if (!vmf->pte) {
>             return do_fault(vmf);
>     }

This case handles the situation when pte is none (clear) or page table is
not allocated at all.
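
The dispatch described in this reply can be modelled in a few lines of userspace C. This is a simplified sketch, assuming the fault handler takes a lockless snapshot of the entry and treats an empty snapshot the same as a missing page table. `model_fault`, `model_path` and `model_handle_pte_fault()` are hypothetical names, and a value of 0 stands in for a none pte.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>
#include <stdint.h>

/* Hypothetical names for illustration; these are not kernel definitions. */
enum model_path {
	MODEL_PATH_DO_FAULT,		/* file/driver mapping, no usable pte */
	MODEL_PATH_DO_ANONYMOUS,	/* anonymous mapping, no usable pte */
	MODEL_PATH_PTE_PRESENT,		/* entry has contents, handled elsewhere */
};

struct model_fault {
	bool pmd_is_none;		/* page table not allocated at all */
	bool vma_is_anonymous;
	const uint64_t *pte_slot;	/* location of the entry, 0 means none */
};

static enum model_path model_handle_pte_fault(const struct model_fault *f)
{
	const uint64_t *pte = NULL;

	if (!f->pmd_is_none) {
		/* Lockless snapshot: a concurrent R/M/W cycle may have zeroed it. */
		uint64_t snapshot = *f->pte_slot;

		if (snapshot != 0)
			pte = f->pte_slot;
	}

	/* A cleared entry and a missing page table take the same branch. */
	if (!pte)
		return f->vma_is_anonymous ? MODEL_PATH_DO_ANONYMOUS
					   : MODEL_PATH_DO_FAULT;

	return MODEL_PATH_PTE_PRESENT;
}
```

The point of the model is that a valid pointer to an all-zero entry is not enough to stay out of the do_fault branch: the decision is made on the entry's contents, which is why a temporarily cleared pte can reach do_fault.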

Patch

diff --git a/mm/memory.c b/mm/memory.c
index c467102a5cbc..c2f933184303 100644
--- a/mm/memory.c
+++ b/mm/memory.c
@@ -3745,10 +3745,33 @@  static vm_fault_t do_fault(struct vm_fault *vmf)
 	struct vm_area_struct *vma = vmf->vma;
 	vm_fault_t ret;
 
-	/* The VMA was not fully populated on mmap() or missing VM_DONTEXPAND */
-	if (!vma->vm_ops->fault)
-		ret = VM_FAULT_SIGBUS;
-	else if (!(vmf->flags & FAULT_FLAG_WRITE))
+	/*
+	 * The VMA was not fully populated on mmap() or missing VM_DONTEXPAND
+	 */
+	if (!vma->vm_ops->fault) {
+
+		/*
+		 * pmd entries won't be marked none during a R/M/W cycle.
+		 */
+		if (unlikely(pmd_none(*vmf->pmd)))
+			ret = VM_FAULT_SIGBUS;
+		else {
+			vmf->ptl = pte_lockptr(vmf->vma->vm_mm, vmf->pmd);
+			/*
+			 * Make sure this is not a temporary clearing of pte
+			 * by holding ptl and checking again. A R/M/W update
+			 * of pte involves: take ptl, clearing the pte so that
+			 * we don't have concurrent modification by hardware
+			 * followed by an update.
+			 */
+			spin_lock(vmf->ptl);
+			if (unlikely(pte_none(*vmf->pte)))
+				ret = VM_FAULT_SIGBUS;
+			else
+				ret = VM_FAULT_NOPAGE;
+			spin_unlock(vmf->ptl);
+		}
+	} else if (!(vmf->flags & FAULT_FLAG_WRITE))
 		ret = do_read_fault(vmf);
 	else if (!(vma->vm_flags & VM_SHARED))
 		ret = do_cow_fault(vmf);