Message ID: 20210527190453.1259020-1-pcc@google.com (mailing list archive)
State:      New, archived
Series:     [v4] mm: improve mprotect(R|W) efficiency on pages referenced once
Peter,

On Thu, May 27, 2021 at 12:04:53PM -0700, Peter Collingbourne wrote:

[...]

> +static bool may_avoid_write_fault(pte_t pte, struct vm_area_struct *vma,
> +                                  unsigned long cp_flags)
> +{
> +        if (!(cp_flags & MM_CP_DIRTY_ACCT)) {
> +                if (!(vma_is_anonymous(vma) && (vma->vm_flags & VM_WRITE)))
> +                        return false;
> +
> +                if (page_count(pte_page(pte)) != 1)
> +                        return false;
> +        }

Can we keep MM_CP_DIRTY_ACCT in charge?  IIUC that won't affect your use
case; something like:

        /* Never apply the trick if MM_CP_DIRTY_ACCT is not set */
        if (!(cp_flags & MM_CP_DIRTY_ACCT))
                return false;

The thing is, that's really what MM_CP_DIRTY_ACCT is about, imho (as its
name shows): we should start to rely on the dirty bit for applying the
write bit only if that flag is set.  With the above, I think we can also
drop the pte_uffd_wp() check below, because uffd-wp never applies
MM_CP_DIRTY_ACCT when doing change_protection().

Thanks,

> +
> +        if (!pte_dirty(pte))
> +                return false;
> +
> +        if (!pte_soft_dirty(pte) && (vma->vm_flags & VM_SOFTDIRTY))
> +                return false;
> +
> +        if (pte_uffd_wp(pte))
> +                return false;
> +
> +        return true;
> +}

[...]
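Spelled out in full, the suggestion above would amount to something like the
sketch below (an illustration of the proposal, not code from the thread): the
MM_CP_DIRTY_ACCT test gates the whole function, the anonymous/page_count()
checks go away, and, per the note about uffd-wp, so does the pte_uffd_wp()
check.

        static bool may_avoid_write_fault(pte_t pte, struct vm_area_struct *vma,
                                          unsigned long cp_flags)
        {
                /* Never apply the trick if MM_CP_DIRTY_ACCT is not set */
                if (!(cp_flags & MM_CP_DIRTY_ACCT))
                        return false;

                /* Only already-dirty pages may skip the write fault */
                if (!pte_dirty(pte))
                        return false;

                /* Keep the pre-existing soft-dirty condition */
                if (!pte_soft_dirty(pte) && (vma->vm_flags & VM_SOFTDIRTY))
                        return false;

                return true;
        }

As the reply below points out, in this form the fast path would never be
taken for private anonymous mappings, because MM_CP_DIRTY_ACCT is only set
for mappings that want write notification.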
On Thu, May 27, 2021 at 2:41 PM Peter Xu <peterx@redhat.com> wrote:
>
> Peter,
>
> [...]
>
> Can we keep MM_CP_DIRTY_ACCT in charge?  IIUC that won't affect your use
> case; something like:
>
>         /* Never apply the trick if MM_CP_DIRTY_ACCT is not set */
>         if (!(cp_flags & MM_CP_DIRTY_ACCT))
>                 return false;
>
> The thing is, that's really what MM_CP_DIRTY_ACCT is about, imho (as its
> name shows): we should start to rely on the dirty bit for applying the
> write bit only if that flag is set.  With the above, I think we can also
> drop the pte_uffd_wp() check below, because uffd-wp never applies
> MM_CP_DIRTY_ACCT when doing change_protection().

I don't think that would work.  The anonymous pages that we're interested
in optimizing are private writable pages, for which
vma_wants_writenotify(vma, vma->vm_page_prot) returns false (and thus
MM_CP_DIRTY_ACCT would not be set, and thus your code would disable the
optimization), because of this code at the top of vma_wants_writenotify():

        /* If it was private or non-writable, the write bit is already clear */
        if ((vm_flags & (VM_WRITE|VM_SHARED)) != ((VM_WRITE|VM_SHARED)))
                return 0;

IIUC, dirty accountable is about whether we can always apply the
optimization no matter what the refcount is, so it isn't suitable for
situations where we need to check the refcount.

Peter
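For context, the flag in question is derived in mprotect_fixup(); roughly as
in the sketch below (written from memory of mm/mprotect.c around this kernel
version, so treat the exact lines as an assumption).  This is why a private
anonymous mapping never arrives at change_pte_range() with MM_CP_DIRTY_ACCT
set:

        vma->vm_flags = newflags;
        dirty_accountable = vma_wants_writenotify(vma, vma->vm_page_prot);
        vma_set_page_prot(vma);

        /* MM_CP_DIRTY_ACCT is only passed when write notification is wanted */
        change_protection(vma, start, end, vma->vm_page_prot,
                          dirty_accountable ? MM_CP_DIRTY_ACCT : 0);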
On Thu, May 27, 2021 at 04:37:11PM -0700, Peter Collingbourne wrote:
> On Thu, May 27, 2021 at 2:41 PM Peter Xu <peterx@redhat.com> wrote:
>
> [...]
>
> I don't think that would work.  The anonymous pages that we're interested
> in optimizing are private writable pages, for which
> vma_wants_writenotify(vma, vma->vm_page_prot) returns false (and thus
> MM_CP_DIRTY_ACCT would not be set, and thus your code would disable the
> optimization), because of this code at the top of vma_wants_writenotify():
>
>         /* If it was private or non-writable, the write bit is already clear */
>         if ((vm_flags & (VM_WRITE|VM_SHARED)) != ((VM_WRITE|VM_SHARED)))
>                 return 0;
>
> IIUC, dirty accountable is about whether we can always apply the
> optimization no matter what the refcount is, so it isn't suitable for
> situations where we need to check the refcount.

Ah I see... though it still looks a bit weird, e.g. the first check could
have been done before calling change_protection()?

diff --git a/mm/mprotect.c b/mm/mprotect.c
index 96f4df023439..9270140afbbd 100644
--- a/mm/mprotect.c
+++ b/mm/mprotect.c
@@ -548,7 +548,8 @@ mprotect_fixup(struct vm_area_struct *vma, struct vm_area_struct **pprev,
          * held in write mode.
          */
         vma->vm_flags = newflags;
-        dirty_accountable = vma_wants_writenotify(vma, vma->vm_page_prot);
+        dirty_accountable = vma_wants_writenotify(vma, vma->vm_page_prot) ||
+                            (vma_is_anonymous(vma) && (vma->vm_flags & VM_WRITE));
         vma_set_page_prot(vma);

         change_protection(vma, start, end, vma->vm_page_prot,

Would something like this make the check even faster?

Meanwhile, while looking at the rest I found I cannot understand the other
check in this patch regarding soft dirty:

+        if (!pte_soft_dirty(pte) && (vma->vm_flags & VM_SOFTDIRTY))
+                return false;

I'm wondering why it's not:

+        if (!pte_soft_dirty(pte) && !(vma->vm_flags & VM_SOFTDIRTY))
+                return false;

Then I looked back and it's indeed what the code did before, starting from
commit 64e455079e1b ("mm: softdirty: enable write notifications on VMAs
after VM_SOFTDIRTY cleared", 2014-10-14):

        if (dirty_accountable && pte_dirty(ptent) &&
            (pte_soft_dirty(ptent) ||
             !(vma->vm_flags & VM_SOFTDIRTY)))

However I don't get it...  Shouldn't this be "if soft dirty is set, or soft
dirty tracking is not enabled, then we can grant the write bit"?  The thing
is, afaiu, VM_SOFTDIRTY works in the reversed way in that soft dirty
tracking is enabled only when the flag is cleared.  Hmm...  Am I the only
one who thinks it's wrong?
On Thu, May 27, 2021 at 6:21 PM Peter Xu <peterx@redhat.com> wrote:
>
> [...]
>
> Ah I see... though it still looks a bit weird, e.g. the first check could
> have been done before calling change_protection()?
>
> [...]
>
> -        dirty_accountable = vma_wants_writenotify(vma, vma->vm_page_prot);
> +        dirty_accountable = vma_wants_writenotify(vma, vma->vm_page_prot) ||
> +                            (vma_is_anonymous(vma) && (vma->vm_flags & VM_WRITE));
>
> [...]
>
> Would something like this make the check even faster?

That still doesn't seem like it would work either.  I think we need three
kinds of behavior (glossing over a bunch of details):

- always make the PTE RW for certain shared pages (this is the original
  dirty-accountable behavior)
- don't make it RW except for page_count() == 1 for certain private pages
- don't optimize at all in other cases

A single bit isn't enough to cover all of these possibilities.
> Meanwhile, while looking at the rest I found I cannot understand the other
> check in this patch regarding soft dirty:
>
> +        if (!pte_soft_dirty(pte) && (vma->vm_flags & VM_SOFTDIRTY))
> +                return false;
>
> I'm wondering why it's not:
>
> +        if (!pte_soft_dirty(pte) && !(vma->vm_flags & VM_SOFTDIRTY))
> +                return false;
>
> [...]
>
> However I don't get it...  Shouldn't this be "if soft dirty is set, or soft
> dirty tracking is not enabled, then we can grant the write bit"?  The thing
> is, afaiu, VM_SOFTDIRTY works in the reversed way in that soft dirty
> tracking is enabled only when the flag is cleared.  Hmm...  Am I the only
> one who thinks it's wrong?

No strong opinions here; I'm just preserving the original logic.

Peter
On Thu, May 27, 2021 at 06:35:42PM -0700, Peter Collingbourne wrote:
> On Thu, May 27, 2021 at 6:21 PM Peter Xu <peterx@redhat.com> wrote:
>
> [...]
>
> That still doesn't seem like it would work either.  I think we need three
> kinds of behavior (glossing over a bunch of details):
>
> - always make the PTE RW for certain shared pages (this is the original
>   dirty-accountable behavior)
> - don't make it RW except for page_count() == 1 for certain private pages
> - don't optimize at all in other cases
>
> A single bit isn't enough to cover all of these possibilities.

Yes, I guess you're right.
> > Meanwhile, while looking at the rest I found I cannot understand the
> > other check in this patch regarding soft dirty:
> >
> > +        if (!pte_soft_dirty(pte) && (vma->vm_flags & VM_SOFTDIRTY))
> > +                return false;
> >
> > [...]
> >
> > However I don't get it...  Shouldn't this be "if soft dirty is set, or
> > soft dirty tracking is not enabled, then we can grant the write bit"?
> > The thing is, afaiu, VM_SOFTDIRTY works in the reversed way in that soft
> > dirty tracking is enabled only when the flag is cleared.  Hmm...  Am I
> > the only one who thinks it's wrong?
>
> No strong opinions here; I'm just preserving the original logic.

Yeah, no reason to block your patch; I'll think about it.  It's just that
the first !MM_CP_DIRTY_ACCT check looks very hard to follow, but indeed I
don't have any better idea for how to rewrite it.

Then the patch looks good to me:

Reviewed-by: Peter Xu <peterx@redhat.com>

Thanks,
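Since the VM_SOFTDIRTY question above is easy to read backwards, a small
userspace sketch of the interface being discussed may help (an illustration,
not part of the thread; it assumes CONFIG_MEM_SOFT_DIRTY, a 4K page size,
and omits error handling).  Writing "4" to /proc/self/clear_refs clears the
per-PTE soft-dirty bits and clears VM_SOFTDIRTY on the VMAs, which is what
arms write notification; bit 55 of a /proc/self/pagemap entry then reports
whether the page has been written since the last clear:

        #include <fcntl.h>
        #include <stdint.h>
        #include <stdio.h>
        #include <sys/mman.h>
        #include <unistd.h>

        static int page_soft_dirty(void *addr)
        {
                uint64_t entry;
                int fd = open("/proc/self/pagemap", O_RDONLY);

                pread(fd, &entry, sizeof(entry), ((uintptr_t)addr / 4096) * 8);
                close(fd);
                return (entry >> 55) & 1;       /* bit 55: PTE is soft-dirty */
        }

        int main(void)
        {
                char *p = mmap(0, 4096, PROT_READ | PROT_WRITE,
                               MAP_ANONYMOUS | MAP_PRIVATE, -1, 0);
                int fd = open("/proc/self/clear_refs", O_WRONLY);

                p[0] = 1;                       /* fault the page in */
                write(fd, "4", 1);              /* clear soft-dirty; clears VM_SOFTDIRTY too */
                printf("after clear: %d\n", page_soft_dirty(p));
                p[0] = 2;                       /* tracked write */
                printf("after write: %d\n", page_soft_dirty(p));
                close(fd);
                return 0;
        }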
On Fri, May 28, 2021 at 08:32:25AM -0400, Peter Xu wrote:
> On Thu, May 27, 2021 at 06:35:42PM -0700, Peter Collingbourne wrote:
>
> [...]
>
> > That still doesn't seem like it would work either.  I think we need
> > three kinds of behavior (glossing over a bunch of details):
> >
> > - always make the PTE RW for certain shared pages (this is the original
> >   dirty-accountable behavior)
> > - don't make it RW except for page_count() == 1 for certain private pages
> > - don't optimize at all in other cases
> >
> > A single bit isn't enough to cover all of these possibilities.
>
> Yes, I guess you're right.
>
> [...]
>
> Then the patch looks good to me:
>
> Reviewed-by: Peter Xu <peterx@redhat.com>

Sorry, I may need to retract the ACK here...

TL;DR: we'd need to skip the optimization, for safety, when MM_CP_PROT_NUMA
is set.

The thing is that there's a report showing that NUMA could have an issue
with the other patch in Andrea's COR branch (which is a nicer one covering
huge pages) if it lacks the prot_numa check there:

https://git.kernel.org/pub/scm/linux/kernel/git/andrea/aa.git/commit/?id=7471fd37fc2b8780fb4a3ab9a5ee9d87752e246a

It's still not clear whether the same issue can happen with this patch,
since it uses page_count() rather than page_mapcount() due to the different
handling in the two branches.  However, to be safe, and until we figure
that out, imho it's better for this patch to also skip the NUMA path
completely.
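Concretely, the safety measure suggested here could be as small as an extra
test at the top of may_avoid_write_fault() (a sketch of the suggestion only;
this is not something that was posted to the thread):

        /*
         * Never grant the write bit from the NUMA hinting path; stay on
         * the safe side until the interaction with NUMA balancing is
         * better understood.
         */
        if (cp_flags & MM_CP_PROT_NUMA)
                return false;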
On Thu, 27 May 2021 12:04:53 -0700 Peter Collingbourne <pcc@google.com> wrote:

> In the Scudo memory allocator [1] we would like to be able to
> detect use-after-free vulnerabilities involving large allocations
> by issuing mprotect(PROT_NONE) on the memory region used for the
> allocation when it is deallocated.  Later on, after the memory
> region has been "quarantined" for a sufficient period of time we
> would like to be able to use it for another allocation by issuing
> mprotect(PROT_READ|PROT_WRITE).
>
> ...
>
> --- a/mm/mprotect.c
> +++ b/mm/mprotect.c
> @@ -35,6 +35,29 @@
>
>  #include "internal.h"
>
> +static bool may_avoid_write_fault(pte_t pte, struct vm_area_struct *vma,
> +                                  unsigned long cp_flags)

Some comments would be nice, and the function is ideally structured to
explain each test: "why" we're testing these things, not "what" we're
testing.

> +{

/* here */

> +        if (!(cp_flags & MM_CP_DIRTY_ACCT)) {
> +                if (!(vma_is_anonymous(vma) && (vma->vm_flags & VM_WRITE)))
> +                        return false;
> +
> +                if (page_count(pte_page(pte)) != 1)
> +                        return false;
> +        }
> +

/* and here */

> +        if (!pte_dirty(pte))
> +                return false;
> +

/* and here */

> +        if (!pte_soft_dirty(pte) && (vma->vm_flags & VM_SOFTDIRTY))
> +                return false;

/* and here */

> +        if (pte_uffd_wp(pte))
> +                return false;
> +
> +        return true;
> +}

[...]

> @@ -132,11 +154,8 @@ static unsigned long change_pte_range(struct vm_area_struct *vma, pmd_t *pmd,
>  	}
>
>  	/* Avoid taking write faults for known dirty pages */

And this comment could be moved to may_avoid_write_fault()'s explanation.

> -	if (dirty_accountable && pte_dirty(ptent) &&
> -			(pte_soft_dirty(ptent) ||
> -			 !(vma->vm_flags & VM_SOFTDIRTY))) {
> +	if (may_avoid_write_fault(ptent, vma, cp_flags))
>  		ptent = pte_mkwrite(ptent);
> -	}
>  	ptep_modify_prot_commit(vma, addr, pte, oldpte, ptent);
>  	pages++;
>  } else if (is_swap_pte(oldpte)) {
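For illustration, here is a sketch of how the function might read with the
kind of "why" comments requested above, including the comment moved up from
the call site (the wording is an assumption, not a version that was posted
to the thread):

        /* Avoid taking write faults for known dirty pages */
        static bool may_avoid_write_fault(pte_t pte, struct vm_area_struct *vma,
                                          unsigned long cp_flags)
        {
                /*
                 * Without dirty accounting we may only trust private
                 * anonymous writable mappings, and only while this PTE
                 * holds the sole reference to the page; otherwise making
                 * the PTE writable could bypass copy-on-write or hide
                 * writes from another user of the page (e.g. a
                 * FOLL_LONGTERM pin).
                 */
                if (!(cp_flags & MM_CP_DIRTY_ACCT)) {
                        if (!(vma_is_anonymous(vma) && (vma->vm_flags & VM_WRITE)))
                                return false;

                        if (page_count(pte_page(pte)) != 1)
                                return false;
                }

                /*
                 * Only pages that are already dirty may skip the write
                 * fault; for clean pages the fault is what keeps dirty
                 * tracking working.
                 */
                if (!pte_dirty(pte))
                        return false;

                /*
                 * Preserve the pre-existing soft-dirty condition (see the
                 * discussion above about which way around the VM_SOFTDIRTY
                 * test should read).
                 */
                if (!pte_soft_dirty(pte) && (vma->vm_flags & VM_SOFTDIRTY))
                        return false;

                /* userfaultfd write-protect must keep seeing write faults */
                if (pte_uffd_wp(pte))
                        return false;

                return true;
        }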
diff --git a/mm/mprotect.c b/mm/mprotect.c
index 94188df1ee55..880c90b5744e 100644
--- a/mm/mprotect.c
+++ b/mm/mprotect.c
@@ -35,6 +35,29 @@
 
 #include "internal.h"
 
+static bool may_avoid_write_fault(pte_t pte, struct vm_area_struct *vma,
+                                  unsigned long cp_flags)
+{
+        if (!(cp_flags & MM_CP_DIRTY_ACCT)) {
+                if (!(vma_is_anonymous(vma) && (vma->vm_flags & VM_WRITE)))
+                        return false;
+
+                if (page_count(pte_page(pte)) != 1)
+                        return false;
+        }
+
+        if (!pte_dirty(pte))
+                return false;
+
+        if (!pte_soft_dirty(pte) && (vma->vm_flags & VM_SOFTDIRTY))
+                return false;
+
+        if (pte_uffd_wp(pte))
+                return false;
+
+        return true;
+}
+
 static unsigned long change_pte_range(struct vm_area_struct *vma, pmd_t *pmd,
                 unsigned long addr, unsigned long end, pgprot_t newprot,
                 unsigned long cp_flags)
@@ -43,7 +66,6 @@ static unsigned long change_pte_range(struct vm_area_struct *vma, pmd_t *pmd,
         spinlock_t *ptl;
         unsigned long pages = 0;
         int target_node = NUMA_NO_NODE;
-        bool dirty_accountable = cp_flags & MM_CP_DIRTY_ACCT;
         bool prot_numa = cp_flags & MM_CP_PROT_NUMA;
         bool uffd_wp = cp_flags & MM_CP_UFFD_WP;
         bool uffd_wp_resolve = cp_flags & MM_CP_UFFD_WP_RESOLVE;
@@ -132,11 +154,8 @@ static unsigned long change_pte_range(struct vm_area_struct *vma, pmd_t *pmd,
                         }
 
                         /* Avoid taking write faults for known dirty pages */
-                        if (dirty_accountable && pte_dirty(ptent) &&
-                                        (pte_soft_dirty(ptent) ||
-                                         !(vma->vm_flags & VM_SOFTDIRTY))) {
+                        if (may_avoid_write_fault(ptent, vma, cp_flags))
                                 ptent = pte_mkwrite(ptent);
-                        }
                         ptep_modify_prot_commit(vma, addr, pte, oldpte, ptent);
                         pages++;
                 } else if (is_swap_pte(oldpte)) {
In the Scudo memory allocator [1] we would like to be able to detect
use-after-free vulnerabilities involving large allocations by issuing
mprotect(PROT_NONE) on the memory region used for the allocation when it is
deallocated.  Later on, after the memory region has been "quarantined" for
a sufficient period of time, we would like to be able to use it for another
allocation by issuing mprotect(PROT_READ|PROT_WRITE).

Before this patch, after removing the write protection, any writes to the
memory region would result in page faults and entering the copy-on-write
code path, even in the usual case where the pages are only referenced by a
single PTE, harming performance unnecessarily.  Make it so that any pages
in anonymous mappings that are only referenced by a single PTE are
immediately made writable during the mprotect so that we can avoid the
page faults.

This program shows the critical syscall sequence that we intend to use in
the allocator:

  #include <string.h>
  #include <sys/mman.h>

  enum { kSize = 131072 };

  int main(int argc, char **argv) {
    char *addr = (char *)mmap(0, kSize, PROT_READ | PROT_WRITE,
                              MAP_ANONYMOUS | MAP_PRIVATE, -1, 0);
    for (int i = 0; i != 100000; ++i) {
      memset(addr, i, kSize);
      mprotect((void *)addr, kSize, PROT_NONE);
      mprotect((void *)addr, kSize, PROT_READ | PROT_WRITE);
    }
  }

The effect of this patch on the above program was measured on a
DragonBoard 845c by taking the median real execution time of 10 runs.

Before: 2.94s
After:  0.66s

The effect was also measured using one of the microbenchmarks that we
normally use to benchmark the allocator [2], after modifying it to make the
appropriate mprotect calls [3].  With an allocation size of 131072 bytes to
trigger the allocator's "large allocation" code path the per-iteration time
was measured as follows:

Before: 27450ns
After:   6010ns

This patch means that we do more work during the mprotect call itself in
exchange for less work when the pages are accessed.  In the worst case, the
pages are not accessed at all.  The effect of this patch in such cases was
measured using the following program:

  #include <string.h>
  #include <sys/mman.h>

  enum { kSize = 131072 };

  int main(int argc, char **argv) {
    char *addr = (char *)mmap(0, kSize, PROT_READ | PROT_WRITE,
                              MAP_ANONYMOUS | MAP_PRIVATE, -1, 0);
    memset(addr, 1, kSize);
    for (int i = 0; i != 100000; ++i) {
  #ifdef PAGE_FAULT
      memset(addr + (i * 4096) % kSize, i, 4096);
  #endif
      mprotect((void *)addr, kSize, PROT_NONE);
      mprotect((void *)addr, kSize, PROT_READ | PROT_WRITE);
    }
  }

With PAGE_FAULT undefined (0 pages touched after removing write protection)
the median real execution time of 100 runs was measured as follows:

Before: 0.330260s
After:  0.338836s

With PAGE_FAULT defined (1 page touched) the measurements were as follows:

Before: 0.438048s
After:  0.355661s

So it seems that even with a single page fault the new approach is faster.

I saw similar results if I adjusted the programs to use a larger mapping
size.  With kSize = 1048576 I get these numbers with PAGE_FAULT undefined:

Before: 1.428988s
After:  1.512016s

i.e. around 5.5% slower.  And these with PAGE_FAULT defined:

Before: 1.518559s
After:  1.524417s

i.e. about the same.

What I think we may conclude from these results is that for smaller
mappings the advantage of the previous approach, although measurable, is
wiped out by a single page fault.
I think we may expect that there should be at least one access resulting in
a page fault (under the previous approach) after making the pages writable,
since the program presumably made the pages writable for a reason.

For larger mappings we may guesstimate that the new approach wins if the
density of future page faults is > 0.4%.  But for mappings that are large
enough for density to matter (not just the absolute number of page faults)
it doesn't seem like the increase in mprotect latency would be very large
relative to the total mprotect execution time.

Signed-off-by: Peter Collingbourne <pcc@google.com>
Link: https://linux-review.googlesource.com/id/I98d75ef90e20330c578871c87494d64b1df3f1b8
Link: [1] https://source.android.com/devices/tech/debug/scudo
Link: [2] https://cs.android.com/android/platform/superproject/+/master:bionic/benchmarks/stdlib_benchmark.cpp;l=53;drc=e8693e78711e8f45ccd2b610e4dbe0b94d551cc9
Link: [3] https://github.com/pcc/llvm-project/commit/scudo-mprotect-secondary2
---
v4:
- check pte_uffd_wp() to ensure that we still see UFFD faults
- check page_count() instead of page_mapcount() to handle non-map
  references (e.g. FOLL_LONGTERM)
- move the check into a separate function

v3:
- check for dirty pages
- refresh the performance numbers

v2:
- improve the commit message

 mm/mprotect.c | 29 ++++++++++++++++++++++++-----
 1 file changed, 24 insertions(+), 5 deletions(-)
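As a rough cross-check of the ~0.4% guesstimate above (my arithmetic, using
the kSize = 1048576 "no fault" numbers for the cost the patch adds to
mprotect() and the kSize = 131072 "one fault" numbers for the saving per
avoided write fault; the commit message does not spell out the derivation):

        #include <stdio.h>

        int main(void)
        {
                /* kSize = 1048576, PAGE_FAULT undefined: added mprotect cost per page (256 pages) */
                double added_ns = (1.512016 - 1.428988) / 100000 / 256 * 1e9;

                /* kSize = 131072, PAGE_FAULT defined: saving per avoided write fault */
                double saved_ns = (0.438048 - 0.355661) / 100000 * 1e9;

                printf("added %.1f ns/page, saved %.0f ns/fault, break-even %.2f%%\n",
                       added_ns, saved_ns, 100.0 * added_ns / saved_ns);
                return 0;
        }

This works out to roughly 3.2 ns of added mprotect cost per page against
roughly 820 ns saved per avoided fault, i.e. a break-even fault density of
about 0.4%, consistent with the estimate above.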