[v3,15/15] mm/mmap: Change vma iteration order in do_vmi_align_munmap()

Message ID 20230724183157.3939892-16-Liam.Howlett@oracle.com (mailing list archive)
State New
Series Reduce preallocations for maple tree

Commit Message

Liam R. Howlett July 24, 2023, 6:31 p.m. UTC
By delaying the setting of prev/next VMA until after the write of NULL,
the probability of the prev/next VMA already being in the CPU cache is
significantly increased, especially for larger munmap operations.  It
also means that prev/next will be loaded closer to when they are used.

This requires changing the loop type when gathering the VMAs that will
be freed.
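
Concretely, the gather loop becomes a do/while so that the
already-loaded start VMA is processed before the iterator advances (a
sketch based on the patch below and on the for_each_vma_range()
definition in include/linux/mm.h, assuming it is still a plain
vma_find() wrapper):

	next = vma;
	do {
		/* split at end if needed, then lock and detach next */
	} for_each_vma_range(*vmi, next, end);

	/*
	 * for_each_vma_range(__vmi, __vma, __end) is defined as
	 *	while (((__vma) = vma_find(&(__vmi), (__end))) != NULL)
	 * so the construct above is an ordinary do/while loop.
	 */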

Since prev will be set later in the function, it is better to reverse
the splitting direction of the start VMA (modify the new_below argument
to __split_vma).

Using vma_iter_prev_range() to walk back to the correct location in
the tree will, for the most part, mean walking within the CPU cache.
Usually, this is two steps vs a node reset and a tree re-walk.
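
Taken together with the late prev/next loading, the tail of
do_vmi_align_munmap() ends up looking roughly like this (condensed
from the patch below; error handling omitted):

	/* Walk back to start; usually a step or two within the node. */
	while (vma_iter_addr(vmi) > start)
		vma_iter_prev_range(vmi);

	error = vma_iter_clear_gfp(vmi, start, end, GFP_KERNEL);
	...
	/* Load prev/next only now, just before unmap_region() needs them. */
	prev = vma_iter_prev_range(vmi);
	next = vma_next(vmi);
	if (next)
		vma_iter_prev_range(vmi);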

Signed-off-by: Liam R. Howlett <Liam.Howlett@oracle.com>
---
 mm/mmap.c | 27 +++++++++++++--------------
 1 file changed, 13 insertions(+), 14 deletions(-)

Comments

Jann Horn Aug. 14, 2023, 3:43 p.m. UTC | #1
@akpm

On Mon, Jul 24, 2023 at 8:31 PM Liam R. Howlett <Liam.Howlett@oracle.com> wrote:
> Since prev will be set later in the function, it is better to reverse
> the splitting direction of the start VMA (modify the new_below argument
> to __split_vma).

It might be a good idea to reorder "mm: always lock new vma before
inserting into vma tree" before this patch.

If you apply this patch without "mm: always lock new vma before
inserting into vma tree", I think move_vma(), when called with a start
address in the middle of a VMA, will behave like this:

 - vma_start_write() [lock the VMA to be moved]
 - move_page_tables() [moves page table entries]
 - do_vmi_munmap()
   - do_vmi_align_munmap()
     - __split_vma()
       - creates a new VMA **covering the moved range** that is **not locked**
       - stores the new VMA in the VMA tree **without locking it** [1]
     - new VMA is locked and removed again [2]
[...]

So after the page tables in the region have already been moved, I
believe there will be a brief window (between [1] and [2]) where page
faults in the region can happen again, which could probably cause new
page tables and PTEs to be created in the region again in that window.
(This can't happen in Linus' current tree because the new VMA created
by __split_vma() only covers the range that is not being moved.)
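
As a timeline (my sketch of the interleaving, not code from the patch;
A is move_vma(), B is a concurrent page fault):

	/*
	 * A: vma_start_write(vma)            [old VMA locked]
	 * A: move_page_tables(...)           [PTEs moved out of the range]
	 * A: __split_vma() publishes a new,
	 *    unlocked VMA over the range     [1]
	 * B:   fault in the range finds the unlocked VMA and succeeds,
	 *      creating fresh page tables/PTEs there
	 * A: new VMA locked and detached     [2]
	 */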

Though I guess that's not going to lead to anything bad, since
do_vmi_munmap() anyway cleans up PTEs and page tables in the region?
So maybe it's not that important.
Andrew Morton Aug. 14, 2023, 7:10 p.m. UTC | #2
On Mon, 14 Aug 2023 17:43:39 +0200 Jann Horn <jannh@google.com> wrote:

> @akpm
> 
> On Mon, Jul 24, 2023 at 8:31 PM Liam R. Howlett <Liam.Howlett@oracle.com> wrote:
> > Since prev will be set later in the function, it is better to reverse
> > the splitting direction of the start VMA (modify the new_below argument
> > to __split_vma).
> 
> It might be a good idea to reorder "mm: always lock new vma before
> inserting into vma tree" before this patch.
> 
> If you apply this patch without "mm: always lock new vma before
> inserting into vma tree", I think move_vma(), when called with a start
> address in the middle of a VMA, will behave like this:
> 
>  - vma_start_write() [lock the VMA to be moved]
>  - move_page_tables() [moves page table entries]
>  - do_vmi_munmap()
>    - do_vmi_align_munmap()
>      - __split_vma()
>        - creates a new VMA **covering the moved range** that is **not locked**
>        - stores the new VMA in the VMA tree **without locking it** [1]
>      - new VMA is locked and removed again [2]
> [...]
> 
> So after the page tables in the region have already been moved, I
> believe there will be a brief window (between [1] and [2]) where page
> faults in the region can happen again, which could probably cause new
> page tables and PTEs to be created in the region again in that window.
> (This can't happen in Linus' current tree because the new VMA created
> by __split_vma() only covers the range that is not being moved.)
> 
> Though I guess that's not going to lead to anything bad, since
> do_vmi_munmap() anyway cleans up PTEs and page tables in the region?
> So maybe it's not that important.

Thanks.  I'd of course prefer not to rebuild mm-stable.  If this ends
up being a hard-to-hit issue during git-bisect searches, I think we can
live with that.
Liam R. Howlett Aug. 14, 2023, 7:18 p.m. UTC | #3
* Jann Horn <jannh@google.com> [230814 11:44]:
> @akpm
> 
> On Mon, Jul 24, 2023 at 8:31 PM Liam R. Howlett <Liam.Howlett@oracle.com> wrote:
> > Since prev will be set later in the function, it is better to reverse
> > the splitting direction of the start VMA (modify the new_below argument
> > to __split_vma).
> 
> It might be a good idea to reorder "mm: always lock new vma before
> inserting into vma tree" before this patch.
> 
> If you apply this patch without "mm: always lock new vma before
> inserting into vma tree", I think move_vma(), when called with a start
> address in the middle of a VMA, will behave like this:
> 
>  - vma_start_write() [lock the VMA to be moved]
>  - move_page_tables() [moves page table entries]
>  - do_vmi_munmap()
>    - do_vmi_align_munmap()
>      - __split_vma()
>        - creates a new VMA **covering the moved range** that is **not locked**
>        - stores the new VMA in the VMA tree **without locking it** [1]
>      - new VMA is locked and removed again [2]
> [...]
> 
> So after the page tables in the region have already been moved, I
> believe there will be a brief window (between [1] and [2]) where page
> faults in the region can happen again, which could probably cause new
> page tables and PTEs to be created in the region again in that window.
> (This can't happen in Linus' current tree because the new VMA created
> by __split_vma() only covers the range that is not being moved.)

Ah, so my reversing of which VMA to keep in the first split call opens a
window where the VMA being removed is not locked.  Good catch.

> 
> Though I guess that's not going to lead to anything bad, since
> do_vmi_munmap() anyway cleans up PTEs and page tables in the region?
> So maybe it's not that important.

do_vmi_munmap() will clean up PTEs from the end of the previous VMA to
the start of the next (even over areas without VMA coverage - because
there are platforms that need this, apparently).
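
Roughly what unmap_region() does for the page tables (a simplified
sketch from memory, not the exact call in mm/mmap.c):

	free_pgtables(&tlb, ..., vma,
		      prev ? prev->vm_end   : FIRST_USER_ADDRESS,
		      next ? next->vm_start : USER_PGTABLES_CEILING);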

I don't have any objections to the ordering, nor do I see an issue
resulting from having it this way... Except for maybe lockdep, so maybe
we should change the ordering of the patch sets just to be safe?

In fact, should we add another check somewhere to ensure we do generate
the warning?  Perhaps to remove_mt() to avoid the exit path hitting it?

Thanks,
Liam
Jann Horn Aug. 14, 2023, 9:22 p.m. UTC | #4
On Mon, Aug 14, 2023 at 10:32 PM Liam R. Howlett
<Liam.Howlett@oracle.com> wrote:
> * Jann Horn <jannh@google.com> [230814 11:44]:
> > @akpm
> >
> > On Mon, Jul 24, 2023 at 8:31 PM Liam R. Howlett <Liam.Howlett@oracle.com> wrote:
> > > Since prev will be set later in the function, it is better to reverse
> > > the splitting direction of the start VMA (modify the new_below argument
> > > to __split_vma).
> >
> > It might be a good idea to reorder "mm: always lock new vma before
> > inserting into vma tree" before this patch.
> >
> > If you apply this patch without "mm: always lock new vma before
> > inserting into vma tree", I think move_vma(), when called with a start
> > address in the middle of a VMA, will behave like this:
> >
> >  - vma_start_write() [lock the VMA to be moved]
> >  - move_page_tables() [moves page table entries]
> >  - do_vmi_munmap()
> >    - do_vmi_align_munmap()
> >      - __split_vma()
> >        - creates a new VMA **covering the moved range** that is **not locked**
> >        - stores the new VMA in the VMA tree **without locking it** [1]
> >      - new VMA is locked and removed again [2]
> > [...]
> >
> > So after the page tables in the region have already been moved, I
> > believe there will be a brief window (between [1] and [2]) where page
> > faults in the region can happen again, which could probably cause new
> > page tables and PTEs to be created in the region again in that window.
> > (This can't happen in Linus' current tree because the new VMA created
> > by __split_vma() only covers the range that is not being moved.)
>
> Ah, so my reversing of which VMA to keep in the first split call opens a
> window where the VMA being removed is not locked.  Good catch.
>
> >
> > Though I guess that's not going to lead to anything bad, since
> > do_vmi_munmap() anyway cleans up PTEs and page tables in the region?
> > So maybe it's not that important.
>
> do_vmi_munmap() will clean up PTEs from the end of the previous VMA to
> the start of the next

Alright, I guess no action is needed here then.

> I don't have any objections to the ordering, nor do I see an issue
> resulting from having it this way... Except for maybe lockdep, so maybe
> we should change the ordering of the patch sets just to be safe?
>
> In fact, should we add another check somewhere to ensure we do generate
> the warning?  Perhaps to remove_mt() to avoid the exit path hitting it?

I'm not sure which lockdep check you mean. do_vmi_align_munmap() is
going to lock the VMAs again before it operates on them; I guess the
only checks that would catch this would be the page table validation
logic or the RSS counter checks on exit?
Liam R. Howlett Aug. 15, 2023, 7:29 a.m. UTC | #5
* Jann Horn <jannh@google.com> [230814 17:22]:
> On Mon, Aug 14, 2023 at 10:32 PM Liam R. Howlett
> <Liam.Howlett@oracle.com> wrote:
> > * Jann Horn <jannh@google.com> [230814 11:44]:
> > > @akpm
> > >
> > > On Mon, Jul 24, 2023 at 8:31 PM Liam R. Howlett <Liam.Howlett@oracle.com> wrote:
> > > > Since prev will be set later in the function, it is better to reverse
> > > > the splitting direction of the start VMA (modify the new_below argument
> > > > to __split_vma).
> > >
> > > It might be a good idea to reorder "mm: always lock new vma before
> > > inserting into vma tree" before this patch.
> > >
> > > If you apply this patch without "mm: always lock new vma before
> > > inserting into vma tree", I think move_vma(), when called with a start
> > > address in the middle of a VMA, will behave like this:
> > >
> > >  - vma_start_write() [lock the VMA to be moved]
> > >  - move_page_tables() [moves page table entries]
> > >  - do_vmi_munmap()
> > >    - do_vmi_align_munmap()
> > >      - __split_vma()
> > >        - creates a new VMA **covering the moved range** that is **not locked**
> > >        - stores the new VMA in the VMA tree **without locking it** [1]
> > >      - new VMA is locked and removed again [2]
> > > [...]
> > >
> > > So after the page tables in the region have already been moved, I
> > > believe there will be a brief window (between [1] and [2]) where page
> > > faults in the region can happen again, which could probably cause new
> > > page tables and PTEs to be created in the region again in that window.
> > > (This can't happen in Linus' current tree because the new VMA created
> > > by __split_vma() only covers the range that is not being moved.)
> >
> > Ah, so my reversing of which VMA to keep in the first split call opens a
> > window where the VMA being removed is not locked.  Good catch.

Looking at this again, I think it exists in Linus' tree and my change
actually removes this window:

-               error = __split_vma(vmi, vma, start, 0);
+               error = __split_vma(vmi, vma, start, 1);
                if (error)
                        goto start_split_failed;

The last argument is "new_below", which means the new VMA will be at the
lower address.  I don't love the int argument or its name, and the
documentation for the split function is lacking.

So, once we split at "start", vm_end = "start" in the new VMA while
start will be in the old VMA.  I then lock the old vma to be removed
(again) and add it to the detached maple tree.

Before my patch, we split the VMA and took the new unlocked VMA for
removal... until I locked the new VMA to be removed and added it to the
detached maple tree.  So there is a window where we write the new split
VMA into the tree prior to locking the VMA, but it is locked before
removal.

This change actually aligns the splitting with the other callers who use
the split_vma() wrapper.
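
A simplified picture of the two cases (my sketch of the __split_vma()
semantics, not kernel code):

	/*
	 * __split_vma(vmi, vma, addr, new_below), splitting at addr:
	 *
	 *   before:          [vm_start ................. vm_end)
	 *                                  ^ addr
	 *
	 *   new_below == 1:  new = [vm_start, addr)  old = [addr, vm_end)
	 *   new_below == 0:  old = [vm_start, addr)  new = [addr, vm_end)
	 *
	 * With new_below == 1, the already-locked old VMA keeps the range
	 * being unmapped, so no unlocked VMA covering it is published.
	 */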

> >
> > >
> > > Though I guess that's not going to lead to anything bad, since
> > > do_vmi_munmap() anyway cleans up PTEs and page tables in the region?
> > > So maybe it's not that important.
> >
> > do_vmi_munmap() will clean up PTEs from the end of the previous VMA to
> > the start of the next
> 
> Alright, I guess no action is needed here then.

I don't see a difference between this and the race that exists after a
page fault ends and another task unmaps the area prior to the faulting
task using the faulted area?

> 
> > I don't have any objections to the ordering, nor do I see an issue
> > resulting from having it this way... Except for maybe lockdep, so maybe
> > we should change the ordering of the patch sets just to be safe?
> >
> > In fact, should we add another check somewhere to ensure we do generate
> > the warning?  Perhaps to remove_mt() to avoid the exit path hitting it?
> 
> I'm not sure which lockdep check you mean. do_vmi_align_munmap() is
> going to lock the VMAs again before it operates on them; I guess the
> only checks that would catch this would be the page table validation
> logic or the RSS counter checks on exit?
> 

I'm trying to add a lockdep check to detect this potential window in the
future, but it won't work as you pointed out, since it will be locked
before removal.  I'm not sure it's worth it since Suren added more
lockdep checks in his series.

I appreciate you really looking at these changes and thinking them
through.

Regards,
Liam
Jann Horn Aug. 15, 2023, 2:19 p.m. UTC | #6
On Tue, Aug 15, 2023 at 9:29 AM Liam R. Howlett <Liam.Howlett@oracle.com> wrote:
>
> * Jann Horn <jannh@google.com> [230814 17:22]:
> > On Mon, Aug 14, 2023 at 10:32 PM Liam R. Howlett
> > <Liam.Howlett@oracle.com> wrote:
> > > * Jann Horn <jannh@google.com> [230814 11:44]:
> > > > @akpm
> > > >
> > > > On Mon, Jul 24, 2023 at 8:31 PM Liam R. Howlett <Liam.Howlett@oracle.com> wrote:
> > > > > Since prev will be set later in the function, it is better to reverse
> > > > > the splitting direction of the start VMA (modify the new_below argument
> > > > > to __split_vma).
> > > >
> > > > It might be a good idea to reorder "mm: always lock new vma before
> > > > inserting into vma tree" before this patch.
> > > >
> > > > If you apply this patch without "mm: always lock new vma before
> > > > inserting into vma tree", I think move_vma(), when called with a start
> > > > address in the middle of a VMA, will behave like this:
> > > >
> > > >  - vma_start_write() [lock the VMA to be moved]
> > > >  - move_page_tables() [moves page table entries]
> > > >  - do_vmi_munmap()
> > > >    - do_vmi_align_munmap()
> > > >      - __split_vma()
> > > >        - creates a new VMA **covering the moved range** that is **not locked**
> > > >        - stores the new VMA in the VMA tree **without locking it** [1]
> > > >      - new VMA is locked and removed again [2]
> > > > [...]
> > > >
> > > > So after the page tables in the region have already been moved, I
> > > > believe there will be a brief window (between [1] and [2]) where page
> > > > faults in the region can happen again, which could probably cause new
> > > > page tables and PTEs to be created in the region again in that window.
> > > > (This can't happen in Linus' current tree because the new VMA created
> > > > by __split_vma() only covers the range that is not being moved.)
> > >
> > > Ah, so my reversing of which VMA to keep in the first split call opens a
> > > window where the VMA being removed is not locked.  Good catch.
>
> Looking at this again, I think it exists in Linus' tree and my change
> actually removes this window:
>
> -               error = __split_vma(vmi, vma, start, 0);
> +               error = __split_vma(vmi, vma, start, 1);
>                 if (error)
>                         goto start_split_failed;
>
> The last argument is "new_below", which means the new VMA will be at the
> lower address.  I don't love the argument of int or the name, also the
> documentation is lacking for the split function.
>
> So, once we split at "start", vm_end = "start" in the new VMA while
> start will be in the old VMA.  I then lock the old vma to be removed
> (again) and add it to the detached maple tree.
>
> Before my patch, we split the VMA and took the new unlocked VMA for
> removal.. until I locked the new vma to be removed and add it to the
> detached maple tree.  So there is a window that we write the new split
> VMA into the tree prior to locking the VMA, but it is locked before
> removal.
>
> This change actually aligns the splitting with the other callers who use
> the split_vma() wrapper.

Oooh, you're right. Sorry, I misread that.

> > >
> > > >
> > > > Though I guess that's not going to lead to anything bad, since
> > > > do_vmi_munmap() anyway cleans up PTEs and page tables in the region?
> > > > So maybe it's not that important.
> > >
> > > do_vmi_munmap() will clean up PTEs from the end of the previous VMA to
> > > the start of the next
> >
> > Alright, I guess no action is needed here then.
>
> I don't see a difference between this and the race that exists after the
> page fault ends and a task unmaps the area prior to the first task using
> the faulted areas?

Yeah, you're right. I was thinking about it in terms of "this is a
weird situation and it would be dangerous if something relied on there
not being any non-empty PTEs in the range", but there's nothing that
relies on it, so that's fine.

Patch

diff --git a/mm/mmap.c b/mm/mmap.c
index 58f7b7038e4c..f11c0d663deb 100644
--- a/mm/mmap.c
+++ b/mm/mmap.c
@@ -2451,20 +2451,17 @@  do_vmi_align_munmap(struct vma_iterator *vmi, struct vm_area_struct *vma,
 		if (end < vma->vm_end && mm->map_count >= sysctl_max_map_count)
 			goto map_count_exceeded;
 
-		error = __split_vma(vmi, vma, start, 0);
+		error = __split_vma(vmi, vma, start, 1);
 		if (error)
 			goto start_split_failed;
-
-		vma = vma_iter_load(vmi);
 	}
 
-	prev = vma_prev(vmi);
-
 	/*
 	 * Detach a range of VMAs from the mm. Using next as a temp variable as
 	 * it is always overwritten.
 	 */
-	for_each_vma_range(*vmi, next, end) {
+	next = vma;
+	do {
 		/* Does it split the end? */
 		if (next->vm_end > end) {
 			error = __split_vma(vmi, next, end, 0);
@@ -2500,13 +2497,7 @@  do_vmi_align_munmap(struct vma_iterator *vmi, struct vm_area_struct *vma,
 		BUG_ON(next->vm_start < start);
 		BUG_ON(next->vm_start > end);
 #endif
-	}
-
-	if (vma_iter_end(vmi) > end)
-		next = vma_iter_load(vmi);
-
-	if (!next)
-		next = vma_next(vmi);
+	} for_each_vma_range(*vmi, next, end);
 
 #if defined(CONFIG_DEBUG_VM_MAPLE_TREE)
 	/* Make sure no VMAs are about to be lost. */
@@ -2527,7 +2518,10 @@  do_vmi_align_munmap(struct vma_iterator *vmi, struct vm_area_struct *vma,
 		BUG_ON(count != test_count);
 	}
 #endif
-	vma_iter_set(vmi, start);
+
+	while (vma_iter_addr(vmi) > start)
+		vma_iter_prev_range(vmi);
+
 	error = vma_iter_clear_gfp(vmi, start, end, GFP_KERNEL);
 	if (error)
 		goto clear_tree_failed;
@@ -2538,6 +2532,11 @@  do_vmi_align_munmap(struct vma_iterator *vmi, struct vm_area_struct *vma,
 	if (unlock)
 		mmap_write_downgrade(mm);
 
+	prev = vma_iter_prev_range(vmi);
+	next = vma_next(vmi);
+	if (next)
+		vma_iter_prev_range(vmi);
+
 	/*
 	 * We can free page tables without write-locking mmap_lock because VMAs
 	 * were isolated before we downgraded mmap_lock.