diff mbox series

[v3,17/35] mm/mmap: write-lock VMA before shrinking or expanding it

Message ID 20230216051750.3125598-18-surenb@google.com (mailing list archive)
State New
Headers show
Series Per-VMA locks | expand

Commit Message

Suren Baghdasaryan Feb. 16, 2023, 5:17 a.m. UTC
vma_expand and vma_shrink change VMA boundaries. Expansion might also
result in freeing of an adjacent VMA. Write-lock affected VMAs to prevent
concurrent page faults.

Signed-off-by: Suren Baghdasaryan <surenb@google.com>
---
 mm/mmap.c | 5 +++++
 1 file changed, 5 insertions(+)

Comments

Liam R. Howlett Feb. 23, 2023, 8:20 p.m. UTC | #1
Reviewed-by: Liam R. Howlett <Liam.Howlett@oracle.com>

* Suren Baghdasaryan <surenb@google.com> [230216 00:18]:
> vma_expand and vma_shrink change VMA boundaries. Expansion might also
> result in freeing of an adjacent VMA. Write-lock affected VMAs to prevent
> concurrent page faults.
> 
> Signed-off-by: Suren Baghdasaryan <surenb@google.com>
> ---
>  mm/mmap.c | 5 +++++
>  1 file changed, 5 insertions(+)
> 
> diff --git a/mm/mmap.c b/mm/mmap.c
> index ec2f8d0af280..f079e5bbcd57 100644
> --- a/mm/mmap.c
> +++ b/mm/mmap.c
> @@ -674,6 +674,9 @@ int vma_expand(struct vma_iterator *vmi, struct vm_area_struct *vma,
>  		ret = dup_anon_vma(vma, next);
>  		if (ret)
>  			return ret;
> +
> +		/* Lock the VMA  before removing it */
> +		vma_start_write(next);
>  	}
>  
>  	init_multi_vma_prep(&vp, vma, NULL, remove_next ? next : NULL, NULL);
> @@ -686,6 +689,7 @@ int vma_expand(struct vma_iterator *vmi, struct vm_area_struct *vma,
>  	if (vma_iter_prealloc(vmi))
>  		goto nomem;
>  
> +	vma_start_write(vma);
>  	vma_adjust_trans_huge(vma, start, end, 0);
>  	/* VMA iterator points to previous, so set to start if necessary */
>  	if (vma_iter_addr(vmi) != start)
> @@ -725,6 +729,7 @@ int vma_shrink(struct vma_iterator *vmi, struct vm_area_struct *vma,
>  	if (vma_iter_prealloc(vmi))
>  		return -ENOMEM;
>  
> +	vma_start_write(vma);
>  	init_vma_prep(&vp, vma);
>  	vma_adjust_trans_huge(vma, start, end, 0);
>  	vma_prepare(&vp);
> -- 
> 2.39.1
>
Liam R. Howlett Feb. 23, 2023, 8:28 p.m. UTC | #2
Wait, I figured a better place to do this.

init_multi_vma_prep() should vma_start_write() on any VMA that is passed
in.. that we we catch any modifications here & in vma_merge(), which I
think is missed in this patch set?


* Liam R. Howlett <Liam.Howlett@Oracle.com> [230223 15:20]:
> Reviewed-by: Liam R. Howlett <Liam.Howlett@oracle.com>
> 
> * Suren Baghdasaryan <surenb@google.com> [230216 00:18]:
> > vma_expand and vma_shrink change VMA boundaries. Expansion might also
> > result in freeing of an adjacent VMA. Write-lock affected VMAs to prevent
> > concurrent page faults.
> > 
> > Signed-off-by: Suren Baghdasaryan <surenb@google.com>
> > ---
> >  mm/mmap.c | 5 +++++
> >  1 file changed, 5 insertions(+)
> > 
> > diff --git a/mm/mmap.c b/mm/mmap.c
> > index ec2f8d0af280..f079e5bbcd57 100644
> > --- a/mm/mmap.c
> > +++ b/mm/mmap.c
> > @@ -674,6 +674,9 @@ int vma_expand(struct vma_iterator *vmi, struct vm_area_struct *vma,
> >  		ret = dup_anon_vma(vma, next);
> >  		if (ret)
> >  			return ret;
> > +
> > +		/* Lock the VMA  before removing it */
> > +		vma_start_write(next);
> >  	}
> >  
> >  	init_multi_vma_prep(&vp, vma, NULL, remove_next ? next : NULL, NULL);
> > @@ -686,6 +689,7 @@ int vma_expand(struct vma_iterator *vmi, struct vm_area_struct *vma,
> >  	if (vma_iter_prealloc(vmi))
> >  		goto nomem;
> >  
> > +	vma_start_write(vma);
> >  	vma_adjust_trans_huge(vma, start, end, 0);
> >  	/* VMA iterator points to previous, so set to start if necessary */
> >  	if (vma_iter_addr(vmi) != start)
> > @@ -725,6 +729,7 @@ int vma_shrink(struct vma_iterator *vmi, struct vm_area_struct *vma,
> >  	if (vma_iter_prealloc(vmi))
> >  		return -ENOMEM;
> >  
> > +	vma_start_write(vma);
> >  	init_vma_prep(&vp, vma);
> >  	vma_adjust_trans_huge(vma, start, end, 0);
> >  	vma_prepare(&vp);
> > -- 
> > 2.39.1
> >
Suren Baghdasaryan Feb. 23, 2023, 9:16 p.m. UTC | #3
On Thu, Feb 23, 2023 at 12:28 PM Liam R. Howlett
<Liam.Howlett@oracle.com> wrote:
>
>
> Wait, I figured a better place to do this.
>
> init_multi_vma_prep() should vma_start_write() on any VMA that is passed
> in.. that we we catch any modifications here & in vma_merge(), which I
> think is missed in this patch set?

Hmm. That looks like a good idea but in that case, why not do the
locking inside vma_prepare() itself? From the description of that
function it sounds like it was designed to acquire locks before VMA
modifications, so would be the ideal location for doing that. WDYT?
The only concern is vma_adjust_trans_huge() being called before
vma_prepare() but I *think* that's safe because
vma_adjust_trans_huge() does its modifications after acquiring PTL
lock, which page fault handlers also have to take. Does that sound
right?

>
>
> * Liam R. Howlett <Liam.Howlett@Oracle.com> [230223 15:20]:
> > Reviewed-by: Liam R. Howlett <Liam.Howlett@oracle.com>
> >
> > * Suren Baghdasaryan <surenb@google.com> [230216 00:18]:
> > > vma_expand and vma_shrink change VMA boundaries. Expansion might also
> > > result in freeing of an adjacent VMA. Write-lock affected VMAs to prevent
> > > concurrent page faults.
> > >
> > > Signed-off-by: Suren Baghdasaryan <surenb@google.com>
> > > ---
> > >  mm/mmap.c | 5 +++++
> > >  1 file changed, 5 insertions(+)
> > >
> > > diff --git a/mm/mmap.c b/mm/mmap.c
> > > index ec2f8d0af280..f079e5bbcd57 100644
> > > --- a/mm/mmap.c
> > > +++ b/mm/mmap.c
> > > @@ -674,6 +674,9 @@ int vma_expand(struct vma_iterator *vmi, struct vm_area_struct *vma,
> > >             ret = dup_anon_vma(vma, next);
> > >             if (ret)
> > >                     return ret;
> > > +
> > > +           /* Lock the VMA  before removing it */
> > > +           vma_start_write(next);
> > >     }
> > >
> > >     init_multi_vma_prep(&vp, vma, NULL, remove_next ? next : NULL, NULL);
> > > @@ -686,6 +689,7 @@ int vma_expand(struct vma_iterator *vmi, struct vm_area_struct *vma,
> > >     if (vma_iter_prealloc(vmi))
> > >             goto nomem;
> > >
> > > +   vma_start_write(vma);
> > >     vma_adjust_trans_huge(vma, start, end, 0);
> > >     /* VMA iterator points to previous, so set to start if necessary */
> > >     if (vma_iter_addr(vmi) != start)
> > > @@ -725,6 +729,7 @@ int vma_shrink(struct vma_iterator *vmi, struct vm_area_struct *vma,
> > >     if (vma_iter_prealloc(vmi))
> > >             return -ENOMEM;
> > >
> > > +   vma_start_write(vma);
> > >     init_vma_prep(&vp, vma);
> > >     vma_adjust_trans_huge(vma, start, end, 0);
> > >     vma_prepare(&vp);
> > > --
> > > 2.39.1
> > >
>
> --
> To unsubscribe from this group and stop receiving emails from it, send an email to kernel-team+unsubscribe@android.com.
>
Liam R. Howlett Feb. 24, 2023, 1:46 a.m. UTC | #4
* Suren Baghdasaryan <surenb@google.com> [230223 16:16]:
> On Thu, Feb 23, 2023 at 12:28 PM Liam R. Howlett
> <Liam.Howlett@oracle.com> wrote:
> >
> >
> > Wait, I figured a better place to do this.
> >
> > init_multi_vma_prep() should vma_start_write() on any VMA that is passed
> > in.. that we we catch any modifications here & in vma_merge(), which I
> > think is missed in this patch set?
> 
> Hmm. That looks like a good idea but in that case, why not do the
> locking inside vma_prepare() itself? From the description of that
> function it sounds like it was designed to acquire locks before VMA
> modifications, so would be the ideal location for doing that. WDYT?

That might be even better.  I think it will result in even less code.

There is also a vma_complete() which might work to call
vma_end_write_all() as well?

> The only concern is vma_adjust_trans_huge() being called before
> vma_prepare() but I *think* that's safe because
> vma_adjust_trans_huge() does its modifications after acquiring PTL
> lock, which page fault handlers also have to take. Does that sound
> right?

I am not sure.  We are certainly safe the way it is, and the PTL has to
be safe for concurrent faults.. but this could alter the walk to a page
table while that walk is occurring and I don't think that happens today.

It might be best to leave the locking order the way you have it, unless
someone can tell us it's safe?

We could pass through the three extra variables that are needed to move
the vma_adjust_trans_huge() call within that function as well?  This
would have the added benefit of having all locking grouped in the one
location, but the argument list would be getting long, however we could
use the struct.

remove & remove2 should be be detached in vma_prepare() or
vma_complete() as well?

> 
> >
> >
> > * Liam R. Howlett <Liam.Howlett@Oracle.com> [230223 15:20]:
> > > Reviewed-by: Liam R. Howlett <Liam.Howlett@oracle.com>
> > >
> > > * Suren Baghdasaryan <surenb@google.com> [230216 00:18]:
> > > > vma_expand and vma_shrink change VMA boundaries. Expansion might also
> > > > result in freeing of an adjacent VMA. Write-lock affected VMAs to prevent
> > > > concurrent page faults.
> > > >
> > > > Signed-off-by: Suren Baghdasaryan <surenb@google.com>
> > > > ---
> > > >  mm/mmap.c | 5 +++++
> > > >  1 file changed, 5 insertions(+)
> > > >
> > > > diff --git a/mm/mmap.c b/mm/mmap.c
> > > > index ec2f8d0af280..f079e5bbcd57 100644
> > > > --- a/mm/mmap.c
> > > > +++ b/mm/mmap.c
> > > > @@ -674,6 +674,9 @@ int vma_expand(struct vma_iterator *vmi, struct vm_area_struct *vma,
> > > >             ret = dup_anon_vma(vma, next);
> > > >             if (ret)
> > > >                     return ret;
> > > > +
> > > > +           /* Lock the VMA  before removing it */
> > > > +           vma_start_write(next);
> > > >     }
> > > >
> > > >     init_multi_vma_prep(&vp, vma, NULL, remove_next ? next : NULL, NULL);
> > > > @@ -686,6 +689,7 @@ int vma_expand(struct vma_iterator *vmi, struct vm_area_struct *vma,
> > > >     if (vma_iter_prealloc(vmi))
> > > >             goto nomem;
> > > >
> > > > +   vma_start_write(vma);
> > > >     vma_adjust_trans_huge(vma, start, end, 0);
> > > >     /* VMA iterator points to previous, so set to start if necessary */
> > > >     if (vma_iter_addr(vmi) != start)
> > > > @@ -725,6 +729,7 @@ int vma_shrink(struct vma_iterator *vmi, struct vm_area_struct *vma,
> > > >     if (vma_iter_prealloc(vmi))
> > > >             return -ENOMEM;
> > > >
> > > > +   vma_start_write(vma);
> > > >     init_vma_prep(&vp, vma);
> > > >     vma_adjust_trans_huge(vma, start, end, 0);
> > > >     vma_prepare(&vp);
> > > > --
> > > > 2.39.1
> > > >
> >
> > --
> > To unsubscribe from this group and stop receiving emails from it, send an email to kernel-team+unsubscribe@android.com.
> >
Suren Baghdasaryan Feb. 24, 2023, 2:06 a.m. UTC | #5
On Thu, Feb 23, 2023 at 5:46 PM Liam R. Howlett <Liam.Howlett@oracle.com> wrote:
>
> * Suren Baghdasaryan <surenb@google.com> [230223 16:16]:
> > On Thu, Feb 23, 2023 at 12:28 PM Liam R. Howlett
> > <Liam.Howlett@oracle.com> wrote:
> > >
> > >
> > > Wait, I figured a better place to do this.
> > >
> > > init_multi_vma_prep() should vma_start_write() on any VMA that is passed
> > > in.. that we we catch any modifications here & in vma_merge(), which I
> > > think is missed in this patch set?
> >
> > Hmm. That looks like a good idea but in that case, why not do the
> > locking inside vma_prepare() itself? From the description of that
> > function it sounds like it was designed to acquire locks before VMA
> > modifications, so would be the ideal location for doing that. WDYT?
>
> That might be even better.  I think it will result in even less code.

Yes.

>
> There is also a vma_complete() which might work to call
> vma_end_write_all() as well?

If there are other VMAs already locked before vma_prepare() then we
would unlock them too. Safer to just let mmap_unlock do
vma_end_write_all().

>
> > The only concern is vma_adjust_trans_huge() being called before
> > vma_prepare() but I *think* that's safe because
> > vma_adjust_trans_huge() does its modifications after acquiring PTL
> > lock, which page fault handlers also have to take. Does that sound
> > right?
>
> I am not sure.  We are certainly safe the way it is, and the PTL has to
> be safe for concurrent faults.. but this could alter the walk to a page
> table while that walk is occurring and I don't think that happens today.
>
> It might be best to leave the locking order the way you have it, unless
> someone can tell us it's safe?

Yes, I have the same feelings about changing this.

>
> We could pass through the three extra variables that are needed to move
> the vma_adjust_trans_huge() call within that function as well?  This
> would have the added benefit of having all locking grouped in the one
> location, but the argument list would be getting long, however we could
> use the struct.

Any issues if I change the order to have vma_prepare() called always
before vma_adjust_trans_huge()? That way the VMA will always be locked
before vma_adjust_trans_huge() executes and we don't need any
additional arguments.

>
> remove & remove2 should be be detached in vma_prepare() or
> vma_complete() as well?

They are marked detached in vma_complete() (see
https://lore.kernel.org/all/20230216051750.3125598-25-surenb@google.com/)
and that should be enough. We should be safe as long as we mark them
detached before unlocking the VMA.

>
> >
> > >
> > >
> > > * Liam R. Howlett <Liam.Howlett@Oracle.com> [230223 15:20]:
> > > > Reviewed-by: Liam R. Howlett <Liam.Howlett@oracle.com>
> > > >
> > > > * Suren Baghdasaryan <surenb@google.com> [230216 00:18]:
> > > > > vma_expand and vma_shrink change VMA boundaries. Expansion might also
> > > > > result in freeing of an adjacent VMA. Write-lock affected VMAs to prevent
> > > > > concurrent page faults.
> > > > >
> > > > > Signed-off-by: Suren Baghdasaryan <surenb@google.com>
> > > > > ---
> > > > >  mm/mmap.c | 5 +++++
> > > > >  1 file changed, 5 insertions(+)
> > > > >
> > > > > diff --git a/mm/mmap.c b/mm/mmap.c
> > > > > index ec2f8d0af280..f079e5bbcd57 100644
> > > > > --- a/mm/mmap.c
> > > > > +++ b/mm/mmap.c
> > > > > @@ -674,6 +674,9 @@ int vma_expand(struct vma_iterator *vmi, struct vm_area_struct *vma,
> > > > >             ret = dup_anon_vma(vma, next);
> > > > >             if (ret)
> > > > >                     return ret;
> > > > > +
> > > > > +           /* Lock the VMA  before removing it */
> > > > > +           vma_start_write(next);
> > > > >     }
> > > > >
> > > > >     init_multi_vma_prep(&vp, vma, NULL, remove_next ? next : NULL, NULL);
> > > > > @@ -686,6 +689,7 @@ int vma_expand(struct vma_iterator *vmi, struct vm_area_struct *vma,
> > > > >     if (vma_iter_prealloc(vmi))
> > > > >             goto nomem;
> > > > >
> > > > > +   vma_start_write(vma);
> > > > >     vma_adjust_trans_huge(vma, start, end, 0);
> > > > >     /* VMA iterator points to previous, so set to start if necessary */
> > > > >     if (vma_iter_addr(vmi) != start)
> > > > > @@ -725,6 +729,7 @@ int vma_shrink(struct vma_iterator *vmi, struct vm_area_struct *vma,
> > > > >     if (vma_iter_prealloc(vmi))
> > > > >             return -ENOMEM;
> > > > >
> > > > > +   vma_start_write(vma);
> > > > >     init_vma_prep(&vp, vma);
> > > > >     vma_adjust_trans_huge(vma, start, end, 0);
> > > > >     vma_prepare(&vp);
> > > > > --
> > > > > 2.39.1
> > > > >
> > >
> > > --
> > > To unsubscribe from this group and stop receiving emails from it, send an email to kernel-team+unsubscribe@android.com.
> > >
>
> --
> To unsubscribe from this group and stop receiving emails from it, send an email to kernel-team+unsubscribe@android.com.
>
Liam R. Howlett Feb. 24, 2023, 4:14 p.m. UTC | #6
* Suren Baghdasaryan <surenb@google.com> [230223 21:06]:
> On Thu, Feb 23, 2023 at 5:46 PM Liam R. Howlett <Liam.Howlett@oracle.com> wrote:
> >
> > * Suren Baghdasaryan <surenb@google.com> [230223 16:16]:
> > > On Thu, Feb 23, 2023 at 12:28 PM Liam R. Howlett
> > > <Liam.Howlett@oracle.com> wrote:
> > > >
> > > >
> > > > Wait, I figured a better place to do this.
> > > >
> > > > init_multi_vma_prep() should vma_start_write() on any VMA that is passed
> > > > in.. that we we catch any modifications here & in vma_merge(), which I
> > > > think is missed in this patch set?
> > >
> > > Hmm. That looks like a good idea but in that case, why not do the
> > > locking inside vma_prepare() itself? From the description of that
> > > function it sounds like it was designed to acquire locks before VMA
> > > modifications, so would be the ideal location for doing that. WDYT?
> >
> > That might be even better.  I think it will result in even less code.
> 
> Yes.
> 
> >
> > There is also a vma_complete() which might work to call
> > vma_end_write_all() as well?
> 
> If there are other VMAs already locked before vma_prepare() then we
> would unlock them too. Safer to just let mmap_unlock do
> vma_end_write_all().
> 
> >
> > > The only concern is vma_adjust_trans_huge() being called before
> > > vma_prepare() but I *think* that's safe because
> > > vma_adjust_trans_huge() does its modifications after acquiring PTL
> > > lock, which page fault handlers also have to take. Does that sound
> > > right?
> >
> > I am not sure.  We are certainly safe the way it is, and the PTL has to
> > be safe for concurrent faults.. but this could alter the walk to a page
> > table while that walk is occurring and I don't think that happens today.
> >
> > It might be best to leave the locking order the way you have it, unless
> > someone can tell us it's safe?
> 
> Yes, I have the same feelings about changing this.
> 
> >
> > We could pass through the three extra variables that are needed to move
> > the vma_adjust_trans_huge() call within that function as well?  This
> > would have the added benefit of having all locking grouped in the one
> > location, but the argument list would be getting long, however we could
> > use the struct.
> 
> Any issues if I change the order to have vma_prepare() called always
> before vma_adjust_trans_huge()? That way the VMA will always be locked
> before vma_adjust_trans_huge() executes and we don't need any
> additional arguments.

I preserved the locking order from __vma_adjust() to ensure there was no
issues.

I am not sure but, looking through the page table information [1], it
seems that vma_adjust_trans_huge() uses the pmd lock, which is part of
the split page table lock.  According to the comment in rmap, it should
be fine to reverse the ordering here.

Instead of:

mmap_lock()
vma_adjust_trans_huge()
	pte_lock
	pte_unlock

vma_prepare()
	mapping->i_mmap_rwsem lock
	anon_vma->rwsem lock

<changes to tree/VMAs>

vma_complete()
	anon_vma->rwsem unlock
	mapping->i_mmap_rwsem unlock

mmap_unlock()

---------

We would have:

mmap_lock()
vma_prepare()
	mapping->i_mmap_rwsem lock
	anon_vma->rwsem lock

vma_adjust_trans_huge()
	pte_lock
	pte_unlock

<changes to tree/VMAs>

vma_complete()
	anon_vma->rwsem unlock
	mapping->i_mmap_rwsem unlock

mmap_unlock()


Essentially, increasing the nesting of the pte lock, but not violating
the ordering.

1. https://docs.kernel.org/mm/split_page_table_lock.html

> 
> >
> > remove & remove2 should be be detached in vma_prepare() or
> > vma_complete() as well?
> 
> They are marked detached in vma_complete() (see
> https://lore.kernel.org/all/20230216051750.3125598-25-surenb@google.com/)
> and that should be enough. We should be safe as long as we mark them
> detached before unlocking the VMA.
> 

Right, Thanks.

...
Suren Baghdasaryan Feb. 24, 2023, 4:19 p.m. UTC | #7
On Fri, Feb 24, 2023 at 8:14 AM Liam R. Howlett <Liam.Howlett@oracle.com> wrote:
>
> * Suren Baghdasaryan <surenb@google.com> [230223 21:06]:
> > On Thu, Feb 23, 2023 at 5:46 PM Liam R. Howlett <Liam.Howlett@oracle.com> wrote:
> > >
> > > * Suren Baghdasaryan <surenb@google.com> [230223 16:16]:
> > > > On Thu, Feb 23, 2023 at 12:28 PM Liam R. Howlett
> > > > <Liam.Howlett@oracle.com> wrote:
> > > > >
> > > > >
> > > > > Wait, I figured a better place to do this.
> > > > >
> > > > > init_multi_vma_prep() should vma_start_write() on any VMA that is passed
> > > > > in.. that we we catch any modifications here & in vma_merge(), which I
> > > > > think is missed in this patch set?
> > > >
> > > > Hmm. That looks like a good idea but in that case, why not do the
> > > > locking inside vma_prepare() itself? From the description of that
> > > > function it sounds like it was designed to acquire locks before VMA
> > > > modifications, so would be the ideal location for doing that. WDYT?
> > >
> > > That might be even better.  I think it will result in even less code.
> >
> > Yes.
> >
> > >
> > > There is also a vma_complete() which might work to call
> > > vma_end_write_all() as well?
> >
> > If there are other VMAs already locked before vma_prepare() then we
> > would unlock them too. Safer to just let mmap_unlock do
> > vma_end_write_all().
> >
> > >
> > > > The only concern is vma_adjust_trans_huge() being called before
> > > > vma_prepare() but I *think* that's safe because
> > > > vma_adjust_trans_huge() does its modifications after acquiring PTL
> > > > lock, which page fault handlers also have to take. Does that sound
> > > > right?
> > >
> > > I am not sure.  We are certainly safe the way it is, and the PTL has to
> > > be safe for concurrent faults.. but this could alter the walk to a page
> > > table while that walk is occurring and I don't think that happens today.
> > >
> > > It might be best to leave the locking order the way you have it, unless
> > > someone can tell us it's safe?
> >
> > Yes, I have the same feelings about changing this.
> >
> > >
> > > We could pass through the three extra variables that are needed to move
> > > the vma_adjust_trans_huge() call within that function as well?  This
> > > would have the added benefit of having all locking grouped in the one
> > > location, but the argument list would be getting long, however we could
> > > use the struct.
> >
> > Any issues if I change the order to have vma_prepare() called always
> > before vma_adjust_trans_huge()? That way the VMA will always be locked
> > before vma_adjust_trans_huge() executes and we don't need any
> > additional arguments.
>
> I preserved the locking order from __vma_adjust() to ensure there was no
> issues.
>
> I am not sure but, looking through the page table information [1], it
> seems that vma_adjust_trans_huge() uses the pmd lock, which is part of
> the split page table lock.  According to the comment in rmap, it should
> be fine to reverse the ordering here.
>
> Instead of:
>
> mmap_lock()
> vma_adjust_trans_huge()
>         pte_lock
>         pte_unlock
>
> vma_prepare()
>         mapping->i_mmap_rwsem lock
>         anon_vma->rwsem lock
>
> <changes to tree/VMAs>
>
> vma_complete()
>         anon_vma->rwsem unlock
>         mapping->i_mmap_rwsem unlock
>
> mmap_unlock()
>
> ---------
>
> We would have:
>
> mmap_lock()
> vma_prepare()
>         mapping->i_mmap_rwsem lock
>         anon_vma->rwsem lock
>
> vma_adjust_trans_huge()
>         pte_lock
>         pte_unlock
>
> <changes to tree/VMAs>
>
> vma_complete()
>         anon_vma->rwsem unlock
>         mapping->i_mmap_rwsem unlock
>
> mmap_unlock()
>
>
> Essentially, increasing the nesting of the pte lock, but not violating
> the ordering.
>
> 1. https://docs.kernel.org/mm/split_page_table_lock.html

Thanks for the confirmation, Liam. I'll make the changes and test over
the weekend. If everything's still fine, I will post the next version
with these and other requested changes on Monday.

>
> >
> > >
> > > remove & remove2 should be be detached in vma_prepare() or
> > > vma_complete() as well?
> >
> > They are marked detached in vma_complete() (see
> > https://lore.kernel.org/all/20230216051750.3125598-25-surenb@google.com/)
> > and that should be enough. We should be safe as long as we mark them
> > detached before unlocking the VMA.
> >
>
> Right, Thanks.
>
> ...
>
> --
> To unsubscribe from this group and stop receiving emails from it, send an email to kernel-team+unsubscribe@android.com.
>
Suren Baghdasaryan Feb. 27, 2023, 5:33 p.m. UTC | #8
On Fri, Feb 24, 2023 at 8:19 AM Suren Baghdasaryan <surenb@google.com> wrote:
>
> On Fri, Feb 24, 2023 at 8:14 AM Liam R. Howlett <Liam.Howlett@oracle.com> wrote:
> >
> > * Suren Baghdasaryan <surenb@google.com> [230223 21:06]:
> > > On Thu, Feb 23, 2023 at 5:46 PM Liam R. Howlett <Liam.Howlett@oracle.com> wrote:
> > > >
> > > > * Suren Baghdasaryan <surenb@google.com> [230223 16:16]:
> > > > > On Thu, Feb 23, 2023 at 12:28 PM Liam R. Howlett
> > > > > <Liam.Howlett@oracle.com> wrote:
> > > > > >
> > > > > >
> > > > > > Wait, I figured a better place to do this.
> > > > > >
> > > > > > init_multi_vma_prep() should vma_start_write() on any VMA that is passed
> > > > > > in.. that we we catch any modifications here & in vma_merge(), which I
> > > > > > think is missed in this patch set?
> > > > >
> > > > > Hmm. That looks like a good idea but in that case, why not do the
> > > > > locking inside vma_prepare() itself? From the description of that
> > > > > function it sounds like it was designed to acquire locks before VMA
> > > > > modifications, so would be the ideal location for doing that. WDYT?
> > > >
> > > > That might be even better.  I think it will result in even less code.
> > >
> > > Yes.
> > >
> > > >
> > > > There is also a vma_complete() which might work to call
> > > > vma_end_write_all() as well?
> > >
> > > If there are other VMAs already locked before vma_prepare() then we
> > > would unlock them too. Safer to just let mmap_unlock do
> > > vma_end_write_all().
> > >
> > > >
> > > > > The only concern is vma_adjust_trans_huge() being called before
> > > > > vma_prepare() but I *think* that's safe because
> > > > > vma_adjust_trans_huge() does its modifications after acquiring PTL
> > > > > lock, which page fault handlers also have to take. Does that sound
> > > > > right?
> > > >
> > > > I am not sure.  We are certainly safe the way it is, and the PTL has to
> > > > be safe for concurrent faults.. but this could alter the walk to a page
> > > > table while that walk is occurring and I don't think that happens today.
> > > >
> > > > It might be best to leave the locking order the way you have it, unless
> > > > someone can tell us it's safe?
> > >
> > > Yes, I have the same feelings about changing this.
> > >
> > > >
> > > > We could pass through the three extra variables that are needed to move
> > > > the vma_adjust_trans_huge() call within that function as well?  This
> > > > would have the added benefit of having all locking grouped in the one
> > > > location, but the argument list would be getting long, however we could
> > > > use the struct.
> > >
> > > Any issues if I change the order to have vma_prepare() called always
> > > before vma_adjust_trans_huge()? That way the VMA will always be locked
> > > before vma_adjust_trans_huge() executes and we don't need any
> > > additional arguments.
> >
> > I preserved the locking order from __vma_adjust() to ensure there was no
> > issues.
> >
> > I am not sure but, looking through the page table information [1], it
> > seems that vma_adjust_trans_huge() uses the pmd lock, which is part of
> > the split page table lock.  According to the comment in rmap, it should
> > be fine to reverse the ordering here.
> >
> > Instead of:
> >
> > mmap_lock()
> > vma_adjust_trans_huge()
> >         pte_lock
> >         pte_unlock
> >
> > vma_prepare()
> >         mapping->i_mmap_rwsem lock
> >         anon_vma->rwsem lock
> >
> > <changes to tree/VMAs>
> >
> > vma_complete()
> >         anon_vma->rwsem unlock
> >         mapping->i_mmap_rwsem unlock
> >
> > mmap_unlock()
> >
> > ---------
> >
> > We would have:
> >
> > mmap_lock()
> > vma_prepare()
> >         mapping->i_mmap_rwsem lock
> >         anon_vma->rwsem lock
> >
> > vma_adjust_trans_huge()
> >         pte_lock
> >         pte_unlock
> >
> > <changes to tree/VMAs>
> >
> > vma_complete()
> >         anon_vma->rwsem unlock
> >         mapping->i_mmap_rwsem unlock
> >
> > mmap_unlock()
> >
> >
> > Essentially, increasing the nesting of the pte lock, but not violating
> > the ordering.
> >
> > 1. https://docs.kernel.org/mm/split_page_table_lock.html
>
> Thanks for the confirmation, Liam. I'll make the changes and test over
> the weekend. If everything's still fine, I will post the next version
> with these and other requested changes on Monday.

Weekend testing results look good with all the changes. I'll post v4 shortly.
Thanks,
Suren.

>
> >
> > >
> > > >
> > > > remove & remove2 should be be detached in vma_prepare() or
> > > > vma_complete() as well?
> > >
> > > They are marked detached in vma_complete() (see
> > > https://lore.kernel.org/all/20230216051750.3125598-25-surenb@google.com/)
> > > and that should be enough. We should be safe as long as we mark them
> > > detached before unlocking the VMA.
> > >
> >
> > Right, Thanks.
> >
> > ...
> >
> > --
> > To unsubscribe from this group and stop receiving emails from it, send an email to kernel-team+unsubscribe@android.com.
> >
diff mbox series

Patch

diff --git a/mm/mmap.c b/mm/mmap.c
index ec2f8d0af280..f079e5bbcd57 100644
--- a/mm/mmap.c
+++ b/mm/mmap.c
@@ -674,6 +674,9 @@  int vma_expand(struct vma_iterator *vmi, struct vm_area_struct *vma,
 		ret = dup_anon_vma(vma, next);
 		if (ret)
 			return ret;
+
+		/* Lock the VMA  before removing it */
+		vma_start_write(next);
 	}
 
 	init_multi_vma_prep(&vp, vma, NULL, remove_next ? next : NULL, NULL);
@@ -686,6 +689,7 @@  int vma_expand(struct vma_iterator *vmi, struct vm_area_struct *vma,
 	if (vma_iter_prealloc(vmi))
 		goto nomem;
 
+	vma_start_write(vma);
 	vma_adjust_trans_huge(vma, start, end, 0);
 	/* VMA iterator points to previous, so set to start if necessary */
 	if (vma_iter_addr(vmi) != start)
@@ -725,6 +729,7 @@  int vma_shrink(struct vma_iterator *vmi, struct vm_area_struct *vma,
 	if (vma_iter_prealloc(vmi))
 		return -ENOMEM;
 
+	vma_start_write(vma);
 	init_vma_prep(&vp, vma);
 	vma_adjust_trans_huge(vma, start, end, 0);
 	vma_prepare(&vp);