
[1/6] mm/vmalloc: Prevent stale TLBs in fully utilized blocks

Message ID 20230523140002.575854344@linutronix.de (mailing list archive)
State New
Series mm/vmalloc: Assorted fixes and improvements

Commit Message

Thomas Gleixner May 23, 2023, 2:02 p.m. UTC
_vm_unmap_aliases() is used to ensure that no unflushed TLB entries for a
page are left in the system. This is required due to the lazy TLB flush
mechanism in vmalloc.

This is attempted by walking the per CPU free lists, but those do not
contain fully utilized vmap blocks because a block is removed from the
free list once its free space becomes zero.

So the per CPU list iteration does not find the block and if the page was
mapped via such a block and the TLB has not yet been flushed, the guarantee
of _vm_unmap_aliases() that there are no stale TLBs after returning is
broken:

x = vb_alloc() // Removes vmap_block from free list because vb->free became 0
vb_free(x)     // Unmaps page and marks in dirty_min/max range

// Page is reused
vm_unmap_aliases() // Can't find vmap block with the dirty space -> FAIL

So instead of walking the per CPU free lists, walk the per CPU xarrays
which hold pointers to _all_ active blocks in the system including those
removed from the free lists.

Signed-off-by: Thomas Gleixner <tglx@linutronix.de>
---
 mm/vmalloc.c |    3 ++-
 1 file changed, 2 insertions(+), 1 deletion(-)
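
The scenario above comes down to the block accounting in vb_alloc() and
vb_free(). A condensed sketch of the two paths (simplified and abbreviated,
not the exact mm/vmalloc.c code):

    /* Sketch: why a fully utilized block ends up on neither list */
    static void *vb_alloc(unsigned long size, gfp_t gfp_mask)
    {
            ...
            spin_lock(&vb->lock);
            vb->free -= 1UL << order;
            if (vb->free == 0)
                    /* Block leaves the per CPU free list */
                    list_del_rcu(&vb->free_list);
            spin_unlock(&vb->lock);
            ...
    }

    static void vb_free(unsigned long addr, unsigned long size)
    {
            ...
            /* Unmap the page tables, but defer the TLB flush */
            vunmap_range_noflush(addr, addr + size);

            spin_lock(&vb->lock);
            /* Record the unflushed range for a later flush */
            vb->dirty_min = min(vb->dirty_min, offset);
            vb->dirty_max = max(vb->dirty_max, offset + (1UL << order));
            vb->dirty += 1UL << order;
            /* Only a completely dirty block is freed and purged */
            if (vb->dirty == VMAP_BBMAP_BITS) {
                    spin_unlock(&vb->lock);
                    free_vmap_block(vb);
            } else {
                    spin_unlock(&vb->lock);
            }
    }

A block with some mappings still live and some space already unmapped but
not yet flushed therefore sits on neither the free list nor the purge list,
which is exactly the window _vm_unmap_aliases() used to miss.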

Comments

Christoph Hellwig May 23, 2023, 3:17 p.m. UTC | #1
Looks good:

Reviewed-by: Christoph Hellwig <hch@lst.de>
Thomas Gleixner May 23, 2023, 4:40 p.m. UTC | #2
On Tue, May 23 2023 at 17:17, Christoph Hellwig wrote:
> Looks good:
>
> Reviewed-by: Christoph Hellwig <hch@lst.de>

Thanks!

This probably wants a Fixes tag and if I'm not completely off then this
problem exists since 2.6.28: db64fe02258f ("mm: rewrite vmap layer")
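
The tag in question would presumably read:

  Fixes: db64fe02258f ("mm: rewrite vmap layer")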
Uladzislau Rezki (Sony) May 23, 2023, 4:47 p.m. UTC | #3
On Tue, May 23, 2023 at 06:40:40PM +0200, Thomas Gleixner wrote:
> On Tue, May 23 2023 at 17:17, Christoph Hellwig wrote:
> > Looks good:
> >
> > Reviewed-by: Christoph Hellwig <hch@lst.de>
> 
> Thanks!
> 
> This probably wants a Fixes tag and if I'm not completely off then this
> problem exists since 2.6.28: db64fe02258f ("mm: rewrite vmap layer")
> 
Yes. This code is quite old and the commit you pointed to looks correct.

--
Uladzislau Rezki
Lorenzo Stoakes May 23, 2023, 7:18 p.m. UTC | #4
On Tue, May 23, 2023 at 04:02:11PM +0200, Thomas Gleixner wrote:
> _vm_unmap_aliases() is used to ensure that no unflushed TLB entries for a
> page are left in the system. This is required due to the lazy TLB flush
> mechanism in vmalloc.
>
> This is tried to achieve by walking the per CPU free lists, but those do
> not contain fully utilized vmap blocks because they are removed from the
> free list once the blocks free space became zero.

Oh Lord.

>
> So the per CPU list iteration does not find the block and if the page was
> mapped via such a block and the TLB has not yet been flushed, the guarantee
> of _vm_unmap_aliases() that there are no stale TLBs after returning is
> broken:
>
> x = vb_alloc() // Removes vmap_block from free list because vb->free became 0
> vb_free(x)     // Unmaps page and marks in dirty_min/max range
>
> // Page is reused
> vm_unmap_aliases() // Can't find vmap block with the dirty space -> FAIL
>
> So instead of walking the per CPU free lists, walk the per CPU xarrays
> which hold pointers to _all_ active blocks in the system including those
> removed from the free lists.
>
> Signed-off-by: Thomas Gleixner <tglx@linutronix.de>
> ---
>  mm/vmalloc.c |    3 ++-
>  1 file changed, 2 insertions(+), 1 deletion(-)
>
> --- a/mm/vmalloc.c
> +++ b/mm/vmalloc.c
> @@ -2236,9 +2236,10 @@ static void _vm_unmap_aliases(unsigned l
>  	for_each_possible_cpu(cpu) {
>  		struct vmap_block_queue *vbq = &per_cpu(vmap_block_queue, cpu);
>  		struct vmap_block *vb;
> +		unsigned long idx;
>
>  		rcu_read_lock();
> -		list_for_each_entry_rcu(vb, &vbq->free, free_list) {
> +		xa_for_each(&vbq->vmap_blocks, idx, vb) {
>  			spin_lock(&vb->lock);
>  			if (vb->dirty && vb->dirty != VMAP_BBMAP_BITS) {
>  				unsigned long va_start = vb->va->va_start;
>
>

LGTM.

Reviewed-by: Lorenzo Stoakes <lstoakes@gmail.com>
Uladzislau Rezki (Sony) May 24, 2023, 9:19 a.m. UTC | #5
On Tue, May 23, 2023 at 08:18:10PM +0100, Lorenzo Stoakes wrote:
> On Tue, May 23, 2023 at 04:02:11PM +0200, Thomas Gleixner wrote:
> > _vm_unmap_aliases() is used to ensure that no unflushed TLB entries for a
> > page are left in the system. This is required due to the lazy TLB flush
> > mechanism in vmalloc.
> >
> > This is tried to achieve by walking the per CPU free lists, but those do
> > not contain fully utilized vmap blocks because they are removed from the
> > free list once the blocks free space became zero.
> 
> Oh Lord.
> 
> >
> > So the per CPU list iteration does not find the block and if the page was
> > mapped via such a block and the TLB has not yet been flushed, the guarantee
> > of _vm_unmap_aliases() that there are no stale TLBs after returning is
> > broken:
> >
> > x = vb_alloc() // Removes vmap_block from free list because vb->free became 0
> > vb_free(x)     // Unmaps page and marks in dirty_min/max range
> >
> > // Page is reused
> > vm_unmap_aliases() // Can't find vmap block with the dirty space -> FAIL
> >
> > So instead of walking the per CPU free lists, walk the per CPU xarrays
> > which hold pointers to _all_ active blocks in the system including those
> > removed from the free lists.
> >
> > Signed-off-by: Thomas Gleixner <tglx@linutronix.de>
> > ---
> >  mm/vmalloc.c |    3 ++-
> >  1 file changed, 2 insertions(+), 1 deletion(-)
> >
> > --- a/mm/vmalloc.c
> > +++ b/mm/vmalloc.c
> > @@ -2236,9 +2236,10 @@ static void _vm_unmap_aliases(unsigned l
> >  	for_each_possible_cpu(cpu) {
> >  		struct vmap_block_queue *vbq = &per_cpu(vmap_block_queue, cpu);
> >  		struct vmap_block *vb;
> > +		unsigned long idx;
> >
> >  		rcu_read_lock();
> > -		list_for_each_entry_rcu(vb, &vbq->free, free_list) {
> > +		xa_for_each(&vbq->vmap_blocks, idx, vb) {
> >  			spin_lock(&vb->lock);
> >  			if (vb->dirty && vb->dirty != VMAP_BBMAP_BITS) {
> >  				unsigned long va_start = vb->va->va_start;
> >
> >
> 
> LGTM.
> 
> Reviewed-by: Lorenzo Stoakes <lstoakes@gmail.com>

Reviewed-by: Uladzislau Rezki (Sony) <urezki@gmail.com>
Baoquan He May 24, 2023, 9:25 a.m. UTC | #6
On 05/23/23 at 04:02pm, Thomas Gleixner wrote:
> _vm_unmap_aliases() is used to ensure that no unflushed TLB entries for a
> page are left in the system. This is required due to the lazy TLB flush
> mechanism in vmalloc.
> 
> This is tried to achieve by walking the per CPU free lists, but those do
> not contain fully utilized vmap blocks because they are removed from the
> free list once the blocks free space became zero.

The problem description is not accurate. For the va associated with a
vmap_block this is attempted by walking the per CPU free lists, but fully
utilized vmap blocks can still be flushed in __purge_vmap_area_lazy() by
calculating the [min:max] of purge_vmap_area_list, because the va of a
vmap_block is also added to purge_vmap_area_list via vb_free().

Only if we eventually exclude the vmap_block va from the purge list in
__purge_vmap_area_lazy() could we say that a stale TLB flush is missed.
No?

IMHO, this is preparation work for excluding the vmap_block va from the
purge list flushing. Please correct me if I am wrong.

> 
> So the per CPU list iteration does not find the block and if the page was
> mapped via such a block and the TLB has not yet been flushed, the guarantee
> of _vm_unmap_aliases() that there are no stale TLBs after returning is
> broken:
> 
> x = vb_alloc() // Removes vmap_block from free list because vb->free became 0
> vb_free(x)     // Unmaps page and marks in dirty_min/max range
> 
> // Page is reused
> vm_unmap_aliases() // Can't find vmap block with the dirty space -> FAIL
> 
> So instead of walking the per CPU free lists, walk the per CPU xarrays
> which hold pointers to _all_ active blocks in the system including those
> removed from the free lists.
> 
> Signed-off-by: Thomas Gleixner <tglx@linutronix.de>
> ---
>  mm/vmalloc.c |    3 ++-
>  1 file changed, 2 insertions(+), 1 deletion(-)
> 
> --- a/mm/vmalloc.c
> +++ b/mm/vmalloc.c
> @@ -2236,9 +2236,10 @@ static void _vm_unmap_aliases(unsigned l
>  	for_each_possible_cpu(cpu) {
>  		struct vmap_block_queue *vbq = &per_cpu(vmap_block_queue, cpu);
>  		struct vmap_block *vb;
> +		unsigned long idx;
>  
>  		rcu_read_lock();
> -		list_for_each_entry_rcu(vb, &vbq->free, free_list) {
> +		xa_for_each(&vbq->vmap_blocks, idx, vb) {
>  			spin_lock(&vb->lock);
>  			if (vb->dirty && vb->dirty != VMAP_BBMAP_BITS) {
>  				unsigned long va_start = vb->va->va_start;
>
Baoquan He May 24, 2023, 9:32 a.m. UTC | #7
On 05/23/23 at 04:02pm, Thomas Gleixner wrote:
> _vm_unmap_aliases() is used to ensure that no unflushed TLB entries for a
> page are left in the system. This is required due to the lazy TLB flush
> mechanism in vmalloc.
> 
> This is tried to achieve by walking the per CPU free lists, but those do
> not contain fully utilized vmap blocks because they are removed from the
> free list once the blocks free space became zero.
> 
> So the per CPU list iteration does not find the block and if the page was
> mapped via such a block and the TLB has not yet been flushed, the guarantee
> of _vm_unmap_aliases() that there are no stale TLBs after returning is
> broken:
> 
> x = vb_alloc() // Removes vmap_block from free list because vb->free became 0
> vb_free(x)     // Unmaps page and marks in dirty_min/max range
> 
> // Page is reused
> vm_unmap_aliases() // Can't find vmap block with the dirty space -> FAIL
> 
> So instead of walking the per CPU free lists, walk the per CPU xarrays
> which hold pointers to _all_ active blocks in the system including those
> removed from the free lists.
> 
> Signed-off-by: Thomas Gleixner <tglx@linutronix.de>
> ---
>  mm/vmalloc.c |    3 ++-
>  1 file changed, 2 insertions(+), 1 deletion(-)
> 
> --- a/mm/vmalloc.c
> +++ b/mm/vmalloc.c
> @@ -2236,9 +2236,10 @@ static void _vm_unmap_aliases(unsigned l
>  	for_each_possible_cpu(cpu) {
>  		struct vmap_block_queue *vbq = &per_cpu(vmap_block_queue, cpu);
>  		struct vmap_block *vb;
> +		unsigned long idx;
>  
>  		rcu_read_lock();

Do we need to remove this rcu_read_xx() pair, since it marks the RCU
read-side critical section on the vbq->free list?

> -		list_for_each_entry_rcu(vb, &vbq->free, free_list) {
> +		xa_for_each(&vbq->vmap_blocks, idx, vb) {
>  			spin_lock(&vb->lock);
>  			if (vb->dirty && vb->dirty != VMAP_BBMAP_BITS) {
>  				unsigned long va_start = vb->va->va_start;
>
Thomas Gleixner May 24, 2023, 9:51 a.m. UTC | #8
On Wed, May 24 2023 at 17:25, Baoquan He wrote:
> On 05/23/23 at 04:02pm, Thomas Gleixner wrote:
>> _vm_unmap_aliases() is used to ensure that no unflushed TLB entries for a
>> page are left in the system. This is required due to the lazy TLB flush
>> mechanism in vmalloc.
>> 
>> This is tried to achieve by walking the per CPU free lists, but those do
>> not contain fully utilized vmap blocks because they are removed from the
>> free list once the blocks free space became zero.
>
> The problem description is not accurate. This is tried to achieve for
> va associated with vmap_block by walking the per CPU free lists, those
> fully utilized vmap blocks can still be flushed in __purge_vmap_area_lazy()
> by calculating the [min:max] of purge_vmap_area_list, because va of
> vmap_blocks will be added to purge_vmap_area_list too via vb_free().

No. The fully utilized block cannot be purged when there are still
active mappings on it. Again:

  X = vb_alloc()
...  
  Y = vb_alloc()
    vb->free -= order;
    if (!vb->vb_free)
       list_del(vb->free_list);
...
  vb_free(Y)
    vb->dirty += order;
    if (vb->dirty == VMAP_BBMAP_BITS) // Condition is _false_
       free_block(); 

So because $X is not yet unmapped the block is neither on the free list
nor on purge_vmap_area_list.

Thanks,

        tglx
Thomas Gleixner May 24, 2023, 9:52 a.m. UTC | #9
On Wed, May 24 2023 at 17:32, Baoquan He wrote:
> On 05/23/23 at 04:02pm, Thomas Gleixner wrote:
>> @@ -2236,9 +2236,10 @@ static void _vm_unmap_aliases(unsigned l
>>  	for_each_possible_cpu(cpu) {
>>  		struct vmap_block_queue *vbq = &per_cpu(vmap_block_queue, cpu);
>>  		struct vmap_block *vb;
>> +		unsigned long idx;
>>  
>>  		rcu_read_lock();
>
> Do we need to remove this rcu_read_xx() pair since it marks the RCU
> read-side critical section on vbq-free list?

And what protects the xarray lookup?

>> -		list_for_each_entry_rcu(vb, &vbq->free, free_list) {
>> +		xa_for_each(&vbq->vmap_blocks, idx, vb) {
>>  			spin_lock(&vb->lock);
>>  			if (vb->dirty && vb->dirty != VMAP_BBMAP_BITS) {
>>  				unsigned long va_start = vb->va->va_start;
>>
Baoquan He May 24, 2023, 11:24 a.m. UTC | #10
On 05/24/23 at 11:51am, Thomas Gleixner wrote:
> On Wed, May 24 2023 at 17:25, Baoquan He wrote:
> > On 05/23/23 at 04:02pm, Thomas Gleixner wrote:
> >> _vm_unmap_aliases() is used to ensure that no unflushed TLB entries for a
> >> page are left in the system. This is required due to the lazy TLB flush
> >> mechanism in vmalloc.
> >> 
> >> This is tried to achieve by walking the per CPU free lists, but those do
> >> not contain fully utilized vmap blocks because they are removed from the
> >> free list once the blocks free space became zero.
> >
> > The problem description is not accurate. This is tried to achieve for
> > va associated with vmap_block by walking the per CPU free lists, those
> > fully utilized vmap blocks can still be flushed in __purge_vmap_area_lazy()
> > by calculating the [min:max] of purge_vmap_area_list, because va of
> > vmap_blocks will be added to purge_vmap_area_list too via vb_free().
> 
> No. The fully utilized block cannot be purged when there are still
> active mappings on it. Again:
> 
>   X = vb_alloc()
> ...  
>   Y = vb_alloc()
>     vb->free -= order;
>     if (!vb->vb_free)
>        list_del(vb->free_list);
> ...
>   vb_free(Y)
>     vb->dirty += order;
>     if (vb->dirty == VMAP_BBMAP_BITS) // Condition is _false_
>        free_block(); 


   vb_free(Y)
     vb->dirty += order;
     if (vb->dirty == VMAP_BBMAP_BITS) // Condition is _false_
        free_vmap_block(); 
        -->free_vmap_area_noflush()
           -->merge_or_add_vmap_area(va,
                &purge_vmap_area_root, &purge_vmap_area_list);

The last mapped region will be freed and added to purge list via
vb_free(), it will be flushed with other va in purge list. When it's
mapped via vb_alloc(), it's detached from vbq->free list. When it's
freed via vb_alloc(), it's added to purge list, and flushed, the thing
is duplicated flushing, no missing flush seen here?

> 
> So because $X is not yet unmapped the block is neither on the free list
> nor on purge_vmap_area_list.

Yeah, because $X is not yet unmapped, the block could have unmapped part
flushed, or unflushed. For unflushed part, it's got flushed with $X
altogether in the purge list.
Baoquan He May 24, 2023, 11:26 a.m. UTC | #11
On 05/24/23 at 07:24pm, Baoquan He wrote:
> On 05/24/23 at 11:51am, Thomas Gleixner wrote:
> > On Wed, May 24 2023 at 17:25, Baoquan He wrote:
> > > On 05/23/23 at 04:02pm, Thomas Gleixner wrote:
> > >> _vm_unmap_aliases() is used to ensure that no unflushed TLB entries for a
> > >> page are left in the system. This is required due to the lazy TLB flush
> > >> mechanism in vmalloc.
> > >> 
> > >> This is tried to achieve by walking the per CPU free lists, but those do
> > >> not contain fully utilized vmap blocks because they are removed from the
> > >> free list once the blocks free space became zero.
> > >
> > > The problem description is not accurate. This is tried to achieve for
> > > va associated with vmap_block by walking the per CPU free lists, those
> > > fully utilized vmap blocks can still be flushed in __purge_vmap_area_lazy()
> > > by calculating the [min:max] of purge_vmap_area_list, because va of
> > > vmap_blocks will be added to purge_vmap_area_list too via vb_free().
> > 
> > No. The fully utilized block cannot be purged when there are still
> > active mappings on it. Again:
> > 
> >   X = vb_alloc()
> > ...  
> >   Y = vb_alloc()
> >     vb->free -= order;
> >     if (!vb->vb_free)
> >        list_del(vb->free_list);
> > ...
> >   vb_free(Y)
> >     vb->dirty += order;
> >     if (vb->dirty == VMAP_BBMAP_BITS) // Condition is _false_
> >        free_block(); 
> 
> 
>    vb_free(Y)
>      vb->dirty += order;
>      if (vb->dirty == VMAP_BBMAP_BITS) // Condition is _false_
>         free_vmap_block(); 
>         -->free_vmap_area_noflush()
>            -->merge_or_add_vmap_area(va,
>                 &purge_vmap_area_root, &purge_vmap_area_list);
> 
> The last mapped region will be freed and added to purge list via
> vb_free(), it will be flushed with other va in purge list. When it's
> mapped via vb_alloc(), it's detached from vbq->free list. When it's
> freed via vb_alloc(), it's added to purge list, and flushed, the thing
            ~~~~~~~~vb_free(), typo, 
> is duplicated flushing, no missing flush seen here?
> 
> > 
> > So because $X is not yet unmapped the block is neither on the free list
> > nor on purge_vmap_area_list.
> 
> Yeah, because $X is not yet unmapped, the block could have unmapped part
> flushed, or unflushed. For unflushed part, it's got flushed with $X
> altogether in the purge list.
>
Uladzislau Rezki (Sony) May 24, 2023, 11:36 a.m. UTC | #12
On Wed, May 24, 2023 at 07:24:43PM +0800, Baoquan He wrote:
> On 05/24/23 at 11:51am, Thomas Gleixner wrote:
> > On Wed, May 24 2023 at 17:25, Baoquan He wrote:
> > > On 05/23/23 at 04:02pm, Thomas Gleixner wrote:
> > >> _vm_unmap_aliases() is used to ensure that no unflushed TLB entries for a
> > >> page are left in the system. This is required due to the lazy TLB flush
> > >> mechanism in vmalloc.
> > >> 
> > >> This is tried to achieve by walking the per CPU free lists, but those do
> > >> not contain fully utilized vmap blocks because they are removed from the
> > >> free list once the blocks free space became zero.
> > >
> > > The problem description is not accurate. This is tried to achieve for
> > > va associated with vmap_block by walking the per CPU free lists, those
> > > fully utilized vmap blocks can still be flushed in __purge_vmap_area_lazy()
> > > by calculating the [min:max] of purge_vmap_area_list, because va of
> > > vmap_blocks will be added to purge_vmap_area_list too via vb_free().
> > 
> > No. The fully utilized block cannot be purged when there are still
> > active mappings on it. Again:
> > 
> >   X = vb_alloc()
> > ...  
> >   Y = vb_alloc()
> >     vb->free -= order;
> >     if (!vb->vb_free)
> >        list_del(vb->free_list);
> > ...
> >   vb_free(Y)
> >     vb->dirty += order;
> >     if (vb->dirty == VMAP_BBMAP_BITS) // Condition is _false_
> >        free_block(); 
> 
> 
>    vb_free(Y)
>      vb->dirty += order;
>      if (vb->dirty == VMAP_BBMAP_BITS) // Condition is _false_
>         free_vmap_block(); 
>         -->free_vmap_area_noflush()
>            -->merge_or_add_vmap_area(va,
>                 &purge_vmap_area_root, &purge_vmap_area_list);
> 
> The last mapped region will be freed and added to purge list via
> vb_free(), it will be flushed with other va in purge list. When it's
> mapped via vb_alloc(), it's detached from vbq->free list. When it's
> freed via vb_alloc(), it's added to purge list, and flushed, the thing
> is duplicated flushing, no missing flush seen here?
> 
> > 
> > So because $X is not yet unmapped the block is neither on the free list
> > nor on purge_vmap_area_list.
> 
> Yeah, because $X is not yet unmapped, the block could have unmapped part
> flushed, or unflushed. For unflushed part, it's got flushed with $X
> altogether in the purge list.
> 
If we guarantee that the vb is fully flushed, then just do not add it to
the purge list:

@@ -2086,12 +2090,7 @@ static void free_vmap_block(struct vmap_block *vb)
        BUG_ON(tmp != vb);

        z = addr_to_cvz(vb->va->va_start);
-
-       spin_lock(&z->busy.lock);
-       unlink_va(vb->va, &z->busy.root);
-       spin_unlock(&z->busy.lock);
-
-       free_vmap_area_noflush(vb->va);
+       free_vmap_area(vb->va);
        kfree_rcu(vb, rcu_head);
 }

and directly return it to the global vmap heap.

--
Uladzislau Rezki
Thomas Gleixner May 24, 2023, 12:44 p.m. UTC | #13
On Wed, May 24 2023 at 19:24, Baoquan He wrote:
> On 05/24/23 at 11:51am, Thomas Gleixner wrote:
>    vb_free(Y)
>      vb->dirty += order;
>      if (vb->dirty == VMAP_BBMAP_BITS) // Condition is _false_
>         free_vmap_block(); 
>         -->free_vmap_area_noflush()
>            -->merge_or_add_vmap_area(va,
>                 &purge_vmap_area_root, &purge_vmap_area_list);

This is irrelevant. The path is _NOT_ taken. You even copied the
comment:

       if (vb->dirty == VMAP_BBMAP_BITS) // Condition is _false_

Did you actually read what I wrote?

Again: It _CANNOT_ be on the purge list because it has active mappings:

1  X = vb_alloc()
   ...  
   Y = vb_alloc()
     vb->free -= order;               // Free space goes to 0
     if (!vb->vb_free)
2      list_del(vb->free_list);       // Block is removed from free list
   ...
   vb_free(Y)
     vb->dirty += order;
3    if (vb->dirty == VMAP_BBMAP_BITS) // Condition is _false_
                                       // because #1 $X is still mapped
                                       // so block is _NOT_ freed and
                                       // _NOT_ put on the purge list

4   unmap_aliases()
     walk_free_list()           // Does not find it because of #2
     walk_purge_list()          // Does not find it because of #3

If the resulting flush range is not covering the $Y TLBs then stale TLBs
stay around.

The xarray walk finds it and guarantees that the TLBs are gone when
unmap_aliases() returns, which is the whole purpose of that function.

Thanks,

        tglx
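
For reference, the per-block handling which the xarray walk now reaches for
every active block accumulates each block's dirty range into one flush
range, roughly like this (condensed sketch of _vm_unmap_aliases(); start,
end and flush are the function's accumulation variables, the purge
interaction is omitted):

    xa_for_each(&vbq->vmap_blocks, idx, vb) {
            spin_lock(&vb->lock);
            if (vb->dirty && vb->dirty != VMAP_BBMAP_BITS) {
                    unsigned long va_start = vb->va->va_start;
                    unsigned long s, e;

                    /* Convert the dirty page index range to addresses */
                    s = va_start + (vb->dirty_min << PAGE_SHIFT);
                    e = va_start + (vb->dirty_max << PAGE_SHIFT);

                    /* Widen the range which gets flushed at the end */
                    start = min(s, start);
                    end   = max(e, end);
                    flush = 1;
            }
            spin_unlock(&vb->lock);
    }
    /* ... eventually flush_tlb_kernel_range() covers [start, end] ... */

Once a block is reachable through the walk, its recorded dirty_min/dirty_max
range ends up inside the flushed range, which is what restores the
vm_unmap_aliases() guarantee for fully utilized blocks.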
Thomas Gleixner May 24, 2023, 12:49 p.m. UTC | #14
On Wed, May 24 2023 at 13:36, Uladzislau Rezki wrote:
> On Wed, May 24, 2023 at 07:24:43PM +0800, Baoquan He wrote:
>> Yeah, because $X is not yet unmapped, the block could have unmapped part
>> flushed, or unflushed. For unflushed part, it's got flushed with $X
>> altogether in the purge list.
>> 
> If we gurantee that vb is fully flushed, then just do not add it it
> purge list:

There is no such guarantee today and there wont be one even with patch
3/6 applied because:

unmap_aliases()
  flush(dirty_min/max)
  reset(dirty_min/max)

vb_free(lastmapping)
  vunmap_range_noflush();
  vb->dirty += order;
  if (vb->dirty == VMAP_BBMAP_BITS)
     free_block(vb);    <-- the TLBs related to @lastmapping are
                            _NOT_ yet flushed.

Thanks,

        tglx
Baoquan He May 24, 2023, 1:41 p.m. UTC | #15
On 05/24/23 at 02:44pm, Thomas Gleixner wrote:
> On Wed, May 24 2023 at 19:24, Baoquan He wrote:
> > On 05/24/23 at 11:51am, Thomas Gleixner wrote:
> >    vb_free(Y)
> >      vb->dirty += order;
> >      if (vb->dirty == VMAP_BBMAP_BITS) // Condition is _false_
> >         free_vmap_block(); 
> >         -->free_vmap_area_noflush()
> >            -->merge_or_add_vmap_area(va,
> >                 &purge_vmap_area_root, &purge_vmap_area_list);
> 
> This is irrelevant. The path is _NOT_ taken. You even copied the
> comment:

Ah, I just copied the whole block and didn't notice that. I was not
careful enough.

> 
>        if (vb->dirty == VMAP_BBMAP_BITS) // Condition is _false_
> 
> Did you actually read what I wrote?
> 
> Again: It _CANNOT_ be on the purge list because it has active mappings:
> 
> 1  X = vb_alloc()
>    ...  
>    Y = vb_alloc()
>      vb->free -= order;               // Free space goes to 0
>      if (!vb->vb_free)
> 2      list_del(vb->free_list);       // Block is removed from free list
>    ...
>    vb_free(Y)
>      vb->dirty += order;
> 3    if (vb->dirty == VMAP_BBMAP_BITS) // Condition is _false_
>                                        // because #1 $X is still mapped
>                                        // so block is _NOT_ freed and
>                                        // _NOT_ put on the purge list

So what if $X is unmapped via vb_free($X)? Is the condition satisfied
then, and can the vb be put on the purge list?

In your example above, $Y's flush is deferred, but not missed?


> 
> 4   unmap_aliases()
>      walk_free_list()           // Does not find it because of #2
>      walk_purge_list()          // Does not find it because of #3
> 
> If the resulting flush range is not covering the $Y TLBs then stale TLBs
> stay around.

OK, you mean the TLB entries for $Y will stay around after vb_free()
until the whole vb becomes dirty, and this patch fixes that. You are
right. vm_unmap_aliases() needs to flush all unmapped ranges, but in
this case it fails on $Y, while the page which is being reused still
has the old alias of $Y.

My attention was drawn to the repeated flushing of the vmap_block va on
the purge list.

By the way, you don't fix the issue that in vm_reset_perms() the direct
map range is accumulated with the vb va and purge va ranges and could
produce a flushing range containing a huge gap. Do you still plan to fix
that? I remember you said you would use an array to gather the ranges
and flush them one by one.

> 
> The xarray walk finds it and guarantees that the TLBs are gone when
> unmap_aliases() returns, which is the whole purpose of that function.
> 
> Thanks,
> 
>         tglx
>
Baoquan He May 24, 2023, 2:10 p.m. UTC | #16
On 05/24/23 at 11:52am, Thomas Gleixner wrote:
> On Wed, May 24 2023 at 17:32, Baoquan He wrote:
> > On 05/23/23 at 04:02pm, Thomas Gleixner wrote:
> >> @@ -2236,9 +2236,10 @@ static void _vm_unmap_aliases(unsigned l
> >>  	for_each_possible_cpu(cpu) {
> >>  		struct vmap_block_queue *vbq = &per_cpu(vmap_block_queue, cpu);
> >>  		struct vmap_block *vb;
> >> +		unsigned long idx;
> >>  
> >>  		rcu_read_lock();
> >
> > Do we need to remove this rcu_read_xx() pair since it marks the RCU
> > read-side critical section on vbq-free list?
> 
> And what protects the xarray lookup?

An rcu_read_lock() pair is already put around xas_find(), and each
iteration searches the xarray for the item. So we won't lose the
connection to the next element, as can happen with list addition or
deletion? I am not very sure, I could be wrong.

xa_for_each()
-->xa_for_each_start()
   -->xa_for_each_range()
      -->xa_find()
       
void *xa_find(struct xarray *xa, unsigned long *indexp,
                        unsigned long max, xa_mark_t filter)
{
...... 
        rcu_read_lock();
        do {
                if ((__force unsigned int)filter < XA_MAX_MARKS)
                        entry = xas_find_marked(&xas, max, filter);
                else
                        entry = xas_find(&xas, max);
        } while (xas_retry(&xas, entry));
        rcu_read_unlock();
 
        if (entry)
                *indexp = xas.xa_index;
        return entry;
}

> 
> >> -		list_for_each_entry_rcu(vb, &vbq->free, free_list) {
> >> +		xa_for_each(&vbq->vmap_blocks, idx, vb) {
> >>  			spin_lock(&vb->lock);
> >>  			if (vb->dirty && vb->dirty != VMAP_BBMAP_BITS) {
> >>  				unsigned long va_start = vb->va->va_start;
> >> 
>
Thomas Gleixner May 24, 2023, 2:31 p.m. UTC | #17
On Wed, May 24 2023 at 21:41, Baoquan He wrote:
> On 05/24/23 at 02:44pm, Thomas Gleixner wrote:
>> On Wed, May 24 2023 at 19:24, Baoquan He wrote:
>> Again: It _CANNOT_ be on the purge list because it has active mappings:
>> 
>> 1  X = vb_alloc()
>>    ...  
>>    Y = vb_alloc()
>>      vb->free -= order;               // Free space goes to 0
>>      if (!vb->vb_free)
>> 2      list_del(vb->free_list);       // Block is removed from free list
>>    ...
>>    vb_free(Y)
>>      vb->dirty += order;
>> 3    if (vb->dirty == VMAP_BBMAP_BITS) // Condition is _false_
>>                                        // because #1 $X is still mapped
>>                                        // so block is _NOT_ freed and
>>                                        // _NOT_ put on the purge list
>
> So what if $X is unmapped via vb_free($X)? Does the condition satisfied
> and can the vb put into purge list?

Yes, but it is _irrelevant_ for the problem at hand.

> In your above example, $Y's flush is deferred, but not missed?

Yes, but that violates the guarantee of vm_unmap_aliases():

 * The vmap/vmalloc layer lazily flushes kernel virtual mappings primarily
 * to amortize TLB flushing overheads. What this means is that any page you
 * have now, may, in a former life, have been mapped into kernel virtual
 * address by the vmap layer and so there might be some CPUs with TLB entries
 * still referencing that page (additional to the regular 1:1 kernel mapping).
 *
 * vm_unmap_aliases flushes all such lazy mappings. After it returns, we can
 * be sure that none of the pages we have control over will have any aliases
 * from the vmap layer.

>> 4   unmap_aliases()
>>      walk_free_list()           // Does not find it because of #2
>>      walk_purge_list()          // Does not find it because of #3
>> 
>> If the resulting flush range is not covering the $Y TLBs then stale TLBs
>> stay around.
>
> OK, your mean the TLB of $Y will stay around after vb_free() until
> the whole vb becomes dirty, and fix that in this patch, you are right.
> vm_unmap_aliases() may need try to flush all unmapped ranges in
> this case but failed on $Y, while the page which is being reused has the
> old alias of $Y.

vm_unmap_aliases() _must_ guarantee that the old TLBs for $Y are gone.

> My thought was attracted to the repeated flush of vmap_block va on purge
> list.
>
> By the way, you don't fix issue that in vm_reset_perms(), the direct map 
> range will be accumulated with vb va and purge va and could produce
> flushing range including huge gap, do you still plan to fix that? I
> remember you said you will use array to gather ranges and flush them one
> by one.

One thing at a time. This series is a prerequisite.

Thanks,

        tglx
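
As a minimal, hypothetical illustration of the kind of caller that depends
on this guarantee (the surrounding driver code and variable names are made
up; only vm_map_ram()/vm_unmap_ram()/vm_unmap_aliases() are the real
interfaces):

    /* Map a page temporarily through the vmap layer */
    void *va = vm_map_ram(&page, 1, NUMA_NO_NODE);

    memcpy(va, data, PAGE_SIZE);
    vm_unmap_ram(va, 1);            /* lazy: TLB entries may still exist */

    /*
     * Before the page is repurposed, e.g. before its direct map
     * permissions are changed, every stale alias must be gone:
     */
    vm_unmap_aliases();
    set_memory_ro((unsigned long)page_address(page), 1);

Without the xarray walk, the alias created through a fully utilized
vmap_block could survive the vm_unmap_aliases() call above.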
Thomas Gleixner May 24, 2023, 2:35 p.m. UTC | #18
On Wed, May 24 2023 at 22:10, Baoquan He wrote:
> On 05/24/23 at 11:52am, Thomas Gleixner wrote:
>> On Wed, May 24 2023 at 17:32, Baoquan He wrote:
>> > On 05/23/23 at 04:02pm, Thomas Gleixner wrote:
>> >> @@ -2236,9 +2236,10 @@ static void _vm_unmap_aliases(unsigned l
>> >>  	for_each_possible_cpu(cpu) {
>> >>  		struct vmap_block_queue *vbq = &per_cpu(vmap_block_queue, cpu);
>> >>  		struct vmap_block *vb;
>> >> +		unsigned long idx;
>> >>  
>> >>  		rcu_read_lock();
>> >
>> > Do we need to remove this rcu_read_xx() pair since it marks the RCU
>> > read-side critical section on vbq-free list?
>> 
>> And what protects the xarray lookup?
>
> We have put rcu_read_lock() pair around the xas_find(). And it will find
> into xarray for each iteration item. We won't lose the connection to the
> next element like list adding or deleting? not very sure, I could be
> wrong.
>
> xa_for_each()
> -->xa_for_each_start()
>    -->xa_for_each_range()
>       -->xa_find()

I know how xarray works. No need to copy the code.

rcu_read_lock() inside xa_find() protects the search, but it does not
protect the returned entry, which might go away right after xa_find()
does rcu_read_unlock().

Thanks,

        tglx
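
In other words, the outer critical section is what keeps the returned block
alive across the loop body, because free_vmap_block() releases the block
via kfree_rcu(). A condensed sketch of the pattern (comments added,
otherwise as in the patch):

    rcu_read_lock();                    /* caller-side critical section */
    xa_for_each(&vbq->vmap_blocks, idx, vb) {
            /*
             * xa_find() only holds rcu_read_lock() internally around the
             * tree walk. Without the outer section, vb could be freed via
             * free_vmap_block() -> kfree_rcu(vb, rcu_head) right here,
             * before the lock below is acquired.
             */
            spin_lock(&vb->lock);
            ...
            spin_unlock(&vb->lock);
    }
    rcu_read_unlock();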

Patch

--- a/mm/vmalloc.c
+++ b/mm/vmalloc.c
@@ -2236,9 +2236,10 @@  static void _vm_unmap_aliases(unsigned l
 	for_each_possible_cpu(cpu) {
 		struct vmap_block_queue *vbq = &per_cpu(vmap_block_queue, cpu);
 		struct vmap_block *vb;
+		unsigned long idx;
 
 		rcu_read_lock();
-		list_for_each_entry_rcu(vb, &vbq->free, free_list) {
+		xa_for_each(&vbq->vmap_blocks, idx, vb) {
 			spin_lock(&vb->lock);
 			if (vb->dirty && vb->dirty != VMAP_BBMAP_BITS) {
 				unsigned long va_start = vb->va->va_start;