Message ID | 20240102184633.748113-8-urezki@gmail.com |
---|---|
State | New |
Series | Mitigate a vmap lock contention v3 |
On Tue, 2 Jan 2024 19:46:29 +0100 Uladzislau Rezki <urezki@gmail.com> > +static void > +decay_va_pool_node(struct vmap_node *vn, bool full_decay) > +{ > + struct vmap_area *va, *nva; > + struct list_head decay_list; > + struct rb_root decay_root; > + unsigned long n_decay; > + int i; > + > + decay_root = RB_ROOT; > + INIT_LIST_HEAD(&decay_list); > + > + for (i = 0; i < MAX_VA_SIZE_PAGES; i++) { > + struct list_head tmp_list; > + > + if (list_empty(&vn->pool[i].head)) > + continue; > + > + INIT_LIST_HEAD(&tmp_list); > + > + /* Detach the pool, so no-one can access it. */ > + spin_lock(&vn->pool_lock); > + list_replace_init(&vn->pool[i].head, &tmp_list); > + spin_unlock(&vn->pool_lock); > + > + if (full_decay) > + WRITE_ONCE(vn->pool[i].len, 0); > + > + /* Decay a pool by ~25% out of left objects. */ > + n_decay = vn->pool[i].len >> 2; > + > + list_for_each_entry_safe(va, nva, &tmp_list, list) { > + list_del_init(&va->list); > + merge_or_add_vmap_area(va, &decay_root, &decay_list); > + > + if (!full_decay) { > + WRITE_ONCE(vn->pool[i].len, vn->pool[i].len - 1); > + > + if (!--n_decay) > + break; > + } > + } > + > + /* Attach the pool back if it has been partly decayed. */ > + if (!full_decay && !list_empty(&tmp_list)) { > + spin_lock(&vn->pool_lock); > + list_replace_init(&tmp_list, &vn->pool[i].head); > + spin_unlock(&vn->pool_lock); > + } Failure of working out why list_splice() was not used here in case of non-empty vn->pool[i].head, after staring ten minutes. > + } > + > + reclaim_list_global(&decay_list); > +}
On Wed, Jan 03, 2024 at 07:08:32PM +0800, Hillf Danton wrote: > On Tue, 2 Jan 2024 19:46:29 +0100 Uladzislau Rezki <urezki@gmail.com> > > +static void > > +decay_va_pool_node(struct vmap_node *vn, bool full_decay) > > +{ > > + struct vmap_area *va, *nva; > > + struct list_head decay_list; > > + struct rb_root decay_root; > > + unsigned long n_decay; > > + int i; > > + > > + decay_root = RB_ROOT; > > + INIT_LIST_HEAD(&decay_list); > > + > > + for (i = 0; i < MAX_VA_SIZE_PAGES; i++) { > > + struct list_head tmp_list; > > + > > + if (list_empty(&vn->pool[i].head)) > > + continue; > > + > > + INIT_LIST_HEAD(&tmp_list); > > + > > + /* Detach the pool, so no-one can access it. */ > > + spin_lock(&vn->pool_lock); > > + list_replace_init(&vn->pool[i].head, &tmp_list); > > + spin_unlock(&vn->pool_lock); > > + > > + if (full_decay) > > + WRITE_ONCE(vn->pool[i].len, 0); > > + > > + /* Decay a pool by ~25% out of left objects. */ > > + n_decay = vn->pool[i].len >> 2; > > + > > + list_for_each_entry_safe(va, nva, &tmp_list, list) { > > + list_del_init(&va->list); > > + merge_or_add_vmap_area(va, &decay_root, &decay_list); > > + > > + if (!full_decay) { > > + WRITE_ONCE(vn->pool[i].len, vn->pool[i].len - 1); > > + > > + if (!--n_decay) > > + break; > > + } > > + } > > + > > + /* Attach the pool back if it has been partly decayed. */ > > + if (!full_decay && !list_empty(&tmp_list)) { > > + spin_lock(&vn->pool_lock); > > + list_replace_init(&tmp_list, &vn->pool[i].head); > > + spin_unlock(&vn->pool_lock); > > + } > > Failure of working out why list_splice() was not used here in case of > non-empty vn->pool[i].head, after staring ten minutes. > The vn->pool[i].head is always empty here because we have detached it above and initialized. Concurrent decay and populate also is not possible because both is done by only one context. -- Uladzislau Rezki
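A minimal sketch of the detach/reattach pattern referred to in the answer above, using made-up names (demo_pool, demo_pool_lock) rather than the patch's own structures; it only illustrates why list_splice() is not needed under the stated single-context assumption.

```c
#include <linux/list.h>
#include <linux/spinlock.h>

static LIST_HEAD(demo_pool);
static DEFINE_SPINLOCK(demo_pool_lock);

static void demo_decay(void)
{
	LIST_HEAD(tmp_list);

	/* Detach: demo_pool is left empty and re-initialized. */
	spin_lock(&demo_pool_lock);
	list_replace_init(&demo_pool, &tmp_list);
	spin_unlock(&demo_pool_lock);

	/* ... drop some entries from tmp_list without holding the lock ... */

	/*
	 * Reattach the remainder. The decay runs in a single context and
	 * nothing repopulates demo_pool in between, so the head is still
	 * empty here and list_replace_init() is sufficient; list_splice()
	 * would only matter if demo_pool could have gained new entries.
	 */
	if (!list_empty(&tmp_list)) {
		spin_lock(&demo_pool_lock);
		list_replace_init(&tmp_list, &demo_pool);
		spin_unlock(&demo_pool_lock);
	}
}
```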
On Tue, Jan 02, 2024 at 07:46:29PM +0100, Uladzislau Rezki (Sony) wrote:
> Concurrent access to a global vmap space is a bottle-neck.
> We can simulate a high contention by running a vmalloc test
> suite.
>
> To address it, introduce an effective vmap node logic. Each
> node behaves as independent entity. When a node is accessed
> it serves a request directly(if possible) from its pool.
>
> This model has a size based pool for requests, i.e. pools are
> serialized and populated based on object size and real demand.
> A maximum object size that pool can handle is set to 256 pages.
>
> This technique reduces a pressure on the global vmap lock.
>
> Signed-off-by: Uladzislau Rezki (Sony) <urezki@gmail.com>

Why not use a llist for this? That gets rid of the need for a
new pool_lock altogether...

Cheers,

Dave.
On Thu, Jan 11, 2024 at 08:02:16PM +1100, Dave Chinner wrote:
> On Tue, Jan 02, 2024 at 07:46:29PM +0100, Uladzislau Rezki (Sony) wrote:
> > Concurrent access to a global vmap space is a bottle-neck.
> > We can simulate a high contention by running a vmalloc test
> > suite.
> >
> > To address it, introduce an effective vmap node logic. Each
> > node behaves as independent entity. When a node is accessed
> > it serves a request directly(if possible) from its pool.
> >
> > This model has a size based pool for requests, i.e. pools are
> > serialized and populated based on object size and real demand.
> > A maximum object size that pool can handle is set to 256 pages.
> >
> > This technique reduces a pressure on the global vmap lock.
> >
> > Signed-off-by: Uladzislau Rezki (Sony) <urezki@gmail.com>
>
> Why not use a llist for this? That gets rid of the need for a
> new pool_lock altogether...
>
Initially i used the llist. I have changed it because i keep track
of objects per a pool to decay it later. I do not find these locks
as contented one therefore i did not think much.

Anyway, i will have a look at this to see if llist is easy to go with
or not. If so i will send out a separate patch.

Thanks!

--
Uladzislau Rezki
On Thu, Jan 11, 2024 at 04:54:48PM +0100, Uladzislau Rezki wrote: > On Thu, Jan 11, 2024 at 08:02:16PM +1100, Dave Chinner wrote: > > On Tue, Jan 02, 2024 at 07:46:29PM +0100, Uladzislau Rezki (Sony) wrote: > > > Concurrent access to a global vmap space is a bottle-neck. > > > We can simulate a high contention by running a vmalloc test > > > suite. > > > > > > To address it, introduce an effective vmap node logic. Each > > > node behaves as independent entity. When a node is accessed > > > it serves a request directly(if possible) from its pool. > > > > > > This model has a size based pool for requests, i.e. pools are > > > serialized and populated based on object size and real demand. > > > A maximum object size that pool can handle is set to 256 pages. > > > > > > This technique reduces a pressure on the global vmap lock. > > > > > > Signed-off-by: Uladzislau Rezki (Sony) <urezki@gmail.com> > > > > Why not use a llist for this? That gets rid of the need for a > > new pool_lock altogether... > > > Initially i used the llist. I have changed it because i keep track > of objects per a pool to decay it later. I do not find these locks > as contented one therefore i did not think much. Ok. I've used llist and an atomic counter to track the list length in the past. But is the list length even necessary? It seems to me that it is only used by the shrinker to determine how many objects are on the lists for scanning, and I'm not sure that's entirely necessary given the way the current global shrinker works (i.e. completely unfair to low numbered nodes due to scan loop start bias). > Anyway, i will have a look at this to see if llist is easy to go with > or not. If so i will send out a separate patch. Sounds good, it was just something that crossed my mind given the pattern of "producer adds single items, consumer detaches entire list, processes it and reattaches remainder" is a perfect match for the llist structure. Cheers, Dave.
On Fri, Jan 12, 2024 at 07:37:36AM +1100, Dave Chinner wrote: > On Thu, Jan 11, 2024 at 04:54:48PM +0100, Uladzislau Rezki wrote: > > On Thu, Jan 11, 2024 at 08:02:16PM +1100, Dave Chinner wrote: > > > On Tue, Jan 02, 2024 at 07:46:29PM +0100, Uladzislau Rezki (Sony) wrote: > > > > Concurrent access to a global vmap space is a bottle-neck. > > > > We can simulate a high contention by running a vmalloc test > > > > suite. > > > > > > > > To address it, introduce an effective vmap node logic. Each > > > > node behaves as independent entity. When a node is accessed > > > > it serves a request directly(if possible) from its pool. > > > > > > > > This model has a size based pool for requests, i.e. pools are > > > > serialized and populated based on object size and real demand. > > > > A maximum object size that pool can handle is set to 256 pages. > > > > > > > > This technique reduces a pressure on the global vmap lock. > > > > > > > > Signed-off-by: Uladzislau Rezki (Sony) <urezki@gmail.com> > > > > > > Why not use a llist for this? That gets rid of the need for a > > > new pool_lock altogether... > > > > > Initially i used the llist. I have changed it because i keep track > > of objects per a pool to decay it later. I do not find these locks > > as contented one therefore i did not think much. > > Ok. I've used llist and an atomic counter to track the list length > in the past. > > But is the list length even necessary? It seems to me that it is > only used by the shrinker to determine how many objects are on the > lists for scanning, and I'm not sure that's entirely necessary given > the way the current global shrinker works (i.e. completely unfair to > low numbered nodes due to scan loop start bias). > I use the length to decay pools by certain percentage, currently it is 25%, so i need to know number of objects. It is done in the purge path. As for shrinker, once it hits us we drain pools entirely. > > Anyway, i will have a look at this to see if llist is easy to go with > > or not. If so i will send out a separate patch. > > Sounds good, it was just something that crossed my mind given the > pattern of "producer adds single items, consumer detaches entire > list, processes it and reattaches remainder" is a perfect match for > the llist structure. > The llist_del_first() has to be serialized. For this purpose a per-cpu pool would work or kind of "in_use" atomic that protects concurrent removing. If we detach entire llist, then we need to keep track of last node to add it later as a "batch" to already existing/populated list. Thanks four looking! -- Uladzislau Rezki
On Fri, Jan 12, 2024 at 01:18:27PM +0100, Uladzislau Rezki wrote: > On Fri, Jan 12, 2024 at 07:37:36AM +1100, Dave Chinner wrote: > > On Thu, Jan 11, 2024 at 04:54:48PM +0100, Uladzislau Rezki wrote: > > > On Thu, Jan 11, 2024 at 08:02:16PM +1100, Dave Chinner wrote: > > > > On Tue, Jan 02, 2024 at 07:46:29PM +0100, Uladzislau Rezki (Sony) wrote: > > > > > Concurrent access to a global vmap space is a bottle-neck. > > > > > We can simulate a high contention by running a vmalloc test > > > > > suite. > > > > > > > > > > To address it, introduce an effective vmap node logic. Each > > > > > node behaves as independent entity. When a node is accessed > > > > > it serves a request directly(if possible) from its pool. > > > > > > > > > > This model has a size based pool for requests, i.e. pools are > > > > > serialized and populated based on object size and real demand. > > > > > A maximum object size that pool can handle is set to 256 pages. > > > > > > > > > > This technique reduces a pressure on the global vmap lock. > > > > > > > > > > Signed-off-by: Uladzislau Rezki (Sony) <urezki@gmail.com> > > > > > > > > Why not use a llist for this? That gets rid of the need for a > > > > new pool_lock altogether... > > > > > > > Initially i used the llist. I have changed it because i keep track > > > of objects per a pool to decay it later. I do not find these locks > > > as contented one therefore i did not think much. > > > > Ok. I've used llist and an atomic counter to track the list length > > in the past. > > > > But is the list length even necessary? It seems to me that it is > > only used by the shrinker to determine how many objects are on the > > lists for scanning, and I'm not sure that's entirely necessary given > > the way the current global shrinker works (i.e. completely unfair to > > low numbered nodes due to scan loop start bias). > > > I use the length to decay pools by certain percentage, currently it is > 25%, so i need to know number of objects. It is done in the purge path. > As for shrinker, once it hits us we drain pools entirely. Why does purge need to be different to shrinking? But, regardless, you can still use llist with an atomic counter to do this - there is no need for a spin lock at all. > > > Anyway, i will have a look at this to see if llist is easy to go with > > > or not. If so i will send out a separate patch. > > > > Sounds good, it was just something that crossed my mind given the > > pattern of "producer adds single items, consumer detaches entire > > list, processes it and reattaches remainder" is a perfect match for > > the llist structure. > > > The llist_del_first() has to be serialized. For this purpose a per-cpu > pool would work or kind of "in_use" atomic that protects concurrent > removing. So don't use llist_del_first(). > If we detach entire llist, then we need to keep track of last node > to add it later as a "batch" to already existing/populated list. Why? I haven't see any need for ordering these lists which would requiring strict tail-add ordered semantics. Cheers, Dave.
On Wed, Jan 17, 2024 at 09:12:26AM +1100, Dave Chinner wrote: > On Fri, Jan 12, 2024 at 01:18:27PM +0100, Uladzislau Rezki wrote: > > On Fri, Jan 12, 2024 at 07:37:36AM +1100, Dave Chinner wrote: > > > On Thu, Jan 11, 2024 at 04:54:48PM +0100, Uladzislau Rezki wrote: > > > > On Thu, Jan 11, 2024 at 08:02:16PM +1100, Dave Chinner wrote: > > > > > On Tue, Jan 02, 2024 at 07:46:29PM +0100, Uladzislau Rezki (Sony) wrote: > > > > > > Concurrent access to a global vmap space is a bottle-neck. > > > > > > We can simulate a high contention by running a vmalloc test > > > > > > suite. > > > > > > > > > > > > To address it, introduce an effective vmap node logic. Each > > > > > > node behaves as independent entity. When a node is accessed > > > > > > it serves a request directly(if possible) from its pool. > > > > > > > > > > > > This model has a size based pool for requests, i.e. pools are > > > > > > serialized and populated based on object size and real demand. > > > > > > A maximum object size that pool can handle is set to 256 pages. > > > > > > > > > > > > This technique reduces a pressure on the global vmap lock. > > > > > > > > > > > > Signed-off-by: Uladzislau Rezki (Sony) <urezki@gmail.com> > > > > > > > > > > Why not use a llist for this? That gets rid of the need for a > > > > > new pool_lock altogether... > > > > > > > > > Initially i used the llist. I have changed it because i keep track > > > > of objects per a pool to decay it later. I do not find these locks > > > > as contented one therefore i did not think much. > > > > > > Ok. I've used llist and an atomic counter to track the list length > > > in the past. > > > > > > But is the list length even necessary? It seems to me that it is > > > only used by the shrinker to determine how many objects are on the > > > lists for scanning, and I'm not sure that's entirely necessary given > > > the way the current global shrinker works (i.e. completely unfair to > > > low numbered nodes due to scan loop start bias). > > > > > I use the length to decay pools by certain percentage, currently it is > > 25%, so i need to know number of objects. It is done in the purge path. > > As for shrinker, once it hits us we drain pools entirely. > > Why does purge need to be different to shrinking? > > But, regardless, you can still use llist with an atomic counter to > do this - there is no need for a spin lock at all. > As i pointed earlier, i will have a look at it. > > > > Anyway, i will have a look at this to see if llist is easy to go with > > > > or not. If so i will send out a separate patch. > > > > > > Sounds good, it was just something that crossed my mind given the > > > pattern of "producer adds single items, consumer detaches entire > > > list, processes it and reattaches remainder" is a perfect match for > > > the llist structure. > > > > > The llist_del_first() has to be serialized. For this purpose a per-cpu > > pool would work or kind of "in_use" atomic that protects concurrent > > removing. > > So don't use llist_del_first(). > > > If we detach entire llist, then we need to keep track of last node > > to add it later as a "batch" to already existing/populated list. > > Why? I haven't see any need for ordering these lists which would > requiring strict tail-add ordered semantics. > I mean the following: 1. first = llist_del_all(&example); 2. last = llist_reverse_order(first); 4. va = __llist_del_first(first); /* * "example" might not be empty, use the batch. Otherwise * we loose the entries "example" pointed to. */ 3. 
llist_add_batch(first, last, &example);

--
Uladzislau Rezki
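For reference, the pattern being debated above can be sketched roughly as follows. This is only an illustration of the suggestion (an llist plus an atomic length counter), not code from this series; demo_node, demo_va, demo_release() and the other names are made up for the example.

```c
#include <linux/llist.h>
#include <linux/atomic.h>
#include <linux/slab.h>

struct demo_va {
	struct llist_node lnode;
	/* ... payload ... */
};

struct demo_node {
	struct llist_head pool;	/* lockless pool of free objects */
	atomic_t len;		/* tracked length, used by the decay logic */
};

static void demo_release(struct demo_va *va)
{
	kfree(va);	/* stand-in for merging the VA back into the global tree */
}

/* Producer side: return one object to the pool, no spinlock required. */
static void demo_pool_add(struct demo_node *n, struct demo_va *va)
{
	llist_add(&va->lnode, &n->pool);
	atomic_inc(&n->len);
}

/*
 * Consumer side: detach the whole list in one atomic xchg, drop part of
 * it, then reattach the survivors as a single batch. Only
 * llist_del_first() needs external serialization; llist_del_all() and
 * llist_add() may race freely, so no pool_lock is needed here.
 */
static void demo_pool_decay(struct demo_node *n, unsigned int nr_to_drop)
{
	struct llist_node *head, *keep_first = NULL, *keep_last = NULL;
	struct demo_va *va, *tmp;

	/* Returns entries newest-first; llist_reverse_order() could restore
	 * FIFO order if that mattered, which it does not here. */
	head = llist_del_all(&n->pool);
	if (!head)
		return;

	llist_for_each_entry_safe(va, tmp, head, lnode) {
		if (nr_to_drop) {
			nr_to_drop--;
			atomic_dec(&n->len);
			demo_release(va);
			continue;
		}

		/* Survivor: push it onto a private chain, remembering the tail. */
		va->lnode.next = keep_first;
		keep_first = &va->lnode;
		if (!keep_last)
			keep_last = &va->lnode;
	}

	/*
	 * Reattach the remainder in one go. Concurrent demo_pool_add()
	 * calls are not lost: they simply end up linked behind the batch,
	 * and no particular ordering is required anyway.
	 */
	if (keep_first)
		llist_add_batch(keep_first, keep_last, &n->pool);
}
```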
On 01/02/24 at 07:46pm, Uladzislau Rezki (Sony) wrote: ...... > +static struct vmap_area * > +node_alloc(unsigned long size, unsigned long align, > + unsigned long vstart, unsigned long vend, > + unsigned long *addr, unsigned int *vn_id) > +{ > + struct vmap_area *va; > + > + *vn_id = 0; > + *addr = vend; > + > + /* > + * Fallback to a global heap if not vmalloc or there > + * is only one node. > + */ > + if (vstart != VMALLOC_START || vend != VMALLOC_END || > + nr_vmap_nodes == 1) > + return NULL; > + > + *vn_id = raw_smp_processor_id() % nr_vmap_nodes; > + va = node_pool_del_va(id_to_node(*vn_id), size, align, vstart, vend); > + *vn_id = encode_vn_id(*vn_id); > + > + if (va) > + *addr = va->va_start; > + > + return va; > +} > + > /* > * Allocate a region of KVA of the specified size and alignment, within the > * vstart and vend. > @@ -1637,6 +1807,7 @@ static struct vmap_area *alloc_vmap_area(unsigned long size, > struct vmap_area *va; > unsigned long freed; > unsigned long addr; > + unsigned int vn_id; > int purged = 0; > int ret; > > @@ -1647,11 +1818,23 @@ static struct vmap_area *alloc_vmap_area(unsigned long size, > return ERR_PTR(-EBUSY); > > might_sleep(); > - gfp_mask = gfp_mask & GFP_RECLAIM_MASK; > > - va = kmem_cache_alloc_node(vmap_area_cachep, gfp_mask, node); > - if (unlikely(!va)) > - return ERR_PTR(-ENOMEM); > + /* > + * If a VA is obtained from a global heap(if it fails here) > + * it is anyway marked with this "vn_id" so it is returned > + * to this pool's node later. Such way gives a possibility > + * to populate pools based on users demand. > + * > + * On success a ready to go VA is returned. > + */ > + va = node_alloc(size, align, vstart, vend, &addr, &vn_id); Sorry for late checking. Here, if no available va got, e.g a empty vp, still we will get an effective vn_id with the current cpu_id for VMALLOC region allocation request. > + if (!va) { > + gfp_mask = gfp_mask & GFP_RECLAIM_MASK; > + > + va = kmem_cache_alloc_node(vmap_area_cachep, gfp_mask, node); > + if (unlikely(!va)) > + return ERR_PTR(-ENOMEM); > + } > > /* > * Only scan the relevant parts containing pointers to other objects > @@ -1660,10 +1843,12 @@ static struct vmap_area *alloc_vmap_area(unsigned long size, > kmemleak_scan_area(&va->rb_node, SIZE_MAX, gfp_mask); > > retry: > - preload_this_cpu_lock(&free_vmap_area_lock, gfp_mask, node); > - addr = __alloc_vmap_area(&free_vmap_area_root, &free_vmap_area_list, > - size, align, vstart, vend); > - spin_unlock(&free_vmap_area_lock); > + if (addr == vend) { > + preload_this_cpu_lock(&free_vmap_area_lock, gfp_mask, node); > + addr = __alloc_vmap_area(&free_vmap_area_root, &free_vmap_area_list, > + size, align, vstart, vend); Then, here, we will get an available va from random location, but its vn_id is from the current cpu. Then in purge_vmap_node(), we will decode the vn_id stored in va->flags, and add the relevant va into vn->pool[] according to the vn_id. The worst case could be most of va in vn->pool[] are not corresponding to the vmap_nodes they belongs to. It doesn't matter? Should we adjust the code of vn_id assigning in node_alloc(), or I missed anything? 
> + spin_unlock(&free_vmap_area_lock); > + } > > trace_alloc_vmap_area(addr, size, align, vstart, vend, addr == vend); > > @@ -1677,7 +1862,7 @@ static struct vmap_area *alloc_vmap_area(unsigned long size, > va->va_start = addr; > va->va_end = addr + size; > va->vm = NULL; > - va->flags = va_flags; > + va->flags = (va_flags | vn_id); > > vn = addr_to_node(va->va_start); > > @@ -1770,63 +1955,135 @@ static DEFINE_MUTEX(vmap_purge_lock); > static void purge_fragmented_blocks_allcpus(void); > static cpumask_t purge_nodes; > > -/* > - * Purges all lazily-freed vmap areas. > - */ > -static unsigned long > -purge_vmap_node(struct vmap_node *vn) > +static void > +reclaim_list_global(struct list_head *head) > { > - unsigned long num_purged_areas = 0; > - struct vmap_area *va, *n_va; > + struct vmap_area *va, *n; > > - if (list_empty(&vn->purge_list)) > - return 0; > + if (list_empty(head)) > + return; > > spin_lock(&free_vmap_area_lock); > + list_for_each_entry_safe(va, n, head, list) > + merge_or_add_vmap_area_augment(va, > + &free_vmap_area_root, &free_vmap_area_list); > + spin_unlock(&free_vmap_area_lock); > +} > + > +static void > +decay_va_pool_node(struct vmap_node *vn, bool full_decay) > +{ > + struct vmap_area *va, *nva; > + struct list_head decay_list; > + struct rb_root decay_root; > + unsigned long n_decay; > + int i; > + > + decay_root = RB_ROOT; > + INIT_LIST_HEAD(&decay_list); > + > + for (i = 0; i < MAX_VA_SIZE_PAGES; i++) { > + struct list_head tmp_list; > + > + if (list_empty(&vn->pool[i].head)) > + continue; > + > + INIT_LIST_HEAD(&tmp_list); > + > + /* Detach the pool, so no-one can access it. */ > + spin_lock(&vn->pool_lock); > + list_replace_init(&vn->pool[i].head, &tmp_list); > + spin_unlock(&vn->pool_lock); > + > + if (full_decay) > + WRITE_ONCE(vn->pool[i].len, 0); > + > + /* Decay a pool by ~25% out of left objects. */ > + n_decay = vn->pool[i].len >> 2; > + > + list_for_each_entry_safe(va, nva, &tmp_list, list) { > + list_del_init(&va->list); > + merge_or_add_vmap_area(va, &decay_root, &decay_list); > + > + if (!full_decay) { > + WRITE_ONCE(vn->pool[i].len, vn->pool[i].len - 1); > + > + if (!--n_decay) > + break; > + } > + } > + > + /* Attach the pool back if it has been partly decayed. */ > + if (!full_decay && !list_empty(&tmp_list)) { > + spin_lock(&vn->pool_lock); > + list_replace_init(&tmp_list, &vn->pool[i].head); > + spin_unlock(&vn->pool_lock); > + } > + } > + > + reclaim_list_global(&decay_list); > +} > + > +static void purge_vmap_node(struct work_struct *work) > +{ > + struct vmap_node *vn = container_of(work, > + struct vmap_node, purge_work); > + struct vmap_area *va, *n_va; > + LIST_HEAD(local_list); > + > + vn->nr_purged = 0; > + > list_for_each_entry_safe(va, n_va, &vn->purge_list, list) { > unsigned long nr = (va->va_end - va->va_start) >> PAGE_SHIFT; > unsigned long orig_start = va->va_start; > unsigned long orig_end = va->va_end; > + unsigned int vn_id = decode_vn_id(va->flags); > > - /* > - * Finally insert or merge lazily-freed area. It is > - * detached and there is no need to "unlink" it from > - * anything. 
> - */ > - va = merge_or_add_vmap_area_augment(va, &free_vmap_area_root, > - &free_vmap_area_list); > - > - if (!va) > - continue; > + list_del_init(&va->list); > > if (is_vmalloc_or_module_addr((void *)orig_start)) > kasan_release_vmalloc(orig_start, orig_end, > va->va_start, va->va_end); > > atomic_long_sub(nr, &vmap_lazy_nr); > - num_purged_areas++; > + vn->nr_purged++; > + > + if (is_vn_id_valid(vn_id) && !vn->skip_populate) > + if (node_pool_add_va(vn, va)) > + continue; > + > + /* Go back to global. */ > + list_add(&va->list, &local_list); > } > - spin_unlock(&free_vmap_area_lock); > > - return num_purged_areas; > + reclaim_list_global(&local_list); > } > > /* > * Purges all lazily-freed vmap areas. > */ > -static bool __purge_vmap_area_lazy(unsigned long start, unsigned long end) > +static bool __purge_vmap_area_lazy(unsigned long start, unsigned long end, > + bool full_pool_decay) > { > - unsigned long num_purged_areas = 0; > + unsigned long nr_purged_areas = 0; > + unsigned int nr_purge_helpers; > + unsigned int nr_purge_nodes; > struct vmap_node *vn; > int i; > > lockdep_assert_held(&vmap_purge_lock); > + > + /* > + * Use cpumask to mark which node has to be processed. > + */ > purge_nodes = CPU_MASK_NONE; > > for (i = 0; i < nr_vmap_nodes; i++) { > vn = &vmap_nodes[i]; > > INIT_LIST_HEAD(&vn->purge_list); > + vn->skip_populate = full_pool_decay; > + decay_va_pool_node(vn, full_pool_decay); > > if (RB_EMPTY_ROOT(&vn->lazy.root)) > continue; > @@ -1845,17 +2102,45 @@ static bool __purge_vmap_area_lazy(unsigned long start, unsigned long end) > cpumask_set_cpu(i, &purge_nodes); > } > > - if (cpumask_weight(&purge_nodes) > 0) { > + nr_purge_nodes = cpumask_weight(&purge_nodes); > + if (nr_purge_nodes > 0) { > flush_tlb_kernel_range(start, end); > > + /* One extra worker is per a lazy_max_pages() full set minus one. 
*/ > + nr_purge_helpers = atomic_long_read(&vmap_lazy_nr) / lazy_max_pages(); > + nr_purge_helpers = clamp(nr_purge_helpers, 1U, nr_purge_nodes) - 1; > + > for_each_cpu(i, &purge_nodes) { > - vn = &nodes[i]; > - num_purged_areas += purge_vmap_node(vn); > + vn = &vmap_nodes[i]; > + > + if (nr_purge_helpers > 0) { > + INIT_WORK(&vn->purge_work, purge_vmap_node); > + > + if (cpumask_test_cpu(i, cpu_online_mask)) > + schedule_work_on(i, &vn->purge_work); > + else > + schedule_work(&vn->purge_work); > + > + nr_purge_helpers--; > + } else { > + vn->purge_work.func = NULL; > + purge_vmap_node(&vn->purge_work); > + nr_purged_areas += vn->nr_purged; > + } > + } > + > + for_each_cpu(i, &purge_nodes) { > + vn = &vmap_nodes[i]; > + > + if (vn->purge_work.func) { > + flush_work(&vn->purge_work); > + nr_purged_areas += vn->nr_purged; > + } > } > } > > - trace_purge_vmap_area_lazy(start, end, num_purged_areas); > - return num_purged_areas > 0; > + trace_purge_vmap_area_lazy(start, end, nr_purged_areas); > + return nr_purged_areas > 0; > } > > /* > @@ -1866,14 +2151,14 @@ static void reclaim_and_purge_vmap_areas(void) > { > mutex_lock(&vmap_purge_lock); > purge_fragmented_blocks_allcpus(); > - __purge_vmap_area_lazy(ULONG_MAX, 0); > + __purge_vmap_area_lazy(ULONG_MAX, 0, true); > mutex_unlock(&vmap_purge_lock); > } > > static void drain_vmap_area_work(struct work_struct *work) > { > mutex_lock(&vmap_purge_lock); > - __purge_vmap_area_lazy(ULONG_MAX, 0); > + __purge_vmap_area_lazy(ULONG_MAX, 0, false); > mutex_unlock(&vmap_purge_lock); > } > > @@ -1884,9 +2169,10 @@ static void drain_vmap_area_work(struct work_struct *work) > */ > static void free_vmap_area_noflush(struct vmap_area *va) > { > - struct vmap_node *vn = addr_to_node(va->va_start); > unsigned long nr_lazy_max = lazy_max_pages(); > unsigned long va_start = va->va_start; > + unsigned int vn_id = decode_vn_id(va->flags); > + struct vmap_node *vn; > unsigned long nr_lazy; > > if (WARN_ON_ONCE(!list_empty(&va->list))) > @@ -1896,10 +2182,14 @@ static void free_vmap_area_noflush(struct vmap_area *va) > PAGE_SHIFT, &vmap_lazy_nr); > > /* > - * Merge or place it to the purge tree/list. > + * If it was request by a certain node we would like to > + * return it to that node, i.e. its pool for later reuse. > */ > + vn = is_vn_id_valid(vn_id) ? > + id_to_node(vn_id):addr_to_node(va->va_start); > + > spin_lock(&vn->lazy.lock); > - merge_or_add_vmap_area(va, &vn->lazy.root, &vn->lazy.head); > + insert_vmap_area(va, &vn->lazy.root, &vn->lazy.head); > spin_unlock(&vn->lazy.lock); > > trace_free_vmap_area_noflush(va_start, nr_lazy, nr_lazy_max); > @@ -2408,7 +2698,7 @@ static void _vm_unmap_aliases(unsigned long start, unsigned long end, int flush) > } > free_purged_blocks(&purge_list); > > - if (!__purge_vmap_area_lazy(start, end) && flush) > + if (!__purge_vmap_area_lazy(start, end, false) && flush) > flush_tlb_kernel_range(start, end); > mutex_unlock(&vmap_purge_lock); > } > @@ -4576,7 +4866,7 @@ static void vmap_init_free_space(void) > static void vmap_init_nodes(void) > { > struct vmap_node *vn; > - int i; > + int i, j; > > for (i = 0; i < nr_vmap_nodes; i++) { > vn = &vmap_nodes[i]; > @@ -4587,6 +4877,13 @@ static void vmap_init_nodes(void) > vn->lazy.root = RB_ROOT; > INIT_LIST_HEAD(&vn->lazy.head); > spin_lock_init(&vn->lazy.lock); > + > + for (j = 0; j < MAX_VA_SIZE_PAGES; j++) { > + INIT_LIST_HEAD(&vn->pool[j].head); > + WRITE_ONCE(vn->pool[j].len, 0); > + } > + > + spin_lock_init(&vn->pool_lock); > } > } > > -- > 2.39.2 >
On Thu, Feb 08, 2024 at 08:25:23AM +0800, Baoquan He wrote: > On 01/02/24 at 07:46pm, Uladzislau Rezki (Sony) wrote: > ...... > > +static struct vmap_area * > > +node_alloc(unsigned long size, unsigned long align, > > + unsigned long vstart, unsigned long vend, > > + unsigned long *addr, unsigned int *vn_id) > > +{ > > + struct vmap_area *va; > > + > > + *vn_id = 0; > > + *addr = vend; > > + > > + /* > > + * Fallback to a global heap if not vmalloc or there > > + * is only one node. > > + */ > > + if (vstart != VMALLOC_START || vend != VMALLOC_END || > > + nr_vmap_nodes == 1) > > + return NULL; > > + > > + *vn_id = raw_smp_processor_id() % nr_vmap_nodes; > > + va = node_pool_del_va(id_to_node(*vn_id), size, align, vstart, vend); > > + *vn_id = encode_vn_id(*vn_id); > > + > > + if (va) > > + *addr = va->va_start; > > + > > + return va; > > +} > > + > > /* > > * Allocate a region of KVA of the specified size and alignment, within the > > * vstart and vend. > > @@ -1637,6 +1807,7 @@ static struct vmap_area *alloc_vmap_area(unsigned long size, > > struct vmap_area *va; > > unsigned long freed; > > unsigned long addr; > > + unsigned int vn_id; > > int purged = 0; > > int ret; > > > > @@ -1647,11 +1818,23 @@ static struct vmap_area *alloc_vmap_area(unsigned long size, > > return ERR_PTR(-EBUSY); > > > > might_sleep(); > > - gfp_mask = gfp_mask & GFP_RECLAIM_MASK; > > > > - va = kmem_cache_alloc_node(vmap_area_cachep, gfp_mask, node); > > - if (unlikely(!va)) > > - return ERR_PTR(-ENOMEM); > > + /* > > + * If a VA is obtained from a global heap(if it fails here) > > + * it is anyway marked with this "vn_id" so it is returned > > + * to this pool's node later. Such way gives a possibility > > + * to populate pools based on users demand. > > + * > > + * On success a ready to go VA is returned. > > + */ > > + va = node_alloc(size, align, vstart, vend, &addr, &vn_id); > > Sorry for late checking. > No problem :) > Here, if no available va got, e.g a empty vp, still we will get an > effective vn_id with the current cpu_id for VMALLOC region allocation > request. > > > + if (!va) { > > + gfp_mask = gfp_mask & GFP_RECLAIM_MASK; > > + > > + va = kmem_cache_alloc_node(vmap_area_cachep, gfp_mask, node); > > + if (unlikely(!va)) > > + return ERR_PTR(-ENOMEM); > > + } > > > > /* > > * Only scan the relevant parts containing pointers to other objects > > @@ -1660,10 +1843,12 @@ static struct vmap_area *alloc_vmap_area(unsigned long size, > > kmemleak_scan_area(&va->rb_node, SIZE_MAX, gfp_mask); > > > > retry: > > - preload_this_cpu_lock(&free_vmap_area_lock, gfp_mask, node); > > - addr = __alloc_vmap_area(&free_vmap_area_root, &free_vmap_area_list, > > - size, align, vstart, vend); > > - spin_unlock(&free_vmap_area_lock); > > + if (addr == vend) { > > + preload_this_cpu_lock(&free_vmap_area_lock, gfp_mask, node); > > + addr = __alloc_vmap_area(&free_vmap_area_root, &free_vmap_area_list, > > + size, align, vstart, vend); > > Then, here, we will get an available va from random location, but its > vn_id is from the current cpu. > > Then in purge_vmap_node(), we will decode the vn_id stored in va->flags, > and add the relevant va into vn->pool[] according to the vn_id. The > worst case could be most of va in vn->pool[] are not corresponding to > the vmap_nodes they belongs to. It doesn't matter? > We do not do any "in-front" population, instead it behaves as a cache miss when you need to access a main memmory to do a load and then keep the data in a cache. Same here. 
As a first step, for a CPU it always a miss, thus a VA is obtained from the global heap and is marked with a current CPU that makes an attempt to alloc. Later on that CPU/node is populated by that marked VA. So second alloc on same CPU goes via fast path. VAs are populated based on demand and those nodes which do allocations. > Should we adjust the code of vn_id assigning in node_alloc(), or I missed anything? Now it is open-coded. Some further refactoring should be done. Agree. -- Uladzislau Rezki
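A compact sketch of the "miss then populate" round trip described above. It is a simplified paraphrase of the quoted patch, not a literal excerpt: locking, error handling and rb-tree insertion are omitted, and demo_alloc_va(), demo_free_va(), alloc_from_global_heap() and queue_for_lazy_purge() are assumed names introduced only for illustration.

```c
/* Assumed helpers, not part of the patch: */
static struct vmap_area *alloc_from_global_heap(unsigned long size,
		unsigned long align, unsigned long vstart, unsigned long vend);
static void queue_for_lazy_purge(struct vmap_node *vn, struct vmap_area *va);

static struct vmap_area *demo_alloc_va(unsigned long size, unsigned long align,
				       unsigned long vstart, unsigned long vend)
{
	unsigned int node_id = raw_smp_processor_id() % nr_vmap_nodes;
	struct vmap_area *va;

	/* Fast path: reuse a cached VA of this size from the CPU's node. */
	va = node_pool_del_va(id_to_node(node_id), size, align, vstart, vend);
	if (!va) {
		/* Miss: fall back to the global path, as before the series. */
		va = alloc_from_global_heap(size, align, vstart, vend);
	}

	/* Either way, tag the VA with the node that requested it. */
	va->flags |= encode_vn_id(node_id);
	return va;
}

static void demo_free_va(struct vmap_area *va)
{
	unsigned int node_id = decode_vn_id(va->flags);
	struct vmap_node *vn = is_vn_id_valid(node_id) ?
			id_to_node(node_id) : addr_to_node(va->va_start);

	/*
	 * The lazy-free/purge path eventually hands the VA to
	 * node_pool_add_va(vn, ...), so a second request of this size on
	 * the same CPU is served from the local pool.
	 */
	queue_for_lazy_purge(vn, va);
}
```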
On 01/02/24 at 07:46pm, Uladzislau Rezki (Sony) wrote: .....snip... > +static void > +decay_va_pool_node(struct vmap_node *vn, bool full_decay) > +{ > + struct vmap_area *va, *nva; > + struct list_head decay_list; > + struct rb_root decay_root; > + unsigned long n_decay; > + int i; > + > + decay_root = RB_ROOT; > + INIT_LIST_HEAD(&decay_list); > + > + for (i = 0; i < MAX_VA_SIZE_PAGES; i++) { > + struct list_head tmp_list; > + > + if (list_empty(&vn->pool[i].head)) > + continue; > + > + INIT_LIST_HEAD(&tmp_list); > + > + /* Detach the pool, so no-one can access it. */ > + spin_lock(&vn->pool_lock); > + list_replace_init(&vn->pool[i].head, &tmp_list); > + spin_unlock(&vn->pool_lock); > + > + if (full_decay) > + WRITE_ONCE(vn->pool[i].len, 0); > + > + /* Decay a pool by ~25% out of left objects. */ This isn't true if the pool has less than 4 objects. If there are 3 objects, n_decay = 0. > + n_decay = vn->pool[i].len >> 2; > + > + list_for_each_entry_safe(va, nva, &tmp_list, list) { > + list_del_init(&va->list); > + merge_or_add_vmap_area(va, &decay_root, &decay_list); > + > + if (!full_decay) { > + WRITE_ONCE(vn->pool[i].len, vn->pool[i].len - 1); > + > + if (!--n_decay) > + break; Here, --n_decay will make n_decay 0xffffffffffffffff, then all left objects are reclaimed. > + } > + } > + > + /* Attach the pool back if it has been partly decayed. */ > + if (!full_decay && !list_empty(&tmp_list)) { > + spin_lock(&vn->pool_lock); > + list_replace_init(&tmp_list, &vn->pool[i].head); > + spin_unlock(&vn->pool_lock); > + } > + } > + > + reclaim_list_global(&decay_list); > +} ......snip
On Wed, Feb 28, 2024 at 05:48:53PM +0800, Baoquan He wrote: > On 01/02/24 at 07:46pm, Uladzislau Rezki (Sony) wrote: > .....snip... > > +static void > > +decay_va_pool_node(struct vmap_node *vn, bool full_decay) > > +{ > > + struct vmap_area *va, *nva; > > + struct list_head decay_list; > > + struct rb_root decay_root; > > + unsigned long n_decay; > > + int i; > > + > > + decay_root = RB_ROOT; > > + INIT_LIST_HEAD(&decay_list); > > + > > + for (i = 0; i < MAX_VA_SIZE_PAGES; i++) { > > + struct list_head tmp_list; > > + > > + if (list_empty(&vn->pool[i].head)) > > + continue; > > + > > + INIT_LIST_HEAD(&tmp_list); > > + > > + /* Detach the pool, so no-one can access it. */ > > + spin_lock(&vn->pool_lock); > > + list_replace_init(&vn->pool[i].head, &tmp_list); > > + spin_unlock(&vn->pool_lock); > > + > > + if (full_decay) > > + WRITE_ONCE(vn->pool[i].len, 0); > > + > > + /* Decay a pool by ~25% out of left objects. */ > > This isn't true if the pool has less than 4 objects. If there are 3 > objects, n_decay = 0. > This is expectable. > > + n_decay = vn->pool[i].len >> 2; > > + > > + list_for_each_entry_safe(va, nva, &tmp_list, list) { > > + list_del_init(&va->list); > > + merge_or_add_vmap_area(va, &decay_root, &decay_list); > > + > > + if (!full_decay) { > > + WRITE_ONCE(vn->pool[i].len, vn->pool[i].len - 1); > > + > > + if (!--n_decay) > > + break; > > Here, --n_decay will make n_decay 0xffffffffffffffff, > then all left objects are reclaimed. Right. Last three objects do not play a big game. -- Uladzislau Rezki
On 02/28/24 at 11:39am, Uladzislau Rezki wrote: > On Wed, Feb 28, 2024 at 05:48:53PM +0800, Baoquan He wrote: > > On 01/02/24 at 07:46pm, Uladzislau Rezki (Sony) wrote: > > .....snip... > > > +static void > > > +decay_va_pool_node(struct vmap_node *vn, bool full_decay) > > > +{ > > > + struct vmap_area *va, *nva; > > > + struct list_head decay_list; > > > + struct rb_root decay_root; > > > + unsigned long n_decay; > > > + int i; > > > + > > > + decay_root = RB_ROOT; > > > + INIT_LIST_HEAD(&decay_list); > > > + > > > + for (i = 0; i < MAX_VA_SIZE_PAGES; i++) { > > > + struct list_head tmp_list; > > > + > > > + if (list_empty(&vn->pool[i].head)) > > > + continue; > > > + > > > + INIT_LIST_HEAD(&tmp_list); > > > + > > > + /* Detach the pool, so no-one can access it. */ > > > + spin_lock(&vn->pool_lock); > > > + list_replace_init(&vn->pool[i].head, &tmp_list); > > > + spin_unlock(&vn->pool_lock); > > > + > > > + if (full_decay) > > > + WRITE_ONCE(vn->pool[i].len, 0); > > > + > > > + /* Decay a pool by ~25% out of left objects. */ > > > > This isn't true if the pool has less than 4 objects. If there are 3 > > objects, n_decay = 0. > > > This is expectable. > > > > + n_decay = vn->pool[i].len >> 2; > > > + > > > + list_for_each_entry_safe(va, nva, &tmp_list, list) { > > > + list_del_init(&va->list); > > > + merge_or_add_vmap_area(va, &decay_root, &decay_list); > > > + > > > + if (!full_decay) { > > > + WRITE_ONCE(vn->pool[i].len, vn->pool[i].len - 1); > > > + > > > + if (!--n_decay) > > > + break; > > > > Here, --n_decay will make n_decay 0xffffffffffffffff, > > then all left objects are reclaimed. > Right. Last three objects do not play a big game. See it now, thanks.
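A standalone demonstration of the corner case discussed in this subthread: with fewer than four objects, len >> 2 is 0, the unsigned pre-decrement wraps, so the "break" never fires and the whole small pool is decayed (which the author considers acceptable). This mirrors the quoted logic only; it is not kernel code.

```c
#include <stdio.h>

int main(void)
{
	unsigned long len = 3;			/* a pool with 3 objects */
	unsigned long n_decay = len >> 2;	/* 3 / 4 == 0 */
	unsigned long decayed = 0;

	for (unsigned long i = 0; i < len; i++) {
		decayed++;
		if (!--n_decay)			/* 0 - 1 wraps to ULONG_MAX */
			break;
	}

	/* Prints: decayed 3 of 3 objects. */
	printf("decayed %lu of %lu objects\n", decayed, len);
	return 0;
}
```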
Hi, On Tue, Jan 02, 2024 at 07:46:29PM +0100, Uladzislau Rezki (Sony) wrote: > Concurrent access to a global vmap space is a bottle-neck. > We can simulate a high contention by running a vmalloc test > suite. > > To address it, introduce an effective vmap node logic. Each > node behaves as independent entity. When a node is accessed > it serves a request directly(if possible) from its pool. > > This model has a size based pool for requests, i.e. pools are > serialized and populated based on object size and real demand. > A maximum object size that pool can handle is set to 256 pages. > > This technique reduces a pressure on the global vmap lock. > > Signed-off-by: Uladzislau Rezki (Sony) <urezki@gmail.com> This patch results in a persistent "spinlock bad magic" message when booting s390 images with spinlock debugging enabled. [ 0.465445] BUG: spinlock bad magic on CPU#0, swapper/0 [ 0.465490] lock: single+0x1860/0x1958, .magic: 00000000, .owner: <none>/-1, .owner_cpu: 0 [ 0.466067] CPU: 0 PID: 0 Comm: swapper Not tainted 6.8.0-12955-g8e938e398669 #1 [ 0.466188] Hardware name: QEMU 8561 QEMU (KVM/Linux) [ 0.466270] Call Trace: [ 0.466470] [<00000000011f26c8>] dump_stack_lvl+0x98/0xd8 [ 0.466516] [<00000000001dcc6a>] do_raw_spin_lock+0x8a/0x108 [ 0.466545] [<000000000042146c>] find_vmap_area+0x6c/0x108 [ 0.466572] [<000000000042175a>] find_vm_area+0x22/0x40 [ 0.466597] [<000000000012f152>] __set_memory+0x132/0x150 [ 0.466624] [<0000000001cc0398>] vmem_map_init+0x40/0x118 [ 0.466651] [<0000000001cc0092>] paging_init+0x22/0x68 [ 0.466677] [<0000000001cbbed2>] setup_arch+0x52a/0x708 [ 0.466702] [<0000000001cb6140>] start_kernel+0x80/0x5c8 [ 0.466727] [<0000000000100036>] startup_continue+0x36/0x40 Bisect results and decoded stacktrace below. The uninitialized spinlock is &vn->busy.lock. Debugging shows that this lock is actually never initialized. [ 0.464684] ####### locking 0000000002280fb8 [ 0.464862] BUG: spinlock bad magic on CPU#0, swapper/0 ... [ 0.464684] ####### locking 0000000002280fb8 [ 0.477479] ####### locking 0000000002280fb8 [ 0.478166] ####### locking 0000000002280fb8 [ 0.478218] ####### locking 0000000002280fb8 ... [ 0.718250] #### busy lock init 0000000002871860 [ 0.718328] #### busy lock init 00000000028731b8 Only the initialized locks are used after the call to vmap_init_nodes(). 
Guenter --- # bad: [8e938e39866920ddc266898e6ae1fffc5c8f51aa] Merge tag '6.9-rc-smb3-client-fixes-part2' of git://git.samba.org/sfrench/cifs-2.6 # good: [e8f897f4afef0031fe618a8e94127a0934896aba] Linux 6.8 git bisect start 'HEAD' 'v6.8' # good: [e56bc745fa1de77abc2ad8debc4b1b83e0426c49] smb311: additional compression flag defined in updated protocol spec git bisect good e56bc745fa1de77abc2ad8debc4b1b83e0426c49 # bad: [902861e34c401696ed9ad17a54c8790e7e8e3069] Merge tag 'mm-stable-2024-03-13-20-04' of git://git.kernel.org/pub/scm/linux/kernel/git/akpm/mm git bisect bad 902861e34c401696ed9ad17a54c8790e7e8e3069 # good: [480e035fc4c714fb5536e64ab9db04fedc89e910] Merge tag 'drm-next-2024-03-13' of https://gitlab.freedesktop.org/drm/kernel git bisect good 480e035fc4c714fb5536e64ab9db04fedc89e910 # good: [fe46a7dd189e25604716c03576d05ac8a5209743] Merge tag 'sound-6.9-rc1' of git://git.kernel.org/pub/scm/linux/kernel/git/tiwai/sound git bisect good fe46a7dd189e25604716c03576d05ac8a5209743 # bad: [435a75548109f19e5b5b14ae35b9acb063c084e9] mm: use folio more widely in __split_huge_page git bisect bad 435a75548109f19e5b5b14ae35b9acb063c084e9 # good: [4d5bf0b6183f79ea361dd506365d2a471270735c] mm/mmu_gather: add tlb_remove_tlb_entries() git bisect good 4d5bf0b6183f79ea361dd506365d2a471270735c # bad: [4daacfe8f99f4b4cef562649d56c48642981f46e] mm/damon/sysfs-schemes: support PSI-based quota auto-tune git bisect bad 4daacfe8f99f4b4cef562649d56c48642981f46e # good: [217b2119b9e260609958db413876f211038f00ee] mm,page_owner: implement the tracking of the stacks count git bisect good 217b2119b9e260609958db413876f211038f00ee # bad: [40254101d87870b2e5ac3ddc28af40aa04c48486] arm64, crash: wrap crash dumping code into crash related ifdefs git bisect bad 40254101d87870b2e5ac3ddc28af40aa04c48486 # bad: [53becf32aec1c8049b854f0c31a11df5ed75df6f] mm: vmalloc: support multiple nodes in vread_iter git bisect bad 53becf32aec1c8049b854f0c31a11df5ed75df6f # good: [7fa8cee003166ef6db0bba70d610dbf173543811] mm: vmalloc: move vmap_init_free_space() down in vmalloc.c git bisect good 7fa8cee003166ef6db0bba70d610dbf173543811 # good: [282631cb2447318e2a55b41a665dbe8571c46d70] mm: vmalloc: remove global purge_vmap_area_root rb-tree git bisect good 282631cb2447318e2a55b41a665dbe8571c46d70 # bad: [96aa8437d169b8e030a98e2b74fd9a8ee9d3be7e] mm: vmalloc: add a scan area of VA only once git bisect bad 96aa8437d169b8e030a98e2b74fd9a8ee9d3be7e # bad: [72210662c5a2b6005f6daea7fe293a0dc573e1a5] mm: vmalloc: offload free_vmap_area_lock lock git bisect bad 72210662c5a2b6005f6daea7fe293a0dc573e1a5 # first bad commit: [72210662c5a2b6005f6daea7fe293a0dc573e1a5] mm: vmalloc: offload free_vmap_area_lock lock --- [ 0.465490] lock: single+0x1860/0x1958, .magic: 00000000, .owner: <none>/-1, .owner_cpu: 0 [ 0.466067] CPU: 0 PID: 0 Comm: swapper Not tainted 6.8.0-12955-g8e938e398669 #1 [ 0.466188] Hardware name: QEMU 8561 QEMU (KVM/Linux) [ 0.466270] Call Trace: [ 0.466470] dump_stack_lvl (lib/dump_stack.c:117) [ 0.466516] do_raw_spin_lock (kernel/locking/spinlock_debug.c:87 kernel/locking/spinlock_debug.c:115) [ 0.466545] find_vmap_area (mm/vmalloc.c:1059 mm/vmalloc.c:2364) [ 0.466572] find_vm_area (mm/vmalloc.c:3150) [ 0.466597] __set_memory (arch/s390/mm/pageattr.c:360 arch/s390/mm/pageattr.c:393) [ 0.466624] vmem_map_init (./arch/s390/include/asm/set_memory.h:55 arch/s390/mm/vmem.c:660) [ 0.466651] paging_init (arch/s390/mm/init.c:97) [ 0.466677] setup_arch (arch/s390/kernel/setup.c:972) [ 0.466702] start_kernel (init/main.c:899) [ 0.466727] 
startup_continue (arch/s390/kernel/head64.S:35)
[ 0.466811] INFO: lockdep is turned off.
On Fri, Mar 22, 2024 at 11:21:02AM -0700, Guenter Roeck wrote: > Hi, > > On Tue, Jan 02, 2024 at 07:46:29PM +0100, Uladzislau Rezki (Sony) wrote: > > Concurrent access to a global vmap space is a bottle-neck. > > We can simulate a high contention by running a vmalloc test > > suite. > > > > To address it, introduce an effective vmap node logic. Each > > node behaves as independent entity. When a node is accessed > > it serves a request directly(if possible) from its pool. > > > > This model has a size based pool for requests, i.e. pools are > > serialized and populated based on object size and real demand. > > A maximum object size that pool can handle is set to 256 pages. > > > > This technique reduces a pressure on the global vmap lock. > > > > Signed-off-by: Uladzislau Rezki (Sony) <urezki@gmail.com> > > This patch results in a persistent "spinlock bad magic" message > when booting s390 images with spinlock debugging enabled. > > [ 0.465445] BUG: spinlock bad magic on CPU#0, swapper/0 > [ 0.465490] lock: single+0x1860/0x1958, .magic: 00000000, .owner: <none>/-1, .owner_cpu: 0 > [ 0.466067] CPU: 0 PID: 0 Comm: swapper Not tainted 6.8.0-12955-g8e938e398669 #1 > [ 0.466188] Hardware name: QEMU 8561 QEMU (KVM/Linux) > [ 0.466270] Call Trace: > [ 0.466470] [<00000000011f26c8>] dump_stack_lvl+0x98/0xd8 > [ 0.466516] [<00000000001dcc6a>] do_raw_spin_lock+0x8a/0x108 > [ 0.466545] [<000000000042146c>] find_vmap_area+0x6c/0x108 > [ 0.466572] [<000000000042175a>] find_vm_area+0x22/0x40 > [ 0.466597] [<000000000012f152>] __set_memory+0x132/0x150 > [ 0.466624] [<0000000001cc0398>] vmem_map_init+0x40/0x118 > [ 0.466651] [<0000000001cc0092>] paging_init+0x22/0x68 > [ 0.466677] [<0000000001cbbed2>] setup_arch+0x52a/0x708 > [ 0.466702] [<0000000001cb6140>] start_kernel+0x80/0x5c8 > [ 0.466727] [<0000000000100036>] startup_continue+0x36/0x40 > > Bisect results and decoded stacktrace below. > > The uninitialized spinlock is &vn->busy.lock. > Debugging shows that this lock is actually never initialized. > It is. Once the vmalloc_init() "main entry" function is called from the: <snip> start_kernel() mm_core_init() vmalloc_init() <snip> > [ 0.464684] ####### locking 0000000002280fb8 > [ 0.464862] BUG: spinlock bad magic on CPU#0, swapper/0 > ... > [ 0.464684] ####### locking 0000000002280fb8 > [ 0.477479] ####### locking 0000000002280fb8 > [ 0.478166] ####### locking 0000000002280fb8 > [ 0.478218] ####### locking 0000000002280fb8 > ... > [ 0.718250] #### busy lock init 0000000002871860 > [ 0.718328] #### busy lock init 00000000028731b8 > > Only the initialized locks are used after the call to vmap_init_nodes(). > Right, when the vmap space and vmalloc is initialized. 
> Guenter > > --- > # bad: [8e938e39866920ddc266898e6ae1fffc5c8f51aa] Merge tag '6.9-rc-smb3-client-fixes-part2' of git://git.samba.org/sfrench/cifs-2.6 > # good: [e8f897f4afef0031fe618a8e94127a0934896aba] Linux 6.8 > git bisect start 'HEAD' 'v6.8' > # good: [e56bc745fa1de77abc2ad8debc4b1b83e0426c49] smb311: additional compression flag defined in updated protocol spec > git bisect good e56bc745fa1de77abc2ad8debc4b1b83e0426c49 > # bad: [902861e34c401696ed9ad17a54c8790e7e8e3069] Merge tag 'mm-stable-2024-03-13-20-04' of git://git.kernel.org/pub/scm/linux/kernel/git/akpm/mm > git bisect bad 902861e34c401696ed9ad17a54c8790e7e8e3069 > # good: [480e035fc4c714fb5536e64ab9db04fedc89e910] Merge tag 'drm-next-2024-03-13' of https://gitlab.freedesktop.org/drm/kernel > git bisect good 480e035fc4c714fb5536e64ab9db04fedc89e910 > # good: [fe46a7dd189e25604716c03576d05ac8a5209743] Merge tag 'sound-6.9-rc1' of git://git.kernel.org/pub/scm/linux/kernel/git/tiwai/sound > git bisect good fe46a7dd189e25604716c03576d05ac8a5209743 > # bad: [435a75548109f19e5b5b14ae35b9acb063c084e9] mm: use folio more widely in __split_huge_page > git bisect bad 435a75548109f19e5b5b14ae35b9acb063c084e9 > # good: [4d5bf0b6183f79ea361dd506365d2a471270735c] mm/mmu_gather: add tlb_remove_tlb_entries() > git bisect good 4d5bf0b6183f79ea361dd506365d2a471270735c > # bad: [4daacfe8f99f4b4cef562649d56c48642981f46e] mm/damon/sysfs-schemes: support PSI-based quota auto-tune > git bisect bad 4daacfe8f99f4b4cef562649d56c48642981f46e > # good: [217b2119b9e260609958db413876f211038f00ee] mm,page_owner: implement the tracking of the stacks count > git bisect good 217b2119b9e260609958db413876f211038f00ee > # bad: [40254101d87870b2e5ac3ddc28af40aa04c48486] arm64, crash: wrap crash dumping code into crash related ifdefs > git bisect bad 40254101d87870b2e5ac3ddc28af40aa04c48486 > # bad: [53becf32aec1c8049b854f0c31a11df5ed75df6f] mm: vmalloc: support multiple nodes in vread_iter > git bisect bad 53becf32aec1c8049b854f0c31a11df5ed75df6f > # good: [7fa8cee003166ef6db0bba70d610dbf173543811] mm: vmalloc: move vmap_init_free_space() down in vmalloc.c > git bisect good 7fa8cee003166ef6db0bba70d610dbf173543811 > # good: [282631cb2447318e2a55b41a665dbe8571c46d70] mm: vmalloc: remove global purge_vmap_area_root rb-tree > git bisect good 282631cb2447318e2a55b41a665dbe8571c46d70 > # bad: [96aa8437d169b8e030a98e2b74fd9a8ee9d3be7e] mm: vmalloc: add a scan area of VA only once > git bisect bad 96aa8437d169b8e030a98e2b74fd9a8ee9d3be7e > # bad: [72210662c5a2b6005f6daea7fe293a0dc573e1a5] mm: vmalloc: offload free_vmap_area_lock lock > git bisect bad 72210662c5a2b6005f6daea7fe293a0dc573e1a5 > # first bad commit: [72210662c5a2b6005f6daea7fe293a0dc573e1a5] mm: vmalloc: offload free_vmap_area_lock lock > > --- > [ 0.465490] lock: single+0x1860/0x1958, .magic: 00000000, .owner: <none>/-1, .owner_cpu: 0 > [ 0.466067] CPU: 0 PID: 0 Comm: swapper Not tainted 6.8.0-12955-g8e938e398669 #1 > [ 0.466188] Hardware name: QEMU 8561 QEMU (KVM/Linux) > [ 0.466270] Call Trace: > [ 0.466470] dump_stack_lvl (lib/dump_stack.c:117) > [ 0.466516] do_raw_spin_lock (kernel/locking/spinlock_debug.c:87 kernel/locking/spinlock_debug.c:115) > [ 0.466545] find_vmap_area (mm/vmalloc.c:1059 mm/vmalloc.c:2364) > [ 0.466572] find_vm_area (mm/vmalloc.c:3150) > [ 0.466597] __set_memory (arch/s390/mm/pageattr.c:360 arch/s390/mm/pageattr.c:393) > [ 0.466624] vmem_map_init (./arch/s390/include/asm/set_memory.h:55 arch/s390/mm/vmem.c:660) > [ 0.466651] paging_init (arch/s390/mm/init.c:97) > [ 0.466677] 
setup_arch (arch/s390/kernel/setup.c:972) > [ 0.466702] start_kernel (init/main.c:899) > [ 0.466727] startup_continue (arch/s390/kernel/head64.S:35) > [ 0.466811] INFO: lockdep is turned off. > <snip> diff --git a/mm/vmalloc.c b/mm/vmalloc.c index 22aa63f4ef63..0d77d171b5d9 100644 --- a/mm/vmalloc.c +++ b/mm/vmalloc.c @@ -2343,6 +2343,9 @@ struct vmap_area *find_vmap_area(unsigned long addr) struct vmap_area *va; int i, j; + if (unlikely(!vmap_initialized)) + return NULL; + /* * An addr_to_node_id(addr) converts an address to a node index * where a VA is located. If VA spans several zones and passed <snip> Could you please test it? -- Uladzislau Rezki
On 3/22/24 12:03, Uladzislau Rezki wrote:
[ ... ]
> <snip>
> diff --git a/mm/vmalloc.c b/mm/vmalloc.c
> index 22aa63f4ef63..0d77d171b5d9 100644
> --- a/mm/vmalloc.c
> +++ b/mm/vmalloc.c
> @@ -2343,6 +2343,9 @@ struct vmap_area *find_vmap_area(unsigned long addr)
> 	struct vmap_area *va;
> 	int i, j;
>
> +	if (unlikely(!vmap_initialized))
> +		return NULL;
> +
> 	/*
> 	 * An addr_to_node_id(addr) converts an address to a node index
> 	 * where a VA is located. If VA spans several zones and passed
> <snip>
>
> Could you please test it?
>
That fixes the problem.

Thanks,
Guenter
diff --git a/mm/vmalloc.c b/mm/vmalloc.c index 9b2f1b0cac9d..fa4ab2bbbc5b 100644 --- a/mm/vmalloc.c +++ b/mm/vmalloc.c @@ -775,7 +775,22 @@ struct rb_list { spinlock_t lock; }; +struct vmap_pool { + struct list_head head; + unsigned long len; +}; + +/* + * A fast size storage contains VAs up to 1M size. + */ +#define MAX_VA_SIZE_PAGES 256 + static struct vmap_node { + /* Simple size segregated storage. */ + struct vmap_pool pool[MAX_VA_SIZE_PAGES]; + spinlock_t pool_lock; + bool skip_populate; + /* Bookkeeping data of this node. */ struct rb_list busy; struct rb_list lazy; @@ -784,6 +799,8 @@ static struct vmap_node { * Ready-to-free areas. */ struct list_head purge_list; + struct work_struct purge_work; + unsigned long nr_purged; } single; static struct vmap_node *vmap_nodes = &single; @@ -802,6 +819,61 @@ addr_to_node(unsigned long addr) return &vmap_nodes[addr_to_node_id(addr)]; } +static inline struct vmap_node * +id_to_node(unsigned int id) +{ + return &vmap_nodes[id % nr_vmap_nodes]; +} + +/* + * We use the value 0 to represent "no node", that is why + * an encoded value will be the node-id incremented by 1. + * It is always greater then 0. A valid node_id which can + * be encoded is [0:nr_vmap_nodes - 1]. If a passed node_id + * is not valid 0 is returned. + */ +static unsigned int +encode_vn_id(unsigned int node_id) +{ + /* Can store U8_MAX [0:254] nodes. */ + if (node_id < nr_vmap_nodes) + return (node_id + 1) << BITS_PER_BYTE; + + /* Warn and no node encoded. */ + WARN_ONCE(1, "Encode wrong node id (%u)\n", node_id); + return 0; +} + +/* + * Returns an encoded node-id, the valid range is within + * [0:nr_vmap_nodes-1] values. Otherwise nr_vmap_nodes is + * returned if extracted data is wrong. + */ +static unsigned int +decode_vn_id(unsigned int val) +{ + unsigned int node_id = (val >> BITS_PER_BYTE) - 1; + + /* Can store U8_MAX [0:254] nodes. */ + if (node_id < nr_vmap_nodes) + return node_id; + + /* If it was _not_ zero, warn. */ + WARN_ONCE(node_id != UINT_MAX, + "Decode wrong node id (%d)\n", node_id); + + return nr_vmap_nodes; +} + +static bool +is_vn_id_valid(unsigned int node_id) +{ + if (node_id < nr_vmap_nodes) + return true; + + return false; +} + static __always_inline unsigned long va_size(struct vmap_area *va) { @@ -1623,6 +1695,104 @@ preload_this_cpu_lock(spinlock_t *lock, gfp_t gfp_mask, int node) kmem_cache_free(vmap_area_cachep, va); } +static struct vmap_pool * +size_to_va_pool(struct vmap_node *vn, unsigned long size) +{ + unsigned int idx = (size - 1) / PAGE_SIZE; + + if (idx < MAX_VA_SIZE_PAGES) + return &vn->pool[idx]; + + return NULL; +} + +static bool +node_pool_add_va(struct vmap_node *n, struct vmap_area *va) +{ + struct vmap_pool *vp; + + vp = size_to_va_pool(n, va_size(va)); + if (!vp) + return false; + + spin_lock(&n->pool_lock); + list_add(&va->list, &vp->head); + WRITE_ONCE(vp->len, vp->len + 1); + spin_unlock(&n->pool_lock); + + return true; +} + +static struct vmap_area * +node_pool_del_va(struct vmap_node *vn, unsigned long size, + unsigned long align, unsigned long vstart, + unsigned long vend) +{ + struct vmap_area *va = NULL; + struct vmap_pool *vp; + int err = 0; + + vp = size_to_va_pool(vn, size); + if (!vp || list_empty(&vp->head)) + return NULL; + + spin_lock(&vn->pool_lock); + if (!list_empty(&vp->head)) { + va = list_first_entry(&vp->head, struct vmap_area, list); + + if (IS_ALIGNED(va->va_start, align)) { + /* + * Do some sanity check and emit a warning + * if one of below checks detects an error. 
+ */ + err |= (va_size(va) != size); + err |= (va->va_start < vstart); + err |= (va->va_end > vend); + + if (!WARN_ON_ONCE(err)) { + list_del_init(&va->list); + WRITE_ONCE(vp->len, vp->len - 1); + } else { + va = NULL; + } + } else { + list_move_tail(&va->list, &vp->head); + va = NULL; + } + } + spin_unlock(&vn->pool_lock); + + return va; +} + +static struct vmap_area * +node_alloc(unsigned long size, unsigned long align, + unsigned long vstart, unsigned long vend, + unsigned long *addr, unsigned int *vn_id) +{ + struct vmap_area *va; + + *vn_id = 0; + *addr = vend; + + /* + * Fallback to a global heap if not vmalloc or there + * is only one node. + */ + if (vstart != VMALLOC_START || vend != VMALLOC_END || + nr_vmap_nodes == 1) + return NULL; + + *vn_id = raw_smp_processor_id() % nr_vmap_nodes; + va = node_pool_del_va(id_to_node(*vn_id), size, align, vstart, vend); + *vn_id = encode_vn_id(*vn_id); + + if (va) + *addr = va->va_start; + + return va; +} + /* * Allocate a region of KVA of the specified size and alignment, within the * vstart and vend. @@ -1637,6 +1807,7 @@ static struct vmap_area *alloc_vmap_area(unsigned long size, struct vmap_area *va; unsigned long freed; unsigned long addr; + unsigned int vn_id; int purged = 0; int ret; @@ -1647,11 +1818,23 @@ static struct vmap_area *alloc_vmap_area(unsigned long size, return ERR_PTR(-EBUSY); might_sleep(); - gfp_mask = gfp_mask & GFP_RECLAIM_MASK; - va = kmem_cache_alloc_node(vmap_area_cachep, gfp_mask, node); - if (unlikely(!va)) - return ERR_PTR(-ENOMEM); + /* + * If a VA is obtained from a global heap(if it fails here) + * it is anyway marked with this "vn_id" so it is returned + * to this pool's node later. Such way gives a possibility + * to populate pools based on users demand. + * + * On success a ready to go VA is returned. + */ + va = node_alloc(size, align, vstart, vend, &addr, &vn_id); + if (!va) { + gfp_mask = gfp_mask & GFP_RECLAIM_MASK; + + va = kmem_cache_alloc_node(vmap_area_cachep, gfp_mask, node); + if (unlikely(!va)) + return ERR_PTR(-ENOMEM); + } /* * Only scan the relevant parts containing pointers to other objects @@ -1660,10 +1843,12 @@ static struct vmap_area *alloc_vmap_area(unsigned long size, kmemleak_scan_area(&va->rb_node, SIZE_MAX, gfp_mask); retry: - preload_this_cpu_lock(&free_vmap_area_lock, gfp_mask, node); - addr = __alloc_vmap_area(&free_vmap_area_root, &free_vmap_area_list, - size, align, vstart, vend); - spin_unlock(&free_vmap_area_lock); + if (addr == vend) { + preload_this_cpu_lock(&free_vmap_area_lock, gfp_mask, node); + addr = __alloc_vmap_area(&free_vmap_area_root, &free_vmap_area_list, + size, align, vstart, vend); + spin_unlock(&free_vmap_area_lock); + } trace_alloc_vmap_area(addr, size, align, vstart, vend, addr == vend); @@ -1677,7 +1862,7 @@ static struct vmap_area *alloc_vmap_area(unsigned long size, va->va_start = addr; va->va_end = addr + size; va->vm = NULL; - va->flags = va_flags; + va->flags = (va_flags | vn_id); vn = addr_to_node(va->va_start); @@ -1770,63 +1955,135 @@ static DEFINE_MUTEX(vmap_purge_lock); static void purge_fragmented_blocks_allcpus(void); static cpumask_t purge_nodes; -/* - * Purges all lazily-freed vmap areas. 
- */ -static unsigned long -purge_vmap_node(struct vmap_node *vn) +static void +reclaim_list_global(struct list_head *head) { - unsigned long num_purged_areas = 0; - struct vmap_area *va, *n_va; + struct vmap_area *va, *n; - if (list_empty(&vn->purge_list)) - return 0; + if (list_empty(head)) + return; spin_lock(&free_vmap_area_lock); + list_for_each_entry_safe(va, n, head, list) + merge_or_add_vmap_area_augment(va, + &free_vmap_area_root, &free_vmap_area_list); + spin_unlock(&free_vmap_area_lock); +} + +static void +decay_va_pool_node(struct vmap_node *vn, bool full_decay) +{ + struct vmap_area *va, *nva; + struct list_head decay_list; + struct rb_root decay_root; + unsigned long n_decay; + int i; + + decay_root = RB_ROOT; + INIT_LIST_HEAD(&decay_list); + + for (i = 0; i < MAX_VA_SIZE_PAGES; i++) { + struct list_head tmp_list; + + if (list_empty(&vn->pool[i].head)) + continue; + + INIT_LIST_HEAD(&tmp_list); + + /* Detach the pool, so no-one can access it. */ + spin_lock(&vn->pool_lock); + list_replace_init(&vn->pool[i].head, &tmp_list); + spin_unlock(&vn->pool_lock); + + if (full_decay) + WRITE_ONCE(vn->pool[i].len, 0); + + /* Decay a pool by ~25% out of left objects. */ + n_decay = vn->pool[i].len >> 2; + + list_for_each_entry_safe(va, nva, &tmp_list, list) { + list_del_init(&va->list); + merge_or_add_vmap_area(va, &decay_root, &decay_list); + + if (!full_decay) { + WRITE_ONCE(vn->pool[i].len, vn->pool[i].len - 1); + + if (!--n_decay) + break; + } + } + + /* Attach the pool back if it has been partly decayed. */ + if (!full_decay && !list_empty(&tmp_list)) { + spin_lock(&vn->pool_lock); + list_replace_init(&tmp_list, &vn->pool[i].head); + spin_unlock(&vn->pool_lock); + } + } + + reclaim_list_global(&decay_list); +} + +static void purge_vmap_node(struct work_struct *work) +{ + struct vmap_node *vn = container_of(work, + struct vmap_node, purge_work); + struct vmap_area *va, *n_va; + LIST_HEAD(local_list); + + vn->nr_purged = 0; + list_for_each_entry_safe(va, n_va, &vn->purge_list, list) { unsigned long nr = (va->va_end - va->va_start) >> PAGE_SHIFT; unsigned long orig_start = va->va_start; unsigned long orig_end = va->va_end; + unsigned int vn_id = decode_vn_id(va->flags); - /* - * Finally insert or merge lazily-freed area. It is - * detached and there is no need to "unlink" it from - * anything. - */ - va = merge_or_add_vmap_area_augment(va, &free_vmap_area_root, - &free_vmap_area_list); - - if (!va) - continue; + list_del_init(&va->list); if (is_vmalloc_or_module_addr((void *)orig_start)) kasan_release_vmalloc(orig_start, orig_end, va->va_start, va->va_end); atomic_long_sub(nr, &vmap_lazy_nr); - num_purged_areas++; + vn->nr_purged++; + + if (is_vn_id_valid(vn_id) && !vn->skip_populate) + if (node_pool_add_va(vn, va)) + continue; + + /* Go back to global. */ + list_add(&va->list, &local_list); } - spin_unlock(&free_vmap_area_lock); - return num_purged_areas; + reclaim_list_global(&local_list); } /* * Purges all lazily-freed vmap areas. */ -static bool __purge_vmap_area_lazy(unsigned long start, unsigned long end) +static bool __purge_vmap_area_lazy(unsigned long start, unsigned long end, + bool full_pool_decay) { - unsigned long num_purged_areas = 0; + unsigned long nr_purged_areas = 0; + unsigned int nr_purge_helpers; + unsigned int nr_purge_nodes; struct vmap_node *vn; int i; lockdep_assert_held(&vmap_purge_lock); + + /* + * Use cpumask to mark which node has to be processed. 
+ */ purge_nodes = CPU_MASK_NONE; for (i = 0; i < nr_vmap_nodes; i++) { vn = &vmap_nodes[i]; INIT_LIST_HEAD(&vn->purge_list); + vn->skip_populate = full_pool_decay; + decay_va_pool_node(vn, full_pool_decay); if (RB_EMPTY_ROOT(&vn->lazy.root)) continue; @@ -1845,17 +2102,45 @@ static bool __purge_vmap_area_lazy(unsigned long start, unsigned long end) cpumask_set_cpu(i, &purge_nodes); } - if (cpumask_weight(&purge_nodes) > 0) { + nr_purge_nodes = cpumask_weight(&purge_nodes); + if (nr_purge_nodes > 0) { flush_tlb_kernel_range(start, end); + /* One extra worker is per a lazy_max_pages() full set minus one. */ + nr_purge_helpers = atomic_long_read(&vmap_lazy_nr) / lazy_max_pages(); + nr_purge_helpers = clamp(nr_purge_helpers, 1U, nr_purge_nodes) - 1; + for_each_cpu(i, &purge_nodes) { - vn = &nodes[i]; - num_purged_areas += purge_vmap_node(vn); + vn = &vmap_nodes[i]; + + if (nr_purge_helpers > 0) { + INIT_WORK(&vn->purge_work, purge_vmap_node); + + if (cpumask_test_cpu(i, cpu_online_mask)) + schedule_work_on(i, &vn->purge_work); + else + schedule_work(&vn->purge_work); + + nr_purge_helpers--; + } else { + vn->purge_work.func = NULL; + purge_vmap_node(&vn->purge_work); + nr_purged_areas += vn->nr_purged; + } + } + + for_each_cpu(i, &purge_nodes) { + vn = &vmap_nodes[i]; + + if (vn->purge_work.func) { + flush_work(&vn->purge_work); + nr_purged_areas += vn->nr_purged; + } } } - trace_purge_vmap_area_lazy(start, end, num_purged_areas); - return num_purged_areas > 0; + trace_purge_vmap_area_lazy(start, end, nr_purged_areas); + return nr_purged_areas > 0; } /* @@ -1866,14 +2151,14 @@ static void reclaim_and_purge_vmap_areas(void) { mutex_lock(&vmap_purge_lock); purge_fragmented_blocks_allcpus(); - __purge_vmap_area_lazy(ULONG_MAX, 0); + __purge_vmap_area_lazy(ULONG_MAX, 0, true); mutex_unlock(&vmap_purge_lock); } static void drain_vmap_area_work(struct work_struct *work) { mutex_lock(&vmap_purge_lock); - __purge_vmap_area_lazy(ULONG_MAX, 0); + __purge_vmap_area_lazy(ULONG_MAX, 0, false); mutex_unlock(&vmap_purge_lock); } @@ -1884,9 +2169,10 @@ static void drain_vmap_area_work(struct work_struct *work) */ static void free_vmap_area_noflush(struct vmap_area *va) { - struct vmap_node *vn = addr_to_node(va->va_start); unsigned long nr_lazy_max = lazy_max_pages(); unsigned long va_start = va->va_start; + unsigned int vn_id = decode_vn_id(va->flags); + struct vmap_node *vn; unsigned long nr_lazy; if (WARN_ON_ONCE(!list_empty(&va->list))) @@ -1896,10 +2182,14 @@ static void free_vmap_area_noflush(struct vmap_area *va) PAGE_SHIFT, &vmap_lazy_nr); /* - * Merge or place it to the purge tree/list. + * If it was request by a certain node we would like to + * return it to that node, i.e. its pool for later reuse. */ + vn = is_vn_id_valid(vn_id) ? 
+ id_to_node(vn_id):addr_to_node(va->va_start); + spin_lock(&vn->lazy.lock); - merge_or_add_vmap_area(va, &vn->lazy.root, &vn->lazy.head); + insert_vmap_area(va, &vn->lazy.root, &vn->lazy.head); spin_unlock(&vn->lazy.lock); trace_free_vmap_area_noflush(va_start, nr_lazy, nr_lazy_max); @@ -2408,7 +2698,7 @@ static void _vm_unmap_aliases(unsigned long start, unsigned long end, int flush) } free_purged_blocks(&purge_list); - if (!__purge_vmap_area_lazy(start, end) && flush) + if (!__purge_vmap_area_lazy(start, end, false) && flush) flush_tlb_kernel_range(start, end); mutex_unlock(&vmap_purge_lock); } @@ -4576,7 +4866,7 @@ static void vmap_init_free_space(void) static void vmap_init_nodes(void) { struct vmap_node *vn; - int i; + int i, j; for (i = 0; i < nr_vmap_nodes; i++) { vn = &vmap_nodes[i]; @@ -4587,6 +4877,13 @@ static void vmap_init_nodes(void) vn->lazy.root = RB_ROOT; INIT_LIST_HEAD(&vn->lazy.head); spin_lock_init(&vn->lazy.lock); + + for (j = 0; j < MAX_VA_SIZE_PAGES; j++) { + INIT_LIST_HEAD(&vn->pool[j].head); + WRITE_ONCE(vn->pool[j].len, 0); + } + + spin_lock_init(&vn->pool_lock); } }
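A quick way to see what the encode_vn_id()/decode_vn_id() convention in the diff above buys: the node id is biased by one and parked in the second byte of va->flags, so a raw value of 0 still reads as "no node" and the low byte stays free for the existing VMAP_* flags. The snippet below is a stand-alone user-space sketch of that arithmetic, not kernel code; NR_VMAP_NODES and the fake low-byte flag are made-up values for illustration only.

#include <stdio.h>

#define BITS_PER_BYTE	8
#define NR_VMAP_NODES	8	/* made-up here; the patch sizes this at boot */

/* Valid ids are [0:NR_VMAP_NODES - 1]; an encoded value of 0 means "no node". */
static unsigned int encode_vn_id(unsigned int node_id)
{
	return (node_id < NR_VMAP_NODES) ?
		(node_id + 1) << BITS_PER_BYTE : 0;
}

/* Out-of-range input (including the "no node" marker) decodes to NR_VMAP_NODES. */
static unsigned int decode_vn_id(unsigned int val)
{
	unsigned int node_id = (val >> BITS_PER_BYTE) - 1;

	return (node_id < NR_VMAP_NODES) ? node_id : NR_VMAP_NODES;
}

int main(void)
{
	/* Pretend the low byte already carries a VMAP_RAM-style flag bit. */
	unsigned int flags = 0x1 | encode_vn_id(3);

	printf("flags 0x%x -> node %u\n", flags, decode_vn_id(flags));	/* node 3 */
	printf("flags 0x%x -> node %u\n", 0x1, decode_vn_id(0x1));	/* NR_VMAP_NODES: no node */
	return 0;
}

Shifting by a whole byte is what lets the node id coexist with the flag bits already kept in va->flags: decode simply shifts the low byte away, so neither side needs extra masking.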
Concurrent access to the global vmap space is a bottleneck. We can simulate high contention by running a vmalloc test suite.

To address it, introduce an effective vmap node logic. Each node behaves as an independent entity. When a node is accessed, it serves a request directly (if possible) from its pool.

The model uses size-based pools for requests, i.e. pools are serialized and populated based on object size and real demand. The maximum object size a pool can handle is set to 256 pages.

This technique reduces pressure on the global vmap lock.

Signed-off-by: Uladzislau Rezki (Sony) <urezki@gmail.com>
---
 mm/vmalloc.c | 387 +++++++++++++++++++++++++++++++++++++++++++++------
 1 file changed, 342 insertions(+), 45 deletions(-)
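To make the "size based pool" wording above concrete, here is a rough user-space model of the lookup that size_to_va_pool() performs in the patch: a request of N bytes lands in pool index (N - 1) / PAGE_SIZE, and anything above MAX_VA_SIZE_PAGES pages (256 pages, i.e. 1 MiB with 4 KiB pages) is left to the global free space. PAGE_SIZE is hard-coded below purely for the demo.

#include <stdio.h>

#define PAGE_SIZE		4096UL	/* hard-coded for this demo only */
#define MAX_VA_SIZE_PAGES	256

/* Mirrors size_to_va_pool(): returns the pool index, or -1 for "too big". */
static int size_to_pool_idx(unsigned long size)
{
	unsigned long idx = (size - 1) / PAGE_SIZE;

	return (idx < MAX_VA_SIZE_PAGES) ? (int)idx : -1;
}

int main(void)
{
	unsigned long sizes[] = { PAGE_SIZE, 2 * PAGE_SIZE, 1UL << 20, (1UL << 20) + 1 };
	unsigned int i;

	for (i = 0; i < sizeof(sizes) / sizeof(sizes[0]); i++)
		printf("size %8lu -> pool %d\n", sizes[i], size_to_pool_idx(sizes[i]));

	/*
	 * size     4096 -> pool 0
	 * size     8192 -> pool 1
	 * size  1048576 -> pool 255
	 * size  1048577 -> pool -1	(beyond 256 pages: global heap path)
	 */
	return 0;
}

In the patch, a node first tries the matching pool under its own pool_lock and only falls back to the global free_vmap_area_root when that fails (the addr == vend case in alloc_vmap_area()), which is where the reduction in global lock pressure comes from.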