Message ID | 20240102184633.748113-8-urezki@gmail.com |
---|---|
State | New |
Series | Mitigate a vmap lock contention v3 |
On Tue, 2 Jan 2024 19:46:29 +0100 Uladzislau Rezki <urezki@gmail.com> > +static void > +decay_va_pool_node(struct vmap_node *vn, bool full_decay) > +{ > + struct vmap_area *va, *nva; > + struct list_head decay_list; > + struct rb_root decay_root; > + unsigned long n_decay; > + int i; > + > + decay_root = RB_ROOT; > + INIT_LIST_HEAD(&decay_list); > + > + for (i = 0; i < MAX_VA_SIZE_PAGES; i++) { > + struct list_head tmp_list; > + > + if (list_empty(&vn->pool[i].head)) > + continue; > + > + INIT_LIST_HEAD(&tmp_list); > + > + /* Detach the pool, so no-one can access it. */ > + spin_lock(&vn->pool_lock); > + list_replace_init(&vn->pool[i].head, &tmp_list); > + spin_unlock(&vn->pool_lock); > + > + if (full_decay) > + WRITE_ONCE(vn->pool[i].len, 0); > + > + /* Decay a pool by ~25% out of left objects. */ > + n_decay = vn->pool[i].len >> 2; > + > + list_for_each_entry_safe(va, nva, &tmp_list, list) { > + list_del_init(&va->list); > + merge_or_add_vmap_area(va, &decay_root, &decay_list); > + > + if (!full_decay) { > + WRITE_ONCE(vn->pool[i].len, vn->pool[i].len - 1); > + > + if (!--n_decay) > + break; > + } > + } > + > + /* Attach the pool back if it has been partly decayed. */ > + if (!full_decay && !list_empty(&tmp_list)) { > + spin_lock(&vn->pool_lock); > + list_replace_init(&tmp_list, &vn->pool[i].head); > + spin_unlock(&vn->pool_lock); > + } Failure of working out why list_splice() was not used here in case of non-empty vn->pool[i].head, after staring ten minutes. > + } > + > + reclaim_list_global(&decay_list); > +}
On Wed, Jan 03, 2024 at 07:08:32PM +0800, Hillf Danton wrote: > On Tue, 2 Jan 2024 19:46:29 +0100 Uladzislau Rezki <urezki@gmail.com> > > +static void > > +decay_va_pool_node(struct vmap_node *vn, bool full_decay) > > +{ > > + struct vmap_area *va, *nva; > > + struct list_head decay_list; > > + struct rb_root decay_root; > > + unsigned long n_decay; > > + int i; > > + > > + decay_root = RB_ROOT; > > + INIT_LIST_HEAD(&decay_list); > > + > > + for (i = 0; i < MAX_VA_SIZE_PAGES; i++) { > > + struct list_head tmp_list; > > + > > + if (list_empty(&vn->pool[i].head)) > > + continue; > > + > > + INIT_LIST_HEAD(&tmp_list); > > + > > + /* Detach the pool, so no-one can access it. */ > > + spin_lock(&vn->pool_lock); > > + list_replace_init(&vn->pool[i].head, &tmp_list); > > + spin_unlock(&vn->pool_lock); > > + > > + if (full_decay) > > + WRITE_ONCE(vn->pool[i].len, 0); > > + > > + /* Decay a pool by ~25% out of left objects. */ > > + n_decay = vn->pool[i].len >> 2; > > + > > + list_for_each_entry_safe(va, nva, &tmp_list, list) { > > + list_del_init(&va->list); > > + merge_or_add_vmap_area(va, &decay_root, &decay_list); > > + > > + if (!full_decay) { > > + WRITE_ONCE(vn->pool[i].len, vn->pool[i].len - 1); > > + > > + if (!--n_decay) > > + break; > > + } > > + } > > + > > + /* Attach the pool back if it has been partly decayed. */ > > + if (!full_decay && !list_empty(&tmp_list)) { > > + spin_lock(&vn->pool_lock); > > + list_replace_init(&tmp_list, &vn->pool[i].head); > > + spin_unlock(&vn->pool_lock); > > + } > > Failure of working out why list_splice() was not used here in case of > non-empty vn->pool[i].head, after staring ten minutes. > The vn->pool[i].head is always empty here because we have detached it above and initialized. Concurrent decay and populate also is not possible because both is done by only one context. -- Uladzislau Rezki
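A minimal sketch of the detach/reattach pattern referred to in the answer above, using made-up names (demo_pool, demo_pool_lock) rather than the patch's own structures; it only illustrates why list_splice() is not needed under the stated single-context assumption.

```c
#include <linux/list.h>
#include <linux/spinlock.h>

static LIST_HEAD(demo_pool);
static DEFINE_SPINLOCK(demo_pool_lock);

static void demo_decay(void)
{
	LIST_HEAD(tmp_list);

	/* Detach: demo_pool is left empty and re-initialized. */
	spin_lock(&demo_pool_lock);
	list_replace_init(&demo_pool, &tmp_list);
	spin_unlock(&demo_pool_lock);

	/* ... drop some entries from tmp_list without holding the lock ... */

	/*
	 * Reattach the remainder. The decay runs in a single context and
	 * nothing repopulates demo_pool in between, so the head is still
	 * empty here and list_replace_init() is sufficient; list_splice()
	 * would only matter if demo_pool could have gained new entries.
	 */
	if (!list_empty(&tmp_list)) {
		spin_lock(&demo_pool_lock);
		list_replace_init(&tmp_list, &demo_pool);
		spin_unlock(&demo_pool_lock);
	}
}
```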
On Tue, Jan 02, 2024 at 07:46:29PM +0100, Uladzislau Rezki (Sony) wrote:
> Concurrent access to a global vmap space is a bottle-neck.
> We can simulate a high contention by running a vmalloc test
> suite.
>
> To address it, introduce an effective vmap node logic. Each
> node behaves as independent entity. When a node is accessed
> it serves a request directly(if possible) from its pool.
>
> This model has a size based pool for requests, i.e. pools are
> serialized and populated based on object size and real demand.
> A maximum object size that pool can handle is set to 256 pages.
>
> This technique reduces a pressure on the global vmap lock.
>
> Signed-off-by: Uladzislau Rezki (Sony) <urezki@gmail.com>

Why not use a llist for this? That gets rid of the need for a
new pool_lock altogether...

Cheers,

Dave.
On Thu, Jan 11, 2024 at 08:02:16PM +1100, Dave Chinner wrote:
> On Tue, Jan 02, 2024 at 07:46:29PM +0100, Uladzislau Rezki (Sony) wrote:
> > Concurrent access to a global vmap space is a bottle-neck.
> > We can simulate a high contention by running a vmalloc test
> > suite.
> >
> > To address it, introduce an effective vmap node logic. Each
> > node behaves as independent entity. When a node is accessed
> > it serves a request directly(if possible) from its pool.
> >
> > This model has a size based pool for requests, i.e. pools are
> > serialized and populated based on object size and real demand.
> > A maximum object size that pool can handle is set to 256 pages.
> >
> > This technique reduces a pressure on the global vmap lock.
> >
> > Signed-off-by: Uladzislau Rezki (Sony) <urezki@gmail.com>
>
> Why not use a llist for this? That gets rid of the need for a
> new pool_lock altogether...
>
Initially i used the llist. I have changed it because i keep track
of objects per a pool to decay it later. I do not find these locks
as contented one therefore i did not think much.

Anyway, i will have a look at this to see if llist is easy to go with
or not. If so i will send out a separate patch.

Thanks!

--
Uladzislau Rezki
On Thu, Jan 11, 2024 at 04:54:48PM +0100, Uladzislau Rezki wrote: > On Thu, Jan 11, 2024 at 08:02:16PM +1100, Dave Chinner wrote: > > On Tue, Jan 02, 2024 at 07:46:29PM +0100, Uladzislau Rezki (Sony) wrote: > > > Concurrent access to a global vmap space is a bottle-neck. > > > We can simulate a high contention by running a vmalloc test > > > suite. > > > > > > To address it, introduce an effective vmap node logic. Each > > > node behaves as independent entity. When a node is accessed > > > it serves a request directly(if possible) from its pool. > > > > > > This model has a size based pool for requests, i.e. pools are > > > serialized and populated based on object size and real demand. > > > A maximum object size that pool can handle is set to 256 pages. > > > > > > This technique reduces a pressure on the global vmap lock. > > > > > > Signed-off-by: Uladzislau Rezki (Sony) <urezki@gmail.com> > > > > Why not use a llist for this? That gets rid of the need for a > > new pool_lock altogether... > > > Initially i used the llist. I have changed it because i keep track > of objects per a pool to decay it later. I do not find these locks > as contented one therefore i did not think much. Ok. I've used llist and an atomic counter to track the list length in the past. But is the list length even necessary? It seems to me that it is only used by the shrinker to determine how many objects are on the lists for scanning, and I'm not sure that's entirely necessary given the way the current global shrinker works (i.e. completely unfair to low numbered nodes due to scan loop start bias). > Anyway, i will have a look at this to see if llist is easy to go with > or not. If so i will send out a separate patch. Sounds good, it was just something that crossed my mind given the pattern of "producer adds single items, consumer detaches entire list, processes it and reattaches remainder" is a perfect match for the llist structure. Cheers, Dave.
On Fri, Jan 12, 2024 at 07:37:36AM +1100, Dave Chinner wrote: > On Thu, Jan 11, 2024 at 04:54:48PM +0100, Uladzislau Rezki wrote: > > On Thu, Jan 11, 2024 at 08:02:16PM +1100, Dave Chinner wrote: > > > On Tue, Jan 02, 2024 at 07:46:29PM +0100, Uladzislau Rezki (Sony) wrote: > > > > Concurrent access to a global vmap space is a bottle-neck. > > > > We can simulate a high contention by running a vmalloc test > > > > suite. > > > > > > > > To address it, introduce an effective vmap node logic. Each > > > > node behaves as independent entity. When a node is accessed > > > > it serves a request directly(if possible) from its pool. > > > > > > > > This model has a size based pool for requests, i.e. pools are > > > > serialized and populated based on object size and real demand. > > > > A maximum object size that pool can handle is set to 256 pages. > > > > > > > > This technique reduces a pressure on the global vmap lock. > > > > > > > > Signed-off-by: Uladzislau Rezki (Sony) <urezki@gmail.com> > > > > > > Why not use a llist for this? That gets rid of the need for a > > > new pool_lock altogether... > > > > > Initially i used the llist. I have changed it because i keep track > > of objects per a pool to decay it later. I do not find these locks > > as contented one therefore i did not think much. > > Ok. I've used llist and an atomic counter to track the list length > in the past. > > But is the list length even necessary? It seems to me that it is > only used by the shrinker to determine how many objects are on the > lists for scanning, and I'm not sure that's entirely necessary given > the way the current global shrinker works (i.e. completely unfair to > low numbered nodes due to scan loop start bias). > I use the length to decay pools by certain percentage, currently it is 25%, so i need to know number of objects. It is done in the purge path. As for shrinker, once it hits us we drain pools entirely. > > Anyway, i will have a look at this to see if llist is easy to go with > > or not. If so i will send out a separate patch. > > Sounds good, it was just something that crossed my mind given the > pattern of "producer adds single items, consumer detaches entire > list, processes it and reattaches remainder" is a perfect match for > the llist structure. > The llist_del_first() has to be serialized. For this purpose a per-cpu pool would work or kind of "in_use" atomic that protects concurrent removing. If we detach entire llist, then we need to keep track of last node to add it later as a "batch" to already existing/populated list. Thanks four looking! -- Uladzislau Rezki
On Fri, Jan 12, 2024 at 01:18:27PM +0100, Uladzislau Rezki wrote: > On Fri, Jan 12, 2024 at 07:37:36AM +1100, Dave Chinner wrote: > > On Thu, Jan 11, 2024 at 04:54:48PM +0100, Uladzislau Rezki wrote: > > > On Thu, Jan 11, 2024 at 08:02:16PM +1100, Dave Chinner wrote: > > > > On Tue, Jan 02, 2024 at 07:46:29PM +0100, Uladzislau Rezki (Sony) wrote: > > > > > Concurrent access to a global vmap space is a bottle-neck. > > > > > We can simulate a high contention by running a vmalloc test > > > > > suite. > > > > > > > > > > To address it, introduce an effective vmap node logic. Each > > > > > node behaves as independent entity. When a node is accessed > > > > > it serves a request directly(if possible) from its pool. > > > > > > > > > > This model has a size based pool for requests, i.e. pools are > > > > > serialized and populated based on object size and real demand. > > > > > A maximum object size that pool can handle is set to 256 pages. > > > > > > > > > > This technique reduces a pressure on the global vmap lock. > > > > > > > > > > Signed-off-by: Uladzislau Rezki (Sony) <urezki@gmail.com> > > > > > > > > Why not use a llist for this? That gets rid of the need for a > > > > new pool_lock altogether... > > > > > > > Initially i used the llist. I have changed it because i keep track > > > of objects per a pool to decay it later. I do not find these locks > > > as contented one therefore i did not think much. > > > > Ok. I've used llist and an atomic counter to track the list length > > in the past. > > > > But is the list length even necessary? It seems to me that it is > > only used by the shrinker to determine how many objects are on the > > lists for scanning, and I'm not sure that's entirely necessary given > > the way the current global shrinker works (i.e. completely unfair to > > low numbered nodes due to scan loop start bias). > > > I use the length to decay pools by certain percentage, currently it is > 25%, so i need to know number of objects. It is done in the purge path. > As for shrinker, once it hits us we drain pools entirely. Why does purge need to be different to shrinking? But, regardless, you can still use llist with an atomic counter to do this - there is no need for a spin lock at all. > > > Anyway, i will have a look at this to see if llist is easy to go with > > > or not. If so i will send out a separate patch. > > > > Sounds good, it was just something that crossed my mind given the > > pattern of "producer adds single items, consumer detaches entire > > list, processes it and reattaches remainder" is a perfect match for > > the llist structure. > > > The llist_del_first() has to be serialized. For this purpose a per-cpu > pool would work or kind of "in_use" atomic that protects concurrent > removing. So don't use llist_del_first(). > If we detach entire llist, then we need to keep track of last node > to add it later as a "batch" to already existing/populated list. Why? I haven't see any need for ordering these lists which would requiring strict tail-add ordered semantics. Cheers, Dave.
On Wed, Jan 17, 2024 at 09:12:26AM +1100, Dave Chinner wrote: > On Fri, Jan 12, 2024 at 01:18:27PM +0100, Uladzislau Rezki wrote: > > On Fri, Jan 12, 2024 at 07:37:36AM +1100, Dave Chinner wrote: > > > On Thu, Jan 11, 2024 at 04:54:48PM +0100, Uladzislau Rezki wrote: > > > > On Thu, Jan 11, 2024 at 08:02:16PM +1100, Dave Chinner wrote: > > > > > On Tue, Jan 02, 2024 at 07:46:29PM +0100, Uladzislau Rezki (Sony) wrote: > > > > > > Concurrent access to a global vmap space is a bottle-neck. > > > > > > We can simulate a high contention by running a vmalloc test > > > > > > suite. > > > > > > > > > > > > To address it, introduce an effective vmap node logic. Each > > > > > > node behaves as independent entity. When a node is accessed > > > > > > it serves a request directly(if possible) from its pool. > > > > > > > > > > > > This model has a size based pool for requests, i.e. pools are > > > > > > serialized and populated based on object size and real demand. > > > > > > A maximum object size that pool can handle is set to 256 pages. > > > > > > > > > > > > This technique reduces a pressure on the global vmap lock. > > > > > > > > > > > > Signed-off-by: Uladzislau Rezki (Sony) <urezki@gmail.com> > > > > > > > > > > Why not use a llist for this? That gets rid of the need for a > > > > > new pool_lock altogether... > > > > > > > > > Initially i used the llist. I have changed it because i keep track > > > > of objects per a pool to decay it later. I do not find these locks > > > > as contented one therefore i did not think much. > > > > > > Ok. I've used llist and an atomic counter to track the list length > > > in the past. > > > > > > But is the list length even necessary? It seems to me that it is > > > only used by the shrinker to determine how many objects are on the > > > lists for scanning, and I'm not sure that's entirely necessary given > > > the way the current global shrinker works (i.e. completely unfair to > > > low numbered nodes due to scan loop start bias). > > > > > I use the length to decay pools by certain percentage, currently it is > > 25%, so i need to know number of objects. It is done in the purge path. > > As for shrinker, once it hits us we drain pools entirely. > > Why does purge need to be different to shrinking? > > But, regardless, you can still use llist with an atomic counter to > do this - there is no need for a spin lock at all. > As i pointed earlier, i will have a look at it. > > > > Anyway, i will have a look at this to see if llist is easy to go with > > > > or not. If so i will send out a separate patch. > > > > > > Sounds good, it was just something that crossed my mind given the > > > pattern of "producer adds single items, consumer detaches entire > > > list, processes it and reattaches remainder" is a perfect match for > > > the llist structure. > > > > > The llist_del_first() has to be serialized. For this purpose a per-cpu > > pool would work or kind of "in_use" atomic that protects concurrent > > removing. > > So don't use llist_del_first(). > > > If we detach entire llist, then we need to keep track of last node > > to add it later as a "batch" to already existing/populated list. > > Why? I haven't see any need for ordering these lists which would > requiring strict tail-add ordered semantics. > I mean the following: 1. first = llist_del_all(&example); 2. last = llist_reverse_order(first); 4. va = __llist_del_first(first); /* * "example" might not be empty, use the batch. Otherwise * we loose the entries "example" pointed to. */ 3. 
llist_add_batch(first, last, &example);

--
Uladzislau Rezki
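For reference, the pattern being debated above can be sketched roughly as follows. This is only an illustration of the suggestion (an llist plus an atomic length counter), not code from this series; demo_node, demo_va, demo_release() and the other names are made up for the example.

```c
#include <linux/llist.h>
#include <linux/atomic.h>
#include <linux/slab.h>

struct demo_va {
	struct llist_node lnode;
	/* ... payload ... */
};

struct demo_node {
	struct llist_head pool;	/* lockless pool of free objects */
	atomic_t len;		/* tracked length, used by the decay logic */
};

static void demo_release(struct demo_va *va)
{
	kfree(va);	/* stand-in for merging the VA back into the global tree */
}

/* Producer side: return one object to the pool, no spinlock required. */
static void demo_pool_add(struct demo_node *n, struct demo_va *va)
{
	llist_add(&va->lnode, &n->pool);
	atomic_inc(&n->len);
}

/*
 * Consumer side: detach the whole list in one atomic xchg, drop part of
 * it, then reattach the survivors as a single batch. Only
 * llist_del_first() needs external serialization; llist_del_all() and
 * llist_add() may race freely, so no pool_lock is needed here.
 */
static void demo_pool_decay(struct demo_node *n, unsigned int nr_to_drop)
{
	struct llist_node *head, *keep_first = NULL, *keep_last = NULL;
	struct demo_va *va, *tmp;

	/* Returns entries newest-first; llist_reverse_order() could restore
	 * FIFO order if that mattered, which it does not here. */
	head = llist_del_all(&n->pool);
	if (!head)
		return;

	llist_for_each_entry_safe(va, tmp, head, lnode) {
		if (nr_to_drop) {
			nr_to_drop--;
			atomic_dec(&n->len);
			demo_release(va);
			continue;
		}

		/* Survivor: push it onto a private chain, remembering the tail. */
		va->lnode.next = keep_first;
		keep_first = &va->lnode;
		if (!keep_last)
			keep_last = &va->lnode;
	}

	/*
	 * Reattach the remainder in one go. Concurrent demo_pool_add()
	 * calls are not lost: they simply end up linked behind the batch,
	 * and no particular ordering is required anyway.
	 */
	if (keep_first)
		llist_add_batch(keep_first, keep_last, &n->pool);
}
```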
On 01/02/24 at 07:46pm, Uladzislau Rezki (Sony) wrote: ...... > +static struct vmap_area * > +node_alloc(unsigned long size, unsigned long align, > + unsigned long vstart, unsigned long vend, > + unsigned long *addr, unsigned int *vn_id) > +{ > + struct vmap_area *va; > + > + *vn_id = 0; > + *addr = vend; > + > + /* > + * Fallback to a global heap if not vmalloc or there > + * is only one node. > + */ > + if (vstart != VMALLOC_START || vend != VMALLOC_END || > + nr_vmap_nodes == 1) > + return NULL; > + > + *vn_id = raw_smp_processor_id() % nr_vmap_nodes; > + va = node_pool_del_va(id_to_node(*vn_id), size, align, vstart, vend); > + *vn_id = encode_vn_id(*vn_id); > + > + if (va) > + *addr = va->va_start; > + > + return va; > +} > + > /* > * Allocate a region of KVA of the specified size and alignment, within the > * vstart and vend. > @@ -1637,6 +1807,7 @@ static struct vmap_area *alloc_vmap_area(unsigned long size, > struct vmap_area *va; > unsigned long freed; > unsigned long addr; > + unsigned int vn_id; > int purged = 0; > int ret; > > @@ -1647,11 +1818,23 @@ static struct vmap_area *alloc_vmap_area(unsigned long size, > return ERR_PTR(-EBUSY); > > might_sleep(); > - gfp_mask = gfp_mask & GFP_RECLAIM_MASK; > > - va = kmem_cache_alloc_node(vmap_area_cachep, gfp_mask, node); > - if (unlikely(!va)) > - return ERR_PTR(-ENOMEM); > + /* > + * If a VA is obtained from a global heap(if it fails here) > + * it is anyway marked with this "vn_id" so it is returned > + * to this pool's node later. Such way gives a possibility > + * to populate pools based on users demand. > + * > + * On success a ready to go VA is returned. > + */ > + va = node_alloc(size, align, vstart, vend, &addr, &vn_id); Sorry for late checking. Here, if no available va got, e.g a empty vp, still we will get an effective vn_id with the current cpu_id for VMALLOC region allocation request. > + if (!va) { > + gfp_mask = gfp_mask & GFP_RECLAIM_MASK; > + > + va = kmem_cache_alloc_node(vmap_area_cachep, gfp_mask, node); > + if (unlikely(!va)) > + return ERR_PTR(-ENOMEM); > + } > > /* > * Only scan the relevant parts containing pointers to other objects > @@ -1660,10 +1843,12 @@ static struct vmap_area *alloc_vmap_area(unsigned long size, > kmemleak_scan_area(&va->rb_node, SIZE_MAX, gfp_mask); > > retry: > - preload_this_cpu_lock(&free_vmap_area_lock, gfp_mask, node); > - addr = __alloc_vmap_area(&free_vmap_area_root, &free_vmap_area_list, > - size, align, vstart, vend); > - spin_unlock(&free_vmap_area_lock); > + if (addr == vend) { > + preload_this_cpu_lock(&free_vmap_area_lock, gfp_mask, node); > + addr = __alloc_vmap_area(&free_vmap_area_root, &free_vmap_area_list, > + size, align, vstart, vend); Then, here, we will get an available va from random location, but its vn_id is from the current cpu. Then in purge_vmap_node(), we will decode the vn_id stored in va->flags, and add the relevant va into vn->pool[] according to the vn_id. The worst case could be most of va in vn->pool[] are not corresponding to the vmap_nodes they belongs to. It doesn't matter? Should we adjust the code of vn_id assigning in node_alloc(), or I missed anything? 
> + spin_unlock(&free_vmap_area_lock); > + } > > trace_alloc_vmap_area(addr, size, align, vstart, vend, addr == vend); > > @@ -1677,7 +1862,7 @@ static struct vmap_area *alloc_vmap_area(unsigned long size, > va->va_start = addr; > va->va_end = addr + size; > va->vm = NULL; > - va->flags = va_flags; > + va->flags = (va_flags | vn_id); > > vn = addr_to_node(va->va_start); > > @@ -1770,63 +1955,135 @@ static DEFINE_MUTEX(vmap_purge_lock); > static void purge_fragmented_blocks_allcpus(void); > static cpumask_t purge_nodes; > > -/* > - * Purges all lazily-freed vmap areas. > - */ > -static unsigned long > -purge_vmap_node(struct vmap_node *vn) > +static void > +reclaim_list_global(struct list_head *head) > { > - unsigned long num_purged_areas = 0; > - struct vmap_area *va, *n_va; > + struct vmap_area *va, *n; > > - if (list_empty(&vn->purge_list)) > - return 0; > + if (list_empty(head)) > + return; > > spin_lock(&free_vmap_area_lock); > + list_for_each_entry_safe(va, n, head, list) > + merge_or_add_vmap_area_augment(va, > + &free_vmap_area_root, &free_vmap_area_list); > + spin_unlock(&free_vmap_area_lock); > +} > + > +static void > +decay_va_pool_node(struct vmap_node *vn, bool full_decay) > +{ > + struct vmap_area *va, *nva; > + struct list_head decay_list; > + struct rb_root decay_root; > + unsigned long n_decay; > + int i; > + > + decay_root = RB_ROOT; > + INIT_LIST_HEAD(&decay_list); > + > + for (i = 0; i < MAX_VA_SIZE_PAGES; i++) { > + struct list_head tmp_list; > + > + if (list_empty(&vn->pool[i].head)) > + continue; > + > + INIT_LIST_HEAD(&tmp_list); > + > + /* Detach the pool, so no-one can access it. */ > + spin_lock(&vn->pool_lock); > + list_replace_init(&vn->pool[i].head, &tmp_list); > + spin_unlock(&vn->pool_lock); > + > + if (full_decay) > + WRITE_ONCE(vn->pool[i].len, 0); > + > + /* Decay a pool by ~25% out of left objects. */ > + n_decay = vn->pool[i].len >> 2; > + > + list_for_each_entry_safe(va, nva, &tmp_list, list) { > + list_del_init(&va->list); > + merge_or_add_vmap_area(va, &decay_root, &decay_list); > + > + if (!full_decay) { > + WRITE_ONCE(vn->pool[i].len, vn->pool[i].len - 1); > + > + if (!--n_decay) > + break; > + } > + } > + > + /* Attach the pool back if it has been partly decayed. */ > + if (!full_decay && !list_empty(&tmp_list)) { > + spin_lock(&vn->pool_lock); > + list_replace_init(&tmp_list, &vn->pool[i].head); > + spin_unlock(&vn->pool_lock); > + } > + } > + > + reclaim_list_global(&decay_list); > +} > + > +static void purge_vmap_node(struct work_struct *work) > +{ > + struct vmap_node *vn = container_of(work, > + struct vmap_node, purge_work); > + struct vmap_area *va, *n_va; > + LIST_HEAD(local_list); > + > + vn->nr_purged = 0; > + > list_for_each_entry_safe(va, n_va, &vn->purge_list, list) { > unsigned long nr = (va->va_end - va->va_start) >> PAGE_SHIFT; > unsigned long orig_start = va->va_start; > unsigned long orig_end = va->va_end; > + unsigned int vn_id = decode_vn_id(va->flags); > > - /* > - * Finally insert or merge lazily-freed area. It is > - * detached and there is no need to "unlink" it from > - * anything. 
> - */ > - va = merge_or_add_vmap_area_augment(va, &free_vmap_area_root, > - &free_vmap_area_list); > - > - if (!va) > - continue; > + list_del_init(&va->list); > > if (is_vmalloc_or_module_addr((void *)orig_start)) > kasan_release_vmalloc(orig_start, orig_end, > va->va_start, va->va_end); > > atomic_long_sub(nr, &vmap_lazy_nr); > - num_purged_areas++; > + vn->nr_purged++; > + > + if (is_vn_id_valid(vn_id) && !vn->skip_populate) > + if (node_pool_add_va(vn, va)) > + continue; > + > + /* Go back to global. */ > + list_add(&va->list, &local_list); > } > - spin_unlock(&free_vmap_area_lock); > > - return num_purged_areas; > + reclaim_list_global(&local_list); > } > > /* > * Purges all lazily-freed vmap areas. > */ > -static bool __purge_vmap_area_lazy(unsigned long start, unsigned long end) > +static bool __purge_vmap_area_lazy(unsigned long start, unsigned long end, > + bool full_pool_decay) > { > - unsigned long num_purged_areas = 0; > + unsigned long nr_purged_areas = 0; > + unsigned int nr_purge_helpers; > + unsigned int nr_purge_nodes; > struct vmap_node *vn; > int i; > > lockdep_assert_held(&vmap_purge_lock); > + > + /* > + * Use cpumask to mark which node has to be processed. > + */ > purge_nodes = CPU_MASK_NONE; > > for (i = 0; i < nr_vmap_nodes; i++) { > vn = &vmap_nodes[i]; > > INIT_LIST_HEAD(&vn->purge_list); > + vn->skip_populate = full_pool_decay; > + decay_va_pool_node(vn, full_pool_decay); > > if (RB_EMPTY_ROOT(&vn->lazy.root)) > continue; > @@ -1845,17 +2102,45 @@ static bool __purge_vmap_area_lazy(unsigned long start, unsigned long end) > cpumask_set_cpu(i, &purge_nodes); > } > > - if (cpumask_weight(&purge_nodes) > 0) { > + nr_purge_nodes = cpumask_weight(&purge_nodes); > + if (nr_purge_nodes > 0) { > flush_tlb_kernel_range(start, end); > > + /* One extra worker is per a lazy_max_pages() full set minus one. 
*/ > + nr_purge_helpers = atomic_long_read(&vmap_lazy_nr) / lazy_max_pages(); > + nr_purge_helpers = clamp(nr_purge_helpers, 1U, nr_purge_nodes) - 1; > + > for_each_cpu(i, &purge_nodes) { > - vn = &nodes[i]; > - num_purged_areas += purge_vmap_node(vn); > + vn = &vmap_nodes[i]; > + > + if (nr_purge_helpers > 0) { > + INIT_WORK(&vn->purge_work, purge_vmap_node); > + > + if (cpumask_test_cpu(i, cpu_online_mask)) > + schedule_work_on(i, &vn->purge_work); > + else > + schedule_work(&vn->purge_work); > + > + nr_purge_helpers--; > + } else { > + vn->purge_work.func = NULL; > + purge_vmap_node(&vn->purge_work); > + nr_purged_areas += vn->nr_purged; > + } > + } > + > + for_each_cpu(i, &purge_nodes) { > + vn = &vmap_nodes[i]; > + > + if (vn->purge_work.func) { > + flush_work(&vn->purge_work); > + nr_purged_areas += vn->nr_purged; > + } > } > } > > - trace_purge_vmap_area_lazy(start, end, num_purged_areas); > - return num_purged_areas > 0; > + trace_purge_vmap_area_lazy(start, end, nr_purged_areas); > + return nr_purged_areas > 0; > } > > /* > @@ -1866,14 +2151,14 @@ static void reclaim_and_purge_vmap_areas(void) > { > mutex_lock(&vmap_purge_lock); > purge_fragmented_blocks_allcpus(); > - __purge_vmap_area_lazy(ULONG_MAX, 0); > + __purge_vmap_area_lazy(ULONG_MAX, 0, true); > mutex_unlock(&vmap_purge_lock); > } > > static void drain_vmap_area_work(struct work_struct *work) > { > mutex_lock(&vmap_purge_lock); > - __purge_vmap_area_lazy(ULONG_MAX, 0); > + __purge_vmap_area_lazy(ULONG_MAX, 0, false); > mutex_unlock(&vmap_purge_lock); > } > > @@ -1884,9 +2169,10 @@ static void drain_vmap_area_work(struct work_struct *work) > */ > static void free_vmap_area_noflush(struct vmap_area *va) > { > - struct vmap_node *vn = addr_to_node(va->va_start); > unsigned long nr_lazy_max = lazy_max_pages(); > unsigned long va_start = va->va_start; > + unsigned int vn_id = decode_vn_id(va->flags); > + struct vmap_node *vn; > unsigned long nr_lazy; > > if (WARN_ON_ONCE(!list_empty(&va->list))) > @@ -1896,10 +2182,14 @@ static void free_vmap_area_noflush(struct vmap_area *va) > PAGE_SHIFT, &vmap_lazy_nr); > > /* > - * Merge or place it to the purge tree/list. > + * If it was request by a certain node we would like to > + * return it to that node, i.e. its pool for later reuse. > */ > + vn = is_vn_id_valid(vn_id) ? > + id_to_node(vn_id):addr_to_node(va->va_start); > + > spin_lock(&vn->lazy.lock); > - merge_or_add_vmap_area(va, &vn->lazy.root, &vn->lazy.head); > + insert_vmap_area(va, &vn->lazy.root, &vn->lazy.head); > spin_unlock(&vn->lazy.lock); > > trace_free_vmap_area_noflush(va_start, nr_lazy, nr_lazy_max); > @@ -2408,7 +2698,7 @@ static void _vm_unmap_aliases(unsigned long start, unsigned long end, int flush) > } > free_purged_blocks(&purge_list); > > - if (!__purge_vmap_area_lazy(start, end) && flush) > + if (!__purge_vmap_area_lazy(start, end, false) && flush) > flush_tlb_kernel_range(start, end); > mutex_unlock(&vmap_purge_lock); > } > @@ -4576,7 +4866,7 @@ static void vmap_init_free_space(void) > static void vmap_init_nodes(void) > { > struct vmap_node *vn; > - int i; > + int i, j; > > for (i = 0; i < nr_vmap_nodes; i++) { > vn = &vmap_nodes[i]; > @@ -4587,6 +4877,13 @@ static void vmap_init_nodes(void) > vn->lazy.root = RB_ROOT; > INIT_LIST_HEAD(&vn->lazy.head); > spin_lock_init(&vn->lazy.lock); > + > + for (j = 0; j < MAX_VA_SIZE_PAGES; j++) { > + INIT_LIST_HEAD(&vn->pool[j].head); > + WRITE_ONCE(vn->pool[j].len, 0); > + } > + > + spin_lock_init(&vn->pool_lock); > } > } > > -- > 2.39.2 >
On Thu, Feb 08, 2024 at 08:25:23AM +0800, Baoquan He wrote: > On 01/02/24 at 07:46pm, Uladzislau Rezki (Sony) wrote: > ...... > > +static struct vmap_area * > > +node_alloc(unsigned long size, unsigned long align, > > + unsigned long vstart, unsigned long vend, > > + unsigned long *addr, unsigned int *vn_id) > > +{ > > + struct vmap_area *va; > > + > > + *vn_id = 0; > > + *addr = vend; > > + > > + /* > > + * Fallback to a global heap if not vmalloc or there > > + * is only one node. > > + */ > > + if (vstart != VMALLOC_START || vend != VMALLOC_END || > > + nr_vmap_nodes == 1) > > + return NULL; > > + > > + *vn_id = raw_smp_processor_id() % nr_vmap_nodes; > > + va = node_pool_del_va(id_to_node(*vn_id), size, align, vstart, vend); > > + *vn_id = encode_vn_id(*vn_id); > > + > > + if (va) > > + *addr = va->va_start; > > + > > + return va; > > +} > > + > > /* > > * Allocate a region of KVA of the specified size and alignment, within the > > * vstart and vend. > > @@ -1637,6 +1807,7 @@ static struct vmap_area *alloc_vmap_area(unsigned long size, > > struct vmap_area *va; > > unsigned long freed; > > unsigned long addr; > > + unsigned int vn_id; > > int purged = 0; > > int ret; > > > > @@ -1647,11 +1818,23 @@ static struct vmap_area *alloc_vmap_area(unsigned long size, > > return ERR_PTR(-EBUSY); > > > > might_sleep(); > > - gfp_mask = gfp_mask & GFP_RECLAIM_MASK; > > > > - va = kmem_cache_alloc_node(vmap_area_cachep, gfp_mask, node); > > - if (unlikely(!va)) > > - return ERR_PTR(-ENOMEM); > > + /* > > + * If a VA is obtained from a global heap(if it fails here) > > + * it is anyway marked with this "vn_id" so it is returned > > + * to this pool's node later. Such way gives a possibility > > + * to populate pools based on users demand. > > + * > > + * On success a ready to go VA is returned. > > + */ > > + va = node_alloc(size, align, vstart, vend, &addr, &vn_id); > > Sorry for late checking. > No problem :) > Here, if no available va got, e.g a empty vp, still we will get an > effective vn_id with the current cpu_id for VMALLOC region allocation > request. > > > + if (!va) { > > + gfp_mask = gfp_mask & GFP_RECLAIM_MASK; > > + > > + va = kmem_cache_alloc_node(vmap_area_cachep, gfp_mask, node); > > + if (unlikely(!va)) > > + return ERR_PTR(-ENOMEM); > > + } > > > > /* > > * Only scan the relevant parts containing pointers to other objects > > @@ -1660,10 +1843,12 @@ static struct vmap_area *alloc_vmap_area(unsigned long size, > > kmemleak_scan_area(&va->rb_node, SIZE_MAX, gfp_mask); > > > > retry: > > - preload_this_cpu_lock(&free_vmap_area_lock, gfp_mask, node); > > - addr = __alloc_vmap_area(&free_vmap_area_root, &free_vmap_area_list, > > - size, align, vstart, vend); > > - spin_unlock(&free_vmap_area_lock); > > + if (addr == vend) { > > + preload_this_cpu_lock(&free_vmap_area_lock, gfp_mask, node); > > + addr = __alloc_vmap_area(&free_vmap_area_root, &free_vmap_area_list, > > + size, align, vstart, vend); > > Then, here, we will get an available va from random location, but its > vn_id is from the current cpu. > > Then in purge_vmap_node(), we will decode the vn_id stored in va->flags, > and add the relevant va into vn->pool[] according to the vn_id. The > worst case could be most of va in vn->pool[] are not corresponding to > the vmap_nodes they belongs to. It doesn't matter? > We do not do any "in-front" population, instead it behaves as a cache miss when you need to access a main memmory to do a load and then keep the data in a cache. Same here. 
As a first step, for a CPU it always a miss, thus a VA is obtained from the global heap and is marked with a current CPU that makes an attempt to alloc. Later on that CPU/node is populated by that marked VA. So second alloc on same CPU goes via fast path. VAs are populated based on demand and those nodes which do allocations. > Should we adjust the code of vn_id assigning in node_alloc(), or I missed anything? Now it is open-coded. Some further refactoring should be done. Agree. -- Uladzislau Rezki
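A compact sketch of the "miss then populate" round trip described above. It is a simplified paraphrase of the quoted patch, not a literal excerpt: locking, error handling and rb-tree insertion are omitted, and demo_alloc_va(), demo_free_va(), alloc_from_global_heap() and queue_for_lazy_purge() are assumed names introduced only for illustration.

```c
/* Assumed helpers, not part of the patch: */
static struct vmap_area *alloc_from_global_heap(unsigned long size,
		unsigned long align, unsigned long vstart, unsigned long vend);
static void queue_for_lazy_purge(struct vmap_node *vn, struct vmap_area *va);

static struct vmap_area *demo_alloc_va(unsigned long size, unsigned long align,
				       unsigned long vstart, unsigned long vend)
{
	unsigned int node_id = raw_smp_processor_id() % nr_vmap_nodes;
	struct vmap_area *va;

	/* Fast path: reuse a cached VA of this size from the CPU's node. */
	va = node_pool_del_va(id_to_node(node_id), size, align, vstart, vend);
	if (!va) {
		/* Miss: fall back to the global path, as before the series. */
		va = alloc_from_global_heap(size, align, vstart, vend);
	}

	/* Either way, tag the VA with the node that requested it. */
	va->flags |= encode_vn_id(node_id);
	return va;
}

static void demo_free_va(struct vmap_area *va)
{
	unsigned int node_id = decode_vn_id(va->flags);
	struct vmap_node *vn = is_vn_id_valid(node_id) ?
			id_to_node(node_id) : addr_to_node(va->va_start);

	/*
	 * The lazy-free/purge path eventually hands the VA to
	 * node_pool_add_va(vn, ...), so a second request of this size on
	 * the same CPU is served from the local pool.
	 */
	queue_for_lazy_purge(vn, va);
}
```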
On 01/02/24 at 07:46pm, Uladzislau Rezki (Sony) wrote: .....snip... > +static void > +decay_va_pool_node(struct vmap_node *vn, bool full_decay) > +{ > + struct vmap_area *va, *nva; > + struct list_head decay_list; > + struct rb_root decay_root; > + unsigned long n_decay; > + int i; > + > + decay_root = RB_ROOT; > + INIT_LIST_HEAD(&decay_list); > + > + for (i = 0; i < MAX_VA_SIZE_PAGES; i++) { > + struct list_head tmp_list; > + > + if (list_empty(&vn->pool[i].head)) > + continue; > + > + INIT_LIST_HEAD(&tmp_list); > + > + /* Detach the pool, so no-one can access it. */ > + spin_lock(&vn->pool_lock); > + list_replace_init(&vn->pool[i].head, &tmp_list); > + spin_unlock(&vn->pool_lock); > + > + if (full_decay) > + WRITE_ONCE(vn->pool[i].len, 0); > + > + /* Decay a pool by ~25% out of left objects. */ This isn't true if the pool has less than 4 objects. If there are 3 objects, n_decay = 0. > + n_decay = vn->pool[i].len >> 2; > + > + list_for_each_entry_safe(va, nva, &tmp_list, list) { > + list_del_init(&va->list); > + merge_or_add_vmap_area(va, &decay_root, &decay_list); > + > + if (!full_decay) { > + WRITE_ONCE(vn->pool[i].len, vn->pool[i].len - 1); > + > + if (!--n_decay) > + break; Here, --n_decay will make n_decay 0xffffffffffffffff, then all left objects are reclaimed. > + } > + } > + > + /* Attach the pool back if it has been partly decayed. */ > + if (!full_decay && !list_empty(&tmp_list)) { > + spin_lock(&vn->pool_lock); > + list_replace_init(&tmp_list, &vn->pool[i].head); > + spin_unlock(&vn->pool_lock); > + } > + } > + > + reclaim_list_global(&decay_list); > +} ......snip
On Wed, Feb 28, 2024 at 05:48:53PM +0800, Baoquan He wrote: > On 01/02/24 at 07:46pm, Uladzislau Rezki (Sony) wrote: > .....snip... > > +static void > > +decay_va_pool_node(struct vmap_node *vn, bool full_decay) > > +{ > > + struct vmap_area *va, *nva; > > + struct list_head decay_list; > > + struct rb_root decay_root; > > + unsigned long n_decay; > > + int i; > > + > > + decay_root = RB_ROOT; > > + INIT_LIST_HEAD(&decay_list); > > + > > + for (i = 0; i < MAX_VA_SIZE_PAGES; i++) { > > + struct list_head tmp_list; > > + > > + if (list_empty(&vn->pool[i].head)) > > + continue; > > + > > + INIT_LIST_HEAD(&tmp_list); > > + > > + /* Detach the pool, so no-one can access it. */ > > + spin_lock(&vn->pool_lock); > > + list_replace_init(&vn->pool[i].head, &tmp_list); > > + spin_unlock(&vn->pool_lock); > > + > > + if (full_decay) > > + WRITE_ONCE(vn->pool[i].len, 0); > > + > > + /* Decay a pool by ~25% out of left objects. */ > > This isn't true if the pool has less than 4 objects. If there are 3 > objects, n_decay = 0. > This is expectable. > > + n_decay = vn->pool[i].len >> 2; > > + > > + list_for_each_entry_safe(va, nva, &tmp_list, list) { > > + list_del_init(&va->list); > > + merge_or_add_vmap_area(va, &decay_root, &decay_list); > > + > > + if (!full_decay) { > > + WRITE_ONCE(vn->pool[i].len, vn->pool[i].len - 1); > > + > > + if (!--n_decay) > > + break; > > Here, --n_decay will make n_decay 0xffffffffffffffff, > then all left objects are reclaimed. Right. Last three objects do not play a big game. -- Uladzislau Rezki
On 02/28/24 at 11:39am, Uladzislau Rezki wrote: > On Wed, Feb 28, 2024 at 05:48:53PM +0800, Baoquan He wrote: > > On 01/02/24 at 07:46pm, Uladzislau Rezki (Sony) wrote: > > .....snip... > > > +static void > > > +decay_va_pool_node(struct vmap_node *vn, bool full_decay) > > > +{ > > > + struct vmap_area *va, *nva; > > > + struct list_head decay_list; > > > + struct rb_root decay_root; > > > + unsigned long n_decay; > > > + int i; > > > + > > > + decay_root = RB_ROOT; > > > + INIT_LIST_HEAD(&decay_list); > > > + > > > + for (i = 0; i < MAX_VA_SIZE_PAGES; i++) { > > > + struct list_head tmp_list; > > > + > > > + if (list_empty(&vn->pool[i].head)) > > > + continue; > > > + > > > + INIT_LIST_HEAD(&tmp_list); > > > + > > > + /* Detach the pool, so no-one can access it. */ > > > + spin_lock(&vn->pool_lock); > > > + list_replace_init(&vn->pool[i].head, &tmp_list); > > > + spin_unlock(&vn->pool_lock); > > > + > > > + if (full_decay) > > > + WRITE_ONCE(vn->pool[i].len, 0); > > > + > > > + /* Decay a pool by ~25% out of left objects. */ > > > > This isn't true if the pool has less than 4 objects. If there are 3 > > objects, n_decay = 0. > > > This is expectable. > > > > + n_decay = vn->pool[i].len >> 2; > > > + > > > + list_for_each_entry_safe(va, nva, &tmp_list, list) { > > > + list_del_init(&va->list); > > > + merge_or_add_vmap_area(va, &decay_root, &decay_list); > > > + > > > + if (!full_decay) { > > > + WRITE_ONCE(vn->pool[i].len, vn->pool[i].len - 1); > > > + > > > + if (!--n_decay) > > > + break; > > > > Here, --n_decay will make n_decay 0xffffffffffffffff, > > then all left objects are reclaimed. > Right. Last three objects do not play a big game. See it now, thanks.
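A standalone demonstration of the corner case discussed in this subthread: with fewer than four objects, len >> 2 is 0, the unsigned pre-decrement wraps, so the "break" never fires and the whole small pool is decayed (which the author considers acceptable). This mirrors the quoted logic only; it is not kernel code.

```c
#include <stdio.h>

int main(void)
{
	unsigned long len = 3;			/* a pool with 3 objects */
	unsigned long n_decay = len >> 2;	/* 3 / 4 == 0 */
	unsigned long decayed = 0;

	for (unsigned long i = 0; i < len; i++) {
		decayed++;
		if (!--n_decay)			/* 0 - 1 wraps to ULONG_MAX */
			break;
	}

	/* Prints: decayed 3 of 3 objects. */
	printf("decayed %lu of %lu objects\n", decayed, len);
	return 0;
}
```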
Hi, On Tue, Jan 02, 2024 at 07:46:29PM +0100, Uladzislau Rezki (Sony) wrote: > Concurrent access to a global vmap space is a bottle-neck. > We can simulate a high contention by running a vmalloc test > suite. > > To address it, introduce an effective vmap node logic. Each > node behaves as independent entity. When a node is accessed > it serves a request directly(if possible) from its pool. > > This model has a size based pool for requests, i.e. pools are > serialized and populated based on object size and real demand. > A maximum object size that pool can handle is set to 256 pages. > > This technique reduces a pressure on the global vmap lock. > > Signed-off-by: Uladzislau Rezki (Sony) <urezki@gmail.com> This patch results in a persistent "spinlock bad magic" message when booting s390 images with spinlock debugging enabled. [ 0.465445] BUG: spinlock bad magic on CPU#0, swapper/0 [ 0.465490] lock: single+0x1860/0x1958, .magic: 00000000, .owner: <none>/-1, .owner_cpu: 0 [ 0.466067] CPU: 0 PID: 0 Comm: swapper Not tainted 6.8.0-12955-g8e938e398669 #1 [ 0.466188] Hardware name: QEMU 8561 QEMU (KVM/Linux) [ 0.466270] Call Trace: [ 0.466470] [<00000000011f26c8>] dump_stack_lvl+0x98/0xd8 [ 0.466516] [<00000000001dcc6a>] do_raw_spin_lock+0x8a/0x108 [ 0.466545] [<000000000042146c>] find_vmap_area+0x6c/0x108 [ 0.466572] [<000000000042175a>] find_vm_area+0x22/0x40 [ 0.466597] [<000000000012f152>] __set_memory+0x132/0x150 [ 0.466624] [<0000000001cc0398>] vmem_map_init+0x40/0x118 [ 0.466651] [<0000000001cc0092>] paging_init+0x22/0x68 [ 0.466677] [<0000000001cbbed2>] setup_arch+0x52a/0x708 [ 0.466702] [<0000000001cb6140>] start_kernel+0x80/0x5c8 [ 0.466727] [<0000000000100036>] startup_continue+0x36/0x40 Bisect results and decoded stacktrace below. The uninitialized spinlock is &vn->busy.lock. Debugging shows that this lock is actually never initialized. [ 0.464684] ####### locking 0000000002280fb8 [ 0.464862] BUG: spinlock bad magic on CPU#0, swapper/0 ... [ 0.464684] ####### locking 0000000002280fb8 [ 0.477479] ####### locking 0000000002280fb8 [ 0.478166] ####### locking 0000000002280fb8 [ 0.478218] ####### locking 0000000002280fb8 ... [ 0.718250] #### busy lock init 0000000002871860 [ 0.718328] #### busy lock init 00000000028731b8 Only the initialized locks are used after the call to vmap_init_nodes(). 
Guenter --- # bad: [8e938e39866920ddc266898e6ae1fffc5c8f51aa] Merge tag '6.9-rc-smb3-client-fixes-part2' of git://git.samba.org/sfrench/cifs-2.6 # good: [e8f897f4afef0031fe618a8e94127a0934896aba] Linux 6.8 git bisect start 'HEAD' 'v6.8' # good: [e56bc745fa1de77abc2ad8debc4b1b83e0426c49] smb311: additional compression flag defined in updated protocol spec git bisect good e56bc745fa1de77abc2ad8debc4b1b83e0426c49 # bad: [902861e34c401696ed9ad17a54c8790e7e8e3069] Merge tag 'mm-stable-2024-03-13-20-04' of git://git.kernel.org/pub/scm/linux/kernel/git/akpm/mm git bisect bad 902861e34c401696ed9ad17a54c8790e7e8e3069 # good: [480e035fc4c714fb5536e64ab9db04fedc89e910] Merge tag 'drm-next-2024-03-13' of https://gitlab.freedesktop.org/drm/kernel git bisect good 480e035fc4c714fb5536e64ab9db04fedc89e910 # good: [fe46a7dd189e25604716c03576d05ac8a5209743] Merge tag 'sound-6.9-rc1' of git://git.kernel.org/pub/scm/linux/kernel/git/tiwai/sound git bisect good fe46a7dd189e25604716c03576d05ac8a5209743 # bad: [435a75548109f19e5b5b14ae35b9acb063c084e9] mm: use folio more widely in __split_huge_page git bisect bad 435a75548109f19e5b5b14ae35b9acb063c084e9 # good: [4d5bf0b6183f79ea361dd506365d2a471270735c] mm/mmu_gather: add tlb_remove_tlb_entries() git bisect good 4d5bf0b6183f79ea361dd506365d2a471270735c # bad: [4daacfe8f99f4b4cef562649d56c48642981f46e] mm/damon/sysfs-schemes: support PSI-based quota auto-tune git bisect bad 4daacfe8f99f4b4cef562649d56c48642981f46e # good: [217b2119b9e260609958db413876f211038f00ee] mm,page_owner: implement the tracking of the stacks count git bisect good 217b2119b9e260609958db413876f211038f00ee # bad: [40254101d87870b2e5ac3ddc28af40aa04c48486] arm64, crash: wrap crash dumping code into crash related ifdefs git bisect bad 40254101d87870b2e5ac3ddc28af40aa04c48486 # bad: [53becf32aec1c8049b854f0c31a11df5ed75df6f] mm: vmalloc: support multiple nodes in vread_iter git bisect bad 53becf32aec1c8049b854f0c31a11df5ed75df6f # good: [7fa8cee003166ef6db0bba70d610dbf173543811] mm: vmalloc: move vmap_init_free_space() down in vmalloc.c git bisect good 7fa8cee003166ef6db0bba70d610dbf173543811 # good: [282631cb2447318e2a55b41a665dbe8571c46d70] mm: vmalloc: remove global purge_vmap_area_root rb-tree git bisect good 282631cb2447318e2a55b41a665dbe8571c46d70 # bad: [96aa8437d169b8e030a98e2b74fd9a8ee9d3be7e] mm: vmalloc: add a scan area of VA only once git bisect bad 96aa8437d169b8e030a98e2b74fd9a8ee9d3be7e # bad: [72210662c5a2b6005f6daea7fe293a0dc573e1a5] mm: vmalloc: offload free_vmap_area_lock lock git bisect bad 72210662c5a2b6005f6daea7fe293a0dc573e1a5 # first bad commit: [72210662c5a2b6005f6daea7fe293a0dc573e1a5] mm: vmalloc: offload free_vmap_area_lock lock --- [ 0.465490] lock: single+0x1860/0x1958, .magic: 00000000, .owner: <none>/-1, .owner_cpu: 0 [ 0.466067] CPU: 0 PID: 0 Comm: swapper Not tainted 6.8.0-12955-g8e938e398669 #1 [ 0.466188] Hardware name: QEMU 8561 QEMU (KVM/Linux) [ 0.466270] Call Trace: [ 0.466470] dump_stack_lvl (lib/dump_stack.c:117) [ 0.466516] do_raw_spin_lock (kernel/locking/spinlock_debug.c:87 kernel/locking/spinlock_debug.c:115) [ 0.466545] find_vmap_area (mm/vmalloc.c:1059 mm/vmalloc.c:2364) [ 0.466572] find_vm_area (mm/vmalloc.c:3150) [ 0.466597] __set_memory (arch/s390/mm/pageattr.c:360 arch/s390/mm/pageattr.c:393) [ 0.466624] vmem_map_init (./arch/s390/include/asm/set_memory.h:55 arch/s390/mm/vmem.c:660) [ 0.466651] paging_init (arch/s390/mm/init.c:97) [ 0.466677] setup_arch (arch/s390/kernel/setup.c:972) [ 0.466702] start_kernel (init/main.c:899) [ 0.466727] 
startup_continue (arch/s390/kernel/head64.S:35)
[ 0.466811] INFO: lockdep is turned off.
On Fri, Mar 22, 2024 at 11:21:02AM -0700, Guenter Roeck wrote: > Hi, > > On Tue, Jan 02, 2024 at 07:46:29PM +0100, Uladzislau Rezki (Sony) wrote: > > Concurrent access to a global vmap space is a bottle-neck. > > We can simulate a high contention by running a vmalloc test > > suite. > > > > To address it, introduce an effective vmap node logic. Each > > node behaves as independent entity. When a node is accessed > > it serves a request directly(if possible) from its pool. > > > > This model has a size based pool for requests, i.e. pools are > > serialized and populated based on object size and real demand. > > A maximum object size that pool can handle is set to 256 pages. > > > > This technique reduces a pressure on the global vmap lock. > > > > Signed-off-by: Uladzislau Rezki (Sony) <urezki@gmail.com> > > This patch results in a persistent "spinlock bad magic" message > when booting s390 images with spinlock debugging enabled. > > [ 0.465445] BUG: spinlock bad magic on CPU#0, swapper/0 > [ 0.465490] lock: single+0x1860/0x1958, .magic: 00000000, .owner: <none>/-1, .owner_cpu: 0 > [ 0.466067] CPU: 0 PID: 0 Comm: swapper Not tainted 6.8.0-12955-g8e938e398669 #1 > [ 0.466188] Hardware name: QEMU 8561 QEMU (KVM/Linux) > [ 0.466270] Call Trace: > [ 0.466470] [<00000000011f26c8>] dump_stack_lvl+0x98/0xd8 > [ 0.466516] [<00000000001dcc6a>] do_raw_spin_lock+0x8a/0x108 > [ 0.466545] [<000000000042146c>] find_vmap_area+0x6c/0x108 > [ 0.466572] [<000000000042175a>] find_vm_area+0x22/0x40 > [ 0.466597] [<000000000012f152>] __set_memory+0x132/0x150 > [ 0.466624] [<0000000001cc0398>] vmem_map_init+0x40/0x118 > [ 0.466651] [<0000000001cc0092>] paging_init+0x22/0x68 > [ 0.466677] [<0000000001cbbed2>] setup_arch+0x52a/0x708 > [ 0.466702] [<0000000001cb6140>] start_kernel+0x80/0x5c8 > [ 0.466727] [<0000000000100036>] startup_continue+0x36/0x40 > > Bisect results and decoded stacktrace below. > > The uninitialized spinlock is &vn->busy.lock. > Debugging shows that this lock is actually never initialized. > It is. Once the vmalloc_init() "main entry" function is called from the: <snip> start_kernel() mm_core_init() vmalloc_init() <snip> > [ 0.464684] ####### locking 0000000002280fb8 > [ 0.464862] BUG: spinlock bad magic on CPU#0, swapper/0 > ... > [ 0.464684] ####### locking 0000000002280fb8 > [ 0.477479] ####### locking 0000000002280fb8 > [ 0.478166] ####### locking 0000000002280fb8 > [ 0.478218] ####### locking 0000000002280fb8 > ... > [ 0.718250] #### busy lock init 0000000002871860 > [ 0.718328] #### busy lock init 00000000028731b8 > > Only the initialized locks are used after the call to vmap_init_nodes(). > Right, when the vmap space and vmalloc is initialized. 
> Guenter > > --- > # bad: [8e938e39866920ddc266898e6ae1fffc5c8f51aa] Merge tag '6.9-rc-smb3-client-fixes-part2' of git://git.samba.org/sfrench/cifs-2.6 > # good: [e8f897f4afef0031fe618a8e94127a0934896aba] Linux 6.8 > git bisect start 'HEAD' 'v6.8' > # good: [e56bc745fa1de77abc2ad8debc4b1b83e0426c49] smb311: additional compression flag defined in updated protocol spec > git bisect good e56bc745fa1de77abc2ad8debc4b1b83e0426c49 > # bad: [902861e34c401696ed9ad17a54c8790e7e8e3069] Merge tag 'mm-stable-2024-03-13-20-04' of git://git.kernel.org/pub/scm/linux/kernel/git/akpm/mm > git bisect bad 902861e34c401696ed9ad17a54c8790e7e8e3069 > # good: [480e035fc4c714fb5536e64ab9db04fedc89e910] Merge tag 'drm-next-2024-03-13' of https://gitlab.freedesktop.org/drm/kernel > git bisect good 480e035fc4c714fb5536e64ab9db04fedc89e910 > # good: [fe46a7dd189e25604716c03576d05ac8a5209743] Merge tag 'sound-6.9-rc1' of git://git.kernel.org/pub/scm/linux/kernel/git/tiwai/sound > git bisect good fe46a7dd189e25604716c03576d05ac8a5209743 > # bad: [435a75548109f19e5b5b14ae35b9acb063c084e9] mm: use folio more widely in __split_huge_page > git bisect bad 435a75548109f19e5b5b14ae35b9acb063c084e9 > # good: [4d5bf0b6183f79ea361dd506365d2a471270735c] mm/mmu_gather: add tlb_remove_tlb_entries() > git bisect good 4d5bf0b6183f79ea361dd506365d2a471270735c > # bad: [4daacfe8f99f4b4cef562649d56c48642981f46e] mm/damon/sysfs-schemes: support PSI-based quota auto-tune > git bisect bad 4daacfe8f99f4b4cef562649d56c48642981f46e > # good: [217b2119b9e260609958db413876f211038f00ee] mm,page_owner: implement the tracking of the stacks count > git bisect good 217b2119b9e260609958db413876f211038f00ee > # bad: [40254101d87870b2e5ac3ddc28af40aa04c48486] arm64, crash: wrap crash dumping code into crash related ifdefs > git bisect bad 40254101d87870b2e5ac3ddc28af40aa04c48486 > # bad: [53becf32aec1c8049b854f0c31a11df5ed75df6f] mm: vmalloc: support multiple nodes in vread_iter > git bisect bad 53becf32aec1c8049b854f0c31a11df5ed75df6f > # good: [7fa8cee003166ef6db0bba70d610dbf173543811] mm: vmalloc: move vmap_init_free_space() down in vmalloc.c > git bisect good 7fa8cee003166ef6db0bba70d610dbf173543811 > # good: [282631cb2447318e2a55b41a665dbe8571c46d70] mm: vmalloc: remove global purge_vmap_area_root rb-tree > git bisect good 282631cb2447318e2a55b41a665dbe8571c46d70 > # bad: [96aa8437d169b8e030a98e2b74fd9a8ee9d3be7e] mm: vmalloc: add a scan area of VA only once > git bisect bad 96aa8437d169b8e030a98e2b74fd9a8ee9d3be7e > # bad: [72210662c5a2b6005f6daea7fe293a0dc573e1a5] mm: vmalloc: offload free_vmap_area_lock lock > git bisect bad 72210662c5a2b6005f6daea7fe293a0dc573e1a5 > # first bad commit: [72210662c5a2b6005f6daea7fe293a0dc573e1a5] mm: vmalloc: offload free_vmap_area_lock lock > > --- > [ 0.465490] lock: single+0x1860/0x1958, .magic: 00000000, .owner: <none>/-1, .owner_cpu: 0 > [ 0.466067] CPU: 0 PID: 0 Comm: swapper Not tainted 6.8.0-12955-g8e938e398669 #1 > [ 0.466188] Hardware name: QEMU 8561 QEMU (KVM/Linux) > [ 0.466270] Call Trace: > [ 0.466470] dump_stack_lvl (lib/dump_stack.c:117) > [ 0.466516] do_raw_spin_lock (kernel/locking/spinlock_debug.c:87 kernel/locking/spinlock_debug.c:115) > [ 0.466545] find_vmap_area (mm/vmalloc.c:1059 mm/vmalloc.c:2364) > [ 0.466572] find_vm_area (mm/vmalloc.c:3150) > [ 0.466597] __set_memory (arch/s390/mm/pageattr.c:360 arch/s390/mm/pageattr.c:393) > [ 0.466624] vmem_map_init (./arch/s390/include/asm/set_memory.h:55 arch/s390/mm/vmem.c:660) > [ 0.466651] paging_init (arch/s390/mm/init.c:97) > [ 0.466677] 
setup_arch (arch/s390/kernel/setup.c:972) > [ 0.466702] start_kernel (init/main.c:899) > [ 0.466727] startup_continue (arch/s390/kernel/head64.S:35) > [ 0.466811] INFO: lockdep is turned off. > <snip> diff --git a/mm/vmalloc.c b/mm/vmalloc.c index 22aa63f4ef63..0d77d171b5d9 100644 --- a/mm/vmalloc.c +++ b/mm/vmalloc.c @@ -2343,6 +2343,9 @@ struct vmap_area *find_vmap_area(unsigned long addr) struct vmap_area *va; int i, j; + if (unlikely(!vmap_initialized)) + return NULL; + /* * An addr_to_node_id(addr) converts an address to a node index * where a VA is located. If VA spans several zones and passed <snip> Could you please test it? -- Uladzislau Rezki
On 3/22/24 12:03, Uladzislau Rezki wrote:
[ ... ]
> <snip>
> diff --git a/mm/vmalloc.c b/mm/vmalloc.c
> index 22aa63f4ef63..0d77d171b5d9 100644
> --- a/mm/vmalloc.c
> +++ b/mm/vmalloc.c
> @@ -2343,6 +2343,9 @@ struct vmap_area *find_vmap_area(unsigned long addr)
> 	struct vmap_area *va;
> 	int i, j;
>
> +	if (unlikely(!vmap_initialized))
> +		return NULL;
> +
> 	/*
> 	 * An addr_to_node_id(addr) converts an address to a node index
> 	 * where a VA is located. If VA spans several zones and passed
> <snip>
>
> Could you please test it?
>
That fixes the problem.

Thanks,
Guenter
diff --git a/mm/vmalloc.c b/mm/vmalloc.c index 9b2f1b0cac9d..fa4ab2bbbc5b 100644 --- a/mm/vmalloc.c +++ b/mm/vmalloc.c @@ -775,7 +775,22 @@ struct rb_list { spinlock_t lock; }; +struct vmap_pool { + struct list_head head; + unsigned long len; +}; + +/* + * A fast size storage contains VAs up to 1M size. + */ +#define MAX_VA_SIZE_PAGES 256 + static struct vmap_node { + /* Simple size segregated storage. */ + struct vmap_pool pool[MAX_VA_SIZE_PAGES]; + spinlock_t pool_lock; + bool skip_populate; + /* Bookkeeping data of this node. */ struct rb_list busy; struct rb_list lazy; @@ -784,6 +799,8 @@ static struct vmap_node { * Ready-to-free areas. */ struct list_head purge_list; + struct work_struct purge_work; + unsigned long nr_purged; } single; static struct vmap_node *vmap_nodes = &single; @@ -802,6 +819,61 @@ addr_to_node(unsigned long addr) return &vmap_nodes[addr_to_node_id(addr)]; } +static inline struct vmap_node * +id_to_node(unsigned int id) +{ + return &vmap_nodes[id % nr_vmap_nodes]; +} + +/* + * We use the value 0 to represent "no node", that is why + * an encoded value will be the node-id incremented by 1. + * It is always greater then 0. A valid node_id which can + * be encoded is [0:nr_vmap_nodes - 1]. If a passed node_id + * is not valid 0 is returned. + */ +static unsigned int +encode_vn_id(unsigned int node_id) +{ + /* Can store U8_MAX [0:254] nodes. */ + if (node_id < nr_vmap_nodes) + return (node_id + 1) << BITS_PER_BYTE; + + /* Warn and no node encoded. */ + WARN_ONCE(1, "Encode wrong node id (%u)\n", node_id); + return 0; +} + +/* + * Returns an encoded node-id, the valid range is within + * [0:nr_vmap_nodes-1] values. Otherwise nr_vmap_nodes is + * returned if extracted data is wrong. + */ +static unsigned int +decode_vn_id(unsigned int val) +{ + unsigned int node_id = (val >> BITS_PER_BYTE) - 1; + + /* Can store U8_MAX [0:254] nodes. */ + if (node_id < nr_vmap_nodes) + return node_id; + + /* If it was _not_ zero, warn. */ + WARN_ONCE(node_id != UINT_MAX, + "Decode wrong node id (%d)\n", node_id); + + return nr_vmap_nodes; +} + +static bool +is_vn_id_valid(unsigned int node_id) +{ + if (node_id < nr_vmap_nodes) + return true; + + return false; +} + static __always_inline unsigned long va_size(struct vmap_area *va) { @@ -1623,6 +1695,104 @@ preload_this_cpu_lock(spinlock_t *lock, gfp_t gfp_mask, int node) kmem_cache_free(vmap_area_cachep, va); } +static struct vmap_pool * +size_to_va_pool(struct vmap_node *vn, unsigned long size) +{ + unsigned int idx = (size - 1) / PAGE_SIZE; + + if (idx < MAX_VA_SIZE_PAGES) + return &vn->pool[idx]; + + return NULL; +} + +static bool +node_pool_add_va(struct vmap_node *n, struct vmap_area *va) +{ + struct vmap_pool *vp; + + vp = size_to_va_pool(n, va_size(va)); + if (!vp) + return false; + + spin_lock(&n->pool_lock); + list_add(&va->list, &vp->head); + WRITE_ONCE(vp->len, vp->len + 1); + spin_unlock(&n->pool_lock); + + return true; +} + +static struct vmap_area * +node_pool_del_va(struct vmap_node *vn, unsigned long size, + unsigned long align, unsigned long vstart, + unsigned long vend) +{ + struct vmap_area *va = NULL; + struct vmap_pool *vp; + int err = 0; + + vp = size_to_va_pool(vn, size); + if (!vp || list_empty(&vp->head)) + return NULL; + + spin_lock(&vn->pool_lock); + if (!list_empty(&vp->head)) { + va = list_first_entry(&vp->head, struct vmap_area, list); + + if (IS_ALIGNED(va->va_start, align)) { + /* + * Do some sanity check and emit a warning + * if one of below checks detects an error. 
+ */ + err |= (va_size(va) != size); + err |= (va->va_start < vstart); + err |= (va->va_end > vend); + + if (!WARN_ON_ONCE(err)) { + list_del_init(&va->list); + WRITE_ONCE(vp->len, vp->len - 1); + } else { + va = NULL; + } + } else { + list_move_tail(&va->list, &vp->head); + va = NULL; + } + } + spin_unlock(&vn->pool_lock); + + return va; +} + +static struct vmap_area * +node_alloc(unsigned long size, unsigned long align, + unsigned long vstart, unsigned long vend, + unsigned long *addr, unsigned int *vn_id) +{ + struct vmap_area *va; + + *vn_id = 0; + *addr = vend; + + /* + * Fallback to a global heap if not vmalloc or there + * is only one node. + */ + if (vstart != VMALLOC_START || vend != VMALLOC_END || + nr_vmap_nodes == 1) + return NULL; + + *vn_id = raw_smp_processor_id() % nr_vmap_nodes; + va = node_pool_del_va(id_to_node(*vn_id), size, align, vstart, vend); + *vn_id = encode_vn_id(*vn_id); + + if (va) + *addr = va->va_start; + + return va; +} + /* * Allocate a region of KVA of the specified size and alignment, within the * vstart and vend. @@ -1637,6 +1807,7 @@ static struct vmap_area *alloc_vmap_area(unsigned long size, struct vmap_area *va; unsigned long freed; unsigned long addr; + unsigned int vn_id; int purged = 0; int ret; @@ -1647,11 +1818,23 @@ static struct vmap_area *alloc_vmap_area(unsigned long size, return ERR_PTR(-EBUSY); might_sleep(); - gfp_mask = gfp_mask & GFP_RECLAIM_MASK; - va = kmem_cache_alloc_node(vmap_area_cachep, gfp_mask, node); - if (unlikely(!va)) - return ERR_PTR(-ENOMEM); + /* + * If a VA is obtained from a global heap(if it fails here) + * it is anyway marked with this "vn_id" so it is returned + * to this pool's node later. Such way gives a possibility + * to populate pools based on users demand. + * + * On success a ready to go VA is returned. + */ + va = node_alloc(size, align, vstart, vend, &addr, &vn_id); + if (!va) { + gfp_mask = gfp_mask & GFP_RECLAIM_MASK; + + va = kmem_cache_alloc_node(vmap_area_cachep, gfp_mask, node); + if (unlikely(!va)) + return ERR_PTR(-ENOMEM); + } /* * Only scan the relevant parts containing pointers to other objects @@ -1660,10 +1843,12 @@ static struct vmap_area *alloc_vmap_area(unsigned long size, kmemleak_scan_area(&va->rb_node, SIZE_MAX, gfp_mask); retry: - preload_this_cpu_lock(&free_vmap_area_lock, gfp_mask, node); - addr = __alloc_vmap_area(&free_vmap_area_root, &free_vmap_area_list, - size, align, vstart, vend); - spin_unlock(&free_vmap_area_lock); + if (addr == vend) { + preload_this_cpu_lock(&free_vmap_area_lock, gfp_mask, node); + addr = __alloc_vmap_area(&free_vmap_area_root, &free_vmap_area_list, + size, align, vstart, vend); + spin_unlock(&free_vmap_area_lock); + } trace_alloc_vmap_area(addr, size, align, vstart, vend, addr == vend); @@ -1677,7 +1862,7 @@ static struct vmap_area *alloc_vmap_area(unsigned long size, va->va_start = addr; va->va_end = addr + size; va->vm = NULL; - va->flags = va_flags; + va->flags = (va_flags | vn_id); vn = addr_to_node(va->va_start); @@ -1770,63 +1955,135 @@ static DEFINE_MUTEX(vmap_purge_lock); static void purge_fragmented_blocks_allcpus(void); static cpumask_t purge_nodes; -/* - * Purges all lazily-freed vmap areas. 
- */ -static unsigned long -purge_vmap_node(struct vmap_node *vn) +static void +reclaim_list_global(struct list_head *head) { - unsigned long num_purged_areas = 0; - struct vmap_area *va, *n_va; + struct vmap_area *va, *n; - if (list_empty(&vn->purge_list)) - return 0; + if (list_empty(head)) + return; spin_lock(&free_vmap_area_lock); + list_for_each_entry_safe(va, n, head, list) + merge_or_add_vmap_area_augment(va, + &free_vmap_area_root, &free_vmap_area_list); + spin_unlock(&free_vmap_area_lock); +} + +static void +decay_va_pool_node(struct vmap_node *vn, bool full_decay) +{ + struct vmap_area *va, *nva; + struct list_head decay_list; + struct rb_root decay_root; + unsigned long n_decay; + int i; + + decay_root = RB_ROOT; + INIT_LIST_HEAD(&decay_list); + + for (i = 0; i < MAX_VA_SIZE_PAGES; i++) { + struct list_head tmp_list; + + if (list_empty(&vn->pool[i].head)) + continue; + + INIT_LIST_HEAD(&tmp_list); + + /* Detach the pool, so no-one can access it. */ + spin_lock(&vn->pool_lock); + list_replace_init(&vn->pool[i].head, &tmp_list); + spin_unlock(&vn->pool_lock); + + if (full_decay) + WRITE_ONCE(vn->pool[i].len, 0); + + /* Decay a pool by ~25% out of left objects. */ + n_decay = vn->pool[i].len >> 2; + + list_for_each_entry_safe(va, nva, &tmp_list, list) { + list_del_init(&va->list); + merge_or_add_vmap_area(va, &decay_root, &decay_list); + + if (!full_decay) { + WRITE_ONCE(vn->pool[i].len, vn->pool[i].len - 1); + + if (!--n_decay) + break; + } + } + + /* Attach the pool back if it has been partly decayed. */ + if (!full_decay && !list_empty(&tmp_list)) { + spin_lock(&vn->pool_lock); + list_replace_init(&tmp_list, &vn->pool[i].head); + spin_unlock(&vn->pool_lock); + } + } + + reclaim_list_global(&decay_list); +} + +static void purge_vmap_node(struct work_struct *work) +{ + struct vmap_node *vn = container_of(work, + struct vmap_node, purge_work); + struct vmap_area *va, *n_va; + LIST_HEAD(local_list); + + vn->nr_purged = 0; + list_for_each_entry_safe(va, n_va, &vn->purge_list, list) { unsigned long nr = (va->va_end - va->va_start) >> PAGE_SHIFT; unsigned long orig_start = va->va_start; unsigned long orig_end = va->va_end; + unsigned int vn_id = decode_vn_id(va->flags); - /* - * Finally insert or merge lazily-freed area. It is - * detached and there is no need to "unlink" it from - * anything. - */ - va = merge_or_add_vmap_area_augment(va, &free_vmap_area_root, - &free_vmap_area_list); - - if (!va) - continue; + list_del_init(&va->list); if (is_vmalloc_or_module_addr((void *)orig_start)) kasan_release_vmalloc(orig_start, orig_end, va->va_start, va->va_end); atomic_long_sub(nr, &vmap_lazy_nr); - num_purged_areas++; + vn->nr_purged++; + + if (is_vn_id_valid(vn_id) && !vn->skip_populate) + if (node_pool_add_va(vn, va)) + continue; + + /* Go back to global. */ + list_add(&va->list, &local_list); } - spin_unlock(&free_vmap_area_lock); - return num_purged_areas; + reclaim_list_global(&local_list); } /* * Purges all lazily-freed vmap areas. */ -static bool __purge_vmap_area_lazy(unsigned long start, unsigned long end) +static bool __purge_vmap_area_lazy(unsigned long start, unsigned long end, + bool full_pool_decay) { - unsigned long num_purged_areas = 0; + unsigned long nr_purged_areas = 0; + unsigned int nr_purge_helpers; + unsigned int nr_purge_nodes; struct vmap_node *vn; int i; lockdep_assert_held(&vmap_purge_lock); + + /* + * Use cpumask to mark which node has to be processed. 
+ */ purge_nodes = CPU_MASK_NONE; for (i = 0; i < nr_vmap_nodes; i++) { vn = &vmap_nodes[i]; INIT_LIST_HEAD(&vn->purge_list); + vn->skip_populate = full_pool_decay; + decay_va_pool_node(vn, full_pool_decay); if (RB_EMPTY_ROOT(&vn->lazy.root)) continue; @@ -1845,17 +2102,45 @@ static bool __purge_vmap_area_lazy(unsigned long start, unsigned long end) cpumask_set_cpu(i, &purge_nodes); } - if (cpumask_weight(&purge_nodes) > 0) { + nr_purge_nodes = cpumask_weight(&purge_nodes); + if (nr_purge_nodes > 0) { flush_tlb_kernel_range(start, end); + /* One extra worker is per a lazy_max_pages() full set minus one. */ + nr_purge_helpers = atomic_long_read(&vmap_lazy_nr) / lazy_max_pages(); + nr_purge_helpers = clamp(nr_purge_helpers, 1U, nr_purge_nodes) - 1; + for_each_cpu(i, &purge_nodes) { - vn = &nodes[i]; - num_purged_areas += purge_vmap_node(vn); + vn = &vmap_nodes[i]; + + if (nr_purge_helpers > 0) { + INIT_WORK(&vn->purge_work, purge_vmap_node); + + if (cpumask_test_cpu(i, cpu_online_mask)) + schedule_work_on(i, &vn->purge_work); + else + schedule_work(&vn->purge_work); + + nr_purge_helpers--; + } else { + vn->purge_work.func = NULL; + purge_vmap_node(&vn->purge_work); + nr_purged_areas += vn->nr_purged; + } + } + + for_each_cpu(i, &purge_nodes) { + vn = &vmap_nodes[i]; + + if (vn->purge_work.func) { + flush_work(&vn->purge_work); + nr_purged_areas += vn->nr_purged; + } } } - trace_purge_vmap_area_lazy(start, end, num_purged_areas); - return num_purged_areas > 0; + trace_purge_vmap_area_lazy(start, end, nr_purged_areas); + return nr_purged_areas > 0; } /* @@ -1866,14 +2151,14 @@ static void reclaim_and_purge_vmap_areas(void) { mutex_lock(&vmap_purge_lock); purge_fragmented_blocks_allcpus(); - __purge_vmap_area_lazy(ULONG_MAX, 0); + __purge_vmap_area_lazy(ULONG_MAX, 0, true); mutex_unlock(&vmap_purge_lock); } static void drain_vmap_area_work(struct work_struct *work) { mutex_lock(&vmap_purge_lock); - __purge_vmap_area_lazy(ULONG_MAX, 0); + __purge_vmap_area_lazy(ULONG_MAX, 0, false); mutex_unlock(&vmap_purge_lock); } @@ -1884,9 +2169,10 @@ static void drain_vmap_area_work(struct work_struct *work) */ static void free_vmap_area_noflush(struct vmap_area *va) { - struct vmap_node *vn = addr_to_node(va->va_start); unsigned long nr_lazy_max = lazy_max_pages(); unsigned long va_start = va->va_start; + unsigned int vn_id = decode_vn_id(va->flags); + struct vmap_node *vn; unsigned long nr_lazy; if (WARN_ON_ONCE(!list_empty(&va->list))) @@ -1896,10 +2182,14 @@ static void free_vmap_area_noflush(struct vmap_area *va) PAGE_SHIFT, &vmap_lazy_nr); /* - * Merge or place it to the purge tree/list. + * If it was request by a certain node we would like to + * return it to that node, i.e. its pool for later reuse. */ + vn = is_vn_id_valid(vn_id) ? 
+ id_to_node(vn_id):addr_to_node(va->va_start); + spin_lock(&vn->lazy.lock); - merge_or_add_vmap_area(va, &vn->lazy.root, &vn->lazy.head); + insert_vmap_area(va, &vn->lazy.root, &vn->lazy.head); spin_unlock(&vn->lazy.lock); trace_free_vmap_area_noflush(va_start, nr_lazy, nr_lazy_max); @@ -2408,7 +2698,7 @@ static void _vm_unmap_aliases(unsigned long start, unsigned long end, int flush) } free_purged_blocks(&purge_list); - if (!__purge_vmap_area_lazy(start, end) && flush) + if (!__purge_vmap_area_lazy(start, end, false) && flush) flush_tlb_kernel_range(start, end); mutex_unlock(&vmap_purge_lock); } @@ -4576,7 +4866,7 @@ static void vmap_init_free_space(void) static void vmap_init_nodes(void) { struct vmap_node *vn; - int i; + int i, j; for (i = 0; i < nr_vmap_nodes; i++) { vn = &vmap_nodes[i]; @@ -4587,6 +4877,13 @@ static void vmap_init_nodes(void) vn->lazy.root = RB_ROOT; INIT_LIST_HEAD(&vn->lazy.head); spin_lock_init(&vn->lazy.lock); + + for (j = 0; j < MAX_VA_SIZE_PAGES; j++) { + INIT_LIST_HEAD(&vn->pool[j].head); + WRITE_ONCE(vn->pool[j].len, 0); + } + + spin_lock_init(&vn->pool_lock); } }
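A quick way to see what the encode_vn_id()/decode_vn_id() convention in the diff above buys: the node id is biased by one and parked in the second byte of va->flags, so a raw value of 0 still reads as "no node" and the low byte stays free for the existing VMAP_* flags. The snippet below is a stand-alone user-space sketch of that arithmetic, not kernel code; NR_VMAP_NODES and the fake low-byte flag are made-up values for illustration only.

#include <stdio.h>

#define BITS_PER_BYTE	8
#define NR_VMAP_NODES	8	/* made-up here; the patch sizes this at boot */

/* Valid ids are [0:NR_VMAP_NODES - 1]; an encoded value of 0 means "no node". */
static unsigned int encode_vn_id(unsigned int node_id)
{
	return (node_id < NR_VMAP_NODES) ?
		(node_id + 1) << BITS_PER_BYTE : 0;
}

/* Out-of-range input (including the "no node" marker) decodes to NR_VMAP_NODES. */
static unsigned int decode_vn_id(unsigned int val)
{
	unsigned int node_id = (val >> BITS_PER_BYTE) - 1;

	return (node_id < NR_VMAP_NODES) ? node_id : NR_VMAP_NODES;
}

int main(void)
{
	/* Pretend the low byte already carries a VMAP_RAM-style flag bit. */
	unsigned int flags = 0x1 | encode_vn_id(3);

	printf("flags 0x%x -> node %u\n", flags, decode_vn_id(flags));	/* node 3 */
	printf("flags 0x%x -> node %u\n", 0x1, decode_vn_id(0x1));	/* NR_VMAP_NODES: no node */
	return 0;
}

Shifting by a whole byte is what lets the node id coexist with the flag bits already kept in va->flags: decode simply shifts the low byte away, so neither side needs extra masking.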
Concurrent access to the global vmap space is a bottleneck. We can simulate high contention by running a vmalloc test suite.

To address it, introduce an effective vmap node logic. Each node behaves as an independent entity. When a node is accessed, it serves a request directly (if possible) from its pool.

The model uses size-based pools for requests, i.e. pools are serialized and populated based on object size and real demand. The maximum object size a pool can handle is set to 256 pages.

This technique reduces pressure on the global vmap lock.

Signed-off-by: Uladzislau Rezki (Sony) <urezki@gmail.com>
---
 mm/vmalloc.c | 387 +++++++++++++++++++++++++++++++++++++++++++++------
 1 file changed, 342 insertions(+), 45 deletions(-)
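To make the "size based pool" wording above concrete, here is a rough user-space model of the lookup that size_to_va_pool() performs in the patch: a request of N bytes lands in pool index (N - 1) / PAGE_SIZE, and anything above MAX_VA_SIZE_PAGES pages (256 pages, i.e. 1 MiB with 4 KiB pages) is left to the global free space. PAGE_SIZE is hard-coded below purely for the demo.

#include <stdio.h>

#define PAGE_SIZE		4096UL	/* hard-coded for this demo only */
#define MAX_VA_SIZE_PAGES	256

/* Mirrors size_to_va_pool(): returns the pool index, or -1 for "too big". */
static int size_to_pool_idx(unsigned long size)
{
	unsigned long idx = (size - 1) / PAGE_SIZE;

	return (idx < MAX_VA_SIZE_PAGES) ? (int)idx : -1;
}

int main(void)
{
	unsigned long sizes[] = { PAGE_SIZE, 2 * PAGE_SIZE, 1UL << 20, (1UL << 20) + 1 };
	unsigned int i;

	for (i = 0; i < sizeof(sizes) / sizeof(sizes[0]); i++)
		printf("size %8lu -> pool %d\n", sizes[i], size_to_pool_idx(sizes[i]));

	/*
	 * size     4096 -> pool 0
	 * size     8192 -> pool 1
	 * size  1048576 -> pool 255
	 * size  1048577 -> pool -1	(beyond 256 pages: global heap path)
	 */
	return 0;
}

In the patch, a node first tries the matching pool under its own pool_lock and only falls back to the global free_vmap_area_root when that fails (the addr == vend case in alloc_vmap_area()), which is where the reduction in global lock pressure comes from.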