Message ID | 20210919164239.49905-1-42.hyeyoo@gmail.com (mailing list archive)
---|---
State | New |
Series | [RFC] Introducing lockless cache built on top of slab allocator
On Sun, Sep 19, 2021 at 04:42:39PM +0000, Hyeonggon Yoo wrote:
> It is just a simple proof of concept and not ready for submission yet. There may be wrong code (like wrong gfp flags or wrong error handling); it is just a simple proof of concept. I would like comments from you.

Have you read:

https://www.usenix.org/legacy/event/usenix01/full_papers/bonwick/bonwick_html/

The relevant part of that paper is section 3, magazines. We should have low and high water marks for number of objects, and we should allocate from / free to the slab allocator in batches. Slab has bulk alloc/free APIs already.

I'd rather see this be part of the slab allocator than a separate API.
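For reference, the bulk APIs mentioned above already exist in <linux/slab.h>. A minimal sketch of how a caller batches allocations and frees with them; the cache variable and the batch size here are only illustrative, not part of the patch:

#include <linux/slab.h>

#define BATCH	8

static struct kmem_cache *example_cache;	/* illustrative cache, not from the patch */

/* Refill a caller-supplied array with up to BATCH objects in one call;
 * returns the number of objects actually allocated (0 on failure). */
static int example_refill(void **objs, gfp_t gfp)
{
	return kmem_cache_alloc_bulk(example_cache, gfp, BATCH, objs);
}

/* Give nr objects back to the slab allocator in one call. */
static void example_drain(void **objs, size_t nr)
{
	kmem_cache_free_bulk(example_cache, nr, objs);
}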
Hello Matthew, thanks for giving me a comment! I appreciate it.

On Sun, Sep 19, 2021 at 08:17:44PM +0100, Matthew Wilcox wrote:
> On Sun, Sep 19, 2021 at 04:42:39PM +0000, Hyeonggon Yoo wrote:
> > It is just a simple proof of concept and not ready for submission yet. There may be wrong code (like wrong gfp flags or wrong error handling); it is just a simple proof of concept. I would like comments from you.
>
> Have you read:
>
> https://www.usenix.org/legacy/event/usenix01/full_papers/bonwick/bonwick_html/
> The relevant part of that paper is section 3, magazines. We should have low and high water marks for number of objects

I hadn't read that before, but after reading it, it seems no different from SLAB's per-CPU queuing.

> and we should allocate from / free to the slab allocator in batches. Slab has bulk alloc/free APIs already.

There are kmem_cache_{alloc,free}_bulk() functions for bulk allocation. But they are designed for large numbers of allocations to reduce locking cost, not for per-CPU lockless allocation.

Yeah, we could implement a lockless cache using kmem_cache_{alloc,free}_bulk(), but those functions alone are not enough.

> I'd rather see this be part of the slab allocator than a separate API.

And I disagree on this, because in most situations we cannot allocate without a lock; this is a special case for IO polling.

To make it part of the slab allocator, we would need to modify the existing data structures. But making it part of the slab allocator would waste memory, because most caches would not use this.
On Mon, Sep 20, 2021 at 01:09:38AM +0000, Hyeonggon Yoo wrote:
> Hello Matthew, thanks for giving me a comment! I appreciate it.
>
> On Sun, Sep 19, 2021 at 08:17:44PM +0100, Matthew Wilcox wrote:
> > On Sun, Sep 19, 2021 at 04:42:39PM +0000, Hyeonggon Yoo wrote:
> > > It is just a simple proof of concept and not ready for submission yet. There may be wrong code (like wrong gfp flags or wrong error handling); it is just a simple proof of concept. I would like comments from you.
> >
> > Have you read:
> >
> > https://www.usenix.org/legacy/event/usenix01/full_papers/bonwick/bonwick_html/
> > The relevant part of that paper is section 3, magazines. We should have low and high water marks for number of objects
>
> I hadn't read that before, but after reading it, it seems no different from SLAB's per-CPU queuing.
>
> > and we should allocate from / free to the slab allocator in batches. Slab has bulk alloc/free APIs already.
>
> There are kmem_cache_{alloc,free}_bulk() functions for bulk allocation. But they are designed for large numbers of allocations to reduce locking cost, not for per-CPU lockless allocation.

What I'm saying is that rather than a linked list of objects, we should have an array of, say, 15 pointers per CPU (and a count of how many allocations we have). If we are trying to allocate and have no objects, call kmem_cache_alloc_bulk() for 8 objects. If we are trying to free and have 15 objects already, call kmem_cache_free_bulk() for the last 8 objects and set the number of allocated objects to 7.

(Maybe 8 and 15 are the wrong numbers; this is just an example.)

> Yeah, we could implement a lockless cache using kmem_cache_{alloc,free}_bulk(), but those functions alone are not enough.
>
> > I'd rather see this be part of the slab allocator than a separate API.
>
> And I disagree on this, because in most situations we cannot allocate without a lock; this is a special case for IO polling.
>
> To make it part of the slab allocator, we would need to modify the existing data structures. But making it part of the slab allocator would waste memory, because most caches would not use this.

Oh, it would have to be an option. Maybe as a new slab_flags_t flag. Or maybe a kmem_cache_alloc_percpu_lockless().
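To make the shape of that proposal concrete, here is a rough, untested sketch of the per-CPU array Matthew describes. The 15/8 sizes are taken straight from his example, all names are invented, and protection against preemption or IRQs around the per-CPU access is omitted for brevity:

#include <linux/percpu.h>
#include <linux/slab.h>

#define OBJ_CACHE_MAX	15	/* high water mark per CPU */
#define OBJ_CACHE_BATCH	8	/* bulk refill / drain size */

struct percpu_obj_cache {
	void *objects[OBJ_CACHE_MAX];
	unsigned int nr;	/* how many pointers are currently cached */
};

static void *obj_cache_alloc(struct kmem_cache *s,
			     struct percpu_obj_cache __percpu *pcp, gfp_t gfp)
{
	struct percpu_obj_cache *c = this_cpu_ptr(pcp);

	if (!c->nr) {
		/* Empty: refill from the slab allocator in one batch. */
		c->nr = kmem_cache_alloc_bulk(s, gfp, OBJ_CACHE_BATCH, c->objects);
		if (!c->nr)
			return NULL;
	}
	return c->objects[--c->nr];
}

static void obj_cache_free(struct kmem_cache *s,
			   struct percpu_obj_cache __percpu *pcp, void *obj)
{
	struct percpu_obj_cache *c = this_cpu_ptr(pcp);

	if (c->nr == OBJ_CACHE_MAX) {
		/* Full: hand the last OBJ_CACHE_BATCH objects back in one batch. */
		kmem_cache_free_bulk(s, OBJ_CACHE_BATCH,
				     &c->objects[OBJ_CACHE_MAX - OBJ_CACHE_BATCH]);
		c->nr = OBJ_CACHE_MAX - OBJ_CACHE_BATCH;
	}
	c->objects[c->nr++] = obj;
}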
On Mon, Sep 20, 2021 at 02:53:34AM +0100, Matthew Wilcox wrote:
> On Mon, Sep 20, 2021 at 01:09:38AM +0000, Hyeonggon Yoo wrote:
> > Hello Matthew, thanks for giving me a comment! I appreciate it.
> >
> > On Sun, Sep 19, 2021 at 08:17:44PM +0100, Matthew Wilcox wrote:
> > > On Sun, Sep 19, 2021 at 04:42:39PM +0000, Hyeonggon Yoo wrote:
> > > > It is just a simple proof of concept and not ready for submission yet. There may be wrong code (like wrong gfp flags or wrong error handling); it is just a simple proof of concept. I would like comments from you.
> > >
> > > Have you read:
> > >
> > > https://www.usenix.org/legacy/event/usenix01/full_papers/bonwick/bonwick_html/
> > > The relevant part of that paper is section 3, magazines. We should have low and high water marks for number of objects
> >
> > I hadn't read that before, but after reading it, it seems no different from SLAB's per-CPU queuing.
> >
> > > and we should allocate from / free to the slab allocator in batches. Slab has bulk alloc/free APIs already.
> >
> > There are kmem_cache_{alloc,free}_bulk() functions for bulk allocation. But they are designed for large numbers of allocations to reduce locking cost, not for per-CPU lockless allocation.
>
> What I'm saying is that rather than a linked list of objects, we should have an array of, say, 15 pointers per CPU (and a count of how many allocations we have). If we are trying to allocate and have no objects, call kmem_cache_alloc_bulk() for 8 objects. If we are trying to free and have 15 objects already, call kmem_cache_free_bulk() for the last 8 objects and set the number of allocated objects to 7.
>
> (Maybe 8 and 15 are the wrong numbers; this is just an example.)

Ah, okay. It seems better to use an array; using a cache for the list nodes is an unnecessary cost, and an array is simpler.

> > Yeah, we could implement a lockless cache using kmem_cache_{alloc,free}_bulk(), but those functions alone are not enough.
> >
> > > I'd rather see this be part of the slab allocator than a separate API.
> >
> > And I disagree on this, because in most situations we cannot allocate without a lock; this is a special case for IO polling.
> >
> > To make it part of the slab allocator, we would need to modify the existing data structures. But making it part of the slab allocator would waste memory, because most caches would not use this.
>
> Oh, it would have to be an option. Maybe as a new slab_flags_t flag. Or maybe a kmem_cache_alloc_percpu_lockless().

Oh, now I get what you mean. That is a good improvement!

For example, there could be a slab_flags_t flag like SLAB_LOCKLESS, and a cache created with the SLAB_LOCKLESS flag could allocate using either kmem_cache_alloc() or kmem_cache_alloc_percpu_lockless(), depending on the situation? (I suggest kmem_cache_alloc_lockless as a better name.)

That seems MUCH better, because it avoids duplicating a cache. I'll send RFC v2 soon.

Thank you so much, Matthew. If I have misunderstood anything, please let me know.

Thanks,
Hyeonggon Yoo
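The opt-in shape discussed here might look roughly like the following. SLAB_LOCKLESS, kmem_cache_alloc_lockless() and struct io_poll_req are only names floated in this thread or invented for the example; none of them exist in the kernel:

#include <linux/slab.h>

struct io_poll_req {			/* made-up object type for the example */
	int tag;
};

static struct kmem_cache *poll_req_cache;

static int poll_req_cache_setup(void)
{
	/* Opt in to the per-CPU lockless path with a (hypothetical) cache flag. */
	poll_req_cache = kmem_cache_create("io_poll_req", sizeof(struct io_poll_req),
					   0, SLAB_LOCKLESS, NULL);
	return poll_req_cache ? 0 : -ENOMEM;
}

static struct io_poll_req *poll_req_get(void)
{
	/* Fast path for the polling context: hypothetical lockless variant. */
	return kmem_cache_alloc_lockless(poll_req_cache, GFP_ATOMIC);
}

static void *other_user_alloc(void)
{
	/* Other users of the same cache keep the ordinary entry point. */
	return kmem_cache_alloc(poll_req_cache, GFP_KERNEL);
}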
On 9/20/21 03:53, Matthew Wilcox wrote:
> On Mon, Sep 20, 2021 at 01:09:38AM +0000, Hyeonggon Yoo wrote:
>> Hello Matthew, thanks for giving me a comment! I appreciate it.
>> Yeah, we could implement a lockless cache using kmem_cache_{alloc,free}_bulk(), but those functions alone are not enough.
>>
>> > I'd rather see this be part of the slab allocator than a separate API.
>>
>> And I disagree on this, because in most situations we cannot allocate without a lock; this is a special case for IO polling.
>>
>> To make it part of the slab allocator, we would need to modify the existing data structures. But making it part of the slab allocator would waste memory, because most caches would not use this.
>
> Oh, it would have to be an option. Maybe as a new slab_flags_t flag. Or maybe a kmem_cache_alloc_percpu_lockless().

I've recently found out that similar attempts (introducing queueing to SLUB) were made around 2010. See e.g. [1], but there will be other threads to search at lore too. I haven't checked yet why it wasn't ultimately merged; I guess Christoph and David could remember (this was before my time).

I guess making it opt-in only for caches where a performance improvement was measured would make it easier to add, as for some caches it would mean no improvement, but increased memory usage. But of course it makes the API harder to use.

I'd be careful about the name "lockless", as that's ambiguous. Is it "mostly lockless" and therefore fast, but if the cache is empty, it will still take locks as part of refill? Or is it always lockless, and therefore useful in contexts that can take no locks, but then the caller has to have fallbacks in case the cache is empty and nothing is allocated?

[1] https://lore.kernel.org/linux-mm/20100804024531.914852850@linux.com/T/#u
On Mon, Sep 20, 2021 at 11:07:36AM +0200, Vlastimil Babka wrote:
> On 9/20/21 03:53, Matthew Wilcox wrote:
> > On Mon, Sep 20, 2021 at 01:09:38AM +0000, Hyeonggon Yoo wrote:
> >> Hello Matthew, thanks for giving me a comment! I appreciate it.
> >> Yeah, we could implement a lockless cache using kmem_cache_{alloc,free}_bulk(), but those functions alone are not enough.
> >>
> >> > I'd rather see this be part of the slab allocator than a separate API.
> >>
> >> And I disagree on this, because in most situations we cannot allocate without a lock; this is a special case for IO polling.
> >>
> >> To make it part of the slab allocator, we would need to modify the existing data structures. But making it part of the slab allocator would waste memory, because most caches would not use this.
> >
> > Oh, it would have to be an option. Maybe as a new slab_flags_t flag. Or maybe a kmem_cache_alloc_percpu_lockless().
>
> I've recently found out that similar attempts (introducing queueing to SLUB) were made around 2010. See e.g. [1], but there will be other threads to search at lore too. I haven't checked yet why it wasn't ultimately merged; I guess Christoph and David could remember (this was before my time).

There was an attempt at SLUB with queueing, as you said. I searched a bit and found [2] and [3].

- SLUB with queueing (V2) beats SLAB netperf TCP_RR, 2010-07
[2] https://lore.kernel.org/lkml/alpine.DEB.2.00.1007121010420.14328@router.home/T/#m5a31c7caa28b93a00de3af6d547b79273449f5ba

- The Unified slab allocator (V4), 2010-10
[3] https://linux-mm.kvack.narkive.com/e595iCuz/unifiedv4-00-16-the-unified-slab-allocator-v4#post47

Looking at [3], there was still some regression when comparing "SLUB with queueing" against SLAB. And I couldn't find any patch series after [3] yet; I'll add a link if I find one.

> I guess making it opt-in only for caches where a performance improvement was measured would make it easier to add, as for some caches it would mean no improvement, but increased memory usage. But of course it makes the API harder to use.

Do you mean the "lockless cache" should be separate from slab because some caches don't benefit at all?

> I'd be careful about the name "lockless", as that's ambiguous. Is it "mostly lockless" and therefore fast, but if the cache is empty, it will still take locks as part of refill?

It is actually "mostly lockless", so it is ambiguous. Can you suggest a name, like try_lockless or anything else?

> Or is it always lockless, and therefore useful in contexts that can take no locks, but then the caller has to have fallbacks in case the cache is empty and nothing is allocated?
>
> [1] https://lore.kernel.org/linux-mm/20100804024531.914852850@linux.com/T/#u
On 9/20/21 13:55, Hyeonggon Yoo wrote:
> On Mon, Sep 20, 2021 at 11:07:36AM +0200, Vlastimil Babka wrote:
>> I guess making it opt-in only for caches where a performance improvement was measured would make it easier to add, as for some caches it would mean no improvement, but increased memory usage. But of course it makes the API harder to use.
>
> Do you mean the "lockless cache" should be separate from slab because some caches don't benefit at all?

I meant it seems to be a valid approach to have a special kmem_cache flag and allocation function variants, as you discussed. That covers the "some caches don't benefit at all" part while being an integral part of the allocator, so others don't have to build ad-hoc solutions on top of it, and possibly it can also be more optimized given access to the SLUB internals.

>> I'd be careful about the name "lockless", as that's ambiguous. Is it "mostly lockless" and therefore fast, but if the cache is empty, it will still take locks as part of refill?
>
> It is actually "mostly lockless", so it is ambiguous. Can you suggest a name, like try_lockless or anything else?

"cached" instead of "lockless"?

>> Or is it always lockless, and therefore useful in contexts that can take no locks, but then the caller has to have fallbacks in case the cache is empty and nothing is allocated?
>>
>> [1] https://lore.kernel.org/linux-mm/20100804024531.914852850@linux.com/T/#u
>
On 20/09/2021 02:53, Matthew Wilcox wrote:
> On Mon, Sep 20, 2021 at 01:09:38AM +0000, Hyeonggon Yoo wrote:
>> Hello Matthew, thanks for giving me a comment! I appreciate it.
>>
>> On Sun, Sep 19, 2021 at 08:17:44PM +0100, Matthew Wilcox wrote:
>>> On Sun, Sep 19, 2021 at 04:42:39PM +0000, Hyeonggon Yoo wrote:
>>>> It is just a simple proof of concept and not ready for submission yet. There may be wrong code (like wrong gfp flags or wrong error handling); it is just a simple proof of concept. I would like comments from you.
>>>
>>> Have you read:
>>>
>>> https://www.usenix.org/legacy/event/usenix01/full_papers/bonwick/bonwick_html/
>>> The relevant part of that paper is section 3, magazines. We should have low and high water marks for number of objects
>>

In case it's unknown, JFYI there is an implementation of this in drivers/iommu/iova.c.

Thanks,
John

>> I hadn't read that before, but after reading it, it seems no different from SLAB's per-CPU queuing.
>>
>>> and we should allocate from / free to the slab allocator in batches. Slab has bulk alloc/free APIs already.
>>>
On Mon, Sep 20, 2021 at 03:41:16PM +0100, John Garry wrote:
> On 20/09/2021 02:53, Matthew Wilcox wrote:
> > On Mon, Sep 20, 2021 at 01:09:38AM +0000, Hyeonggon Yoo wrote:
> > > Hello Matthew, thanks for giving me a comment! I appreciate it.
> > >
> > > On Sun, Sep 19, 2021 at 08:17:44PM +0100, Matthew Wilcox wrote:
> > > > On Sun, Sep 19, 2021 at 04:42:39PM +0000, Hyeonggon Yoo wrote:
> > > > > It is just a simple proof of concept and not ready for submission yet. There may be wrong code (like wrong gfp flags or wrong error handling); it is just a simple proof of concept. I would like comments from you.
> > > >
> > > > Have you read:
> > > >
> > > > https://www.usenix.org/legacy/event/usenix01/full_papers/bonwick/bonwick_html/
> > > > The relevant part of that paper is section 3, magazines. We should have low and high water marks for number of objects
> >
> In case it's unknown, JFYI there is an implementation of this in drivers/iommu/iova.c.

Thanks for the good information. I'll take a look at it!

Thanks,
Hyeonggon

> Thanks,
> John
>
> > > I hadn't read that before, but after reading it, it seems no different from SLAB's per-CPU queuing.
> > > > and we should allocate from / free to the slab allocator in batches. Slab has bulk alloc/free APIs already.
> > > >
On Mon, Sep 20, 2021 at 02:02:19PM +0200, Vlastimil Babka wrote:
> On 9/20/21 13:55, Hyeonggon Yoo wrote:
> > On Mon, Sep 20, 2021 at 11:07:36AM +0200, Vlastimil Babka wrote:
> >> I guess making it opt-in only for caches where a performance improvement was measured would make it easier to add, as for some caches it would mean no improvement, but increased memory usage. But of course it makes the API harder to use.
> >
> > Do you mean the "lockless cache" should be separate from slab because some caches don't benefit at all?
>
> I meant it seems to be a valid approach to have a special kmem_cache flag and allocation function variants, as you discussed. That covers the "some caches don't benefit at all" part while being an integral part of the allocator, so others don't have to build ad-hoc solutions on top of it, and possibly it can also be more optimized given access to the SLUB internals.

Okay! I sent RFC v2. Please check how it looks to you:
https://lore.kernel.org/linux-mm/20210920154816.31832-1-42.hyeyoo@gmail.com/T/#u

> >> I'd be careful about the name "lockless", as that's ambiguous. Is it "mostly lockless" and therefore fast, but if the cache is empty, it will still take locks as part of refill?
> >
> > It is actually "mostly lockless", so it is ambiguous. Can you suggest a name, like try_lockless or anything else?
>
> "cached" instead of "lockless"?

I added kmem_cache_alloc_cached() and kmem_cache_free_cached() in v2.

Thanks for your opinion, Vlastimil.
Hyeonggon

> >> Or is it always lockless, and therefore useful in contexts that can take no locks, but then the caller has to have fallbacks in case the cache is empty and nothing is allocated?
> >>
> >> [1] https://lore.kernel.org/linux-mm/20100804024531.914852850@linux.com/T/#u
> >
diff --git a/include/linux/lockless_cache.h b/include/linux/lockless_cache.h
new file mode 100644
index 000000000000..e64b85e869f3
--- /dev/null
+++ b/include/linux/lockless_cache.h
@@ -0,0 +1,31 @@
+#include <linux/gfp.h>
+
+struct object_list {
+	void *object;
+	struct list_head list;
+};
+
+struct freelist {
+	struct object_list *head;
+	int size;
+};
+
+struct lockless_cache {
+	struct kmem_cache *cache;
+	struct freelist __percpu *freelist;
+
+	int total_size;
+	unsigned int max;	/* maximum size for each percpu freelist */
+	unsigned int slack;	/* number of objects returning to slab when freelist is too big (> max) */
+};
+
+void lockless_cache_init(void);
+struct lockless_cache
+*lockless_cache_create(const char *name, unsigned int size, unsigned int align,
+		slab_flags_t flags, void (*ctor)(void *), unsigned int max,
+		unsigned int slack);
+
+void lockless_cache_destroy(struct lockless_cache *cache);
+void *lockless_cache_alloc(struct lockless_cache *cache, gfp_t flags);
+void lockless_cache_free(struct lockless_cache *cache, void *object);
+
diff --git a/init/main.c b/init/main.c
index 3f7216934441..c18d6421cb65 100644
--- a/init/main.c
+++ b/init/main.c
@@ -79,6 +79,7 @@
 #include <linux/async.h>
 #include <linux/shmem_fs.h>
 #include <linux/slab.h>
+#include <linux/lockless_cache.h>
 #include <linux/perf_event.h>
 #include <linux/ptrace.h>
 #include <linux/pti.h>
@@ -848,6 +849,7 @@ static void __init mm_init(void)
 	/* page_owner must be initialized after buddy is ready */
 	page_ext_init_flatmem_late();
 	kmem_cache_init();
+	lockless_cache_init();
 	kmemleak_init();
 	pgtable_init();
 	debug_objects_mem_init();
diff --git a/mm/Makefile b/mm/Makefile
index fc60a40ce954..d6c3a89ed548 100644
--- a/mm/Makefile
+++ b/mm/Makefile
@@ -52,7 +52,7 @@ obj-y := filemap.o mempool.o oom_kill.o fadvise.o \
 			   mm_init.o percpu.o slab_common.o \
 			   compaction.o vmacache.o \
 			   interval_tree.o list_lru.o workingset.o \
-			   debug.o gup.o mmap_lock.o $(mmu-y)
+			   debug.o gup.o mmap_lock.o lockless_cache.o $(mmu-y)
 
 # Give 'page_alloc' its own module-parameter namespace
 page-alloc-y := page_alloc.o
diff --git a/mm/lockless_cache.c b/mm/lockless_cache.c
new file mode 100644
index 000000000000..05b8cdb672ff
--- /dev/null
+++ b/mm/lockless_cache.c
@@ -0,0 +1,132 @@
+#include <linux/kernel.h>
+#include <linux/mm.h>
+#include <linux/slab.h>
+#include <linux/list.h>
+#include <linux/percpu-defs.h>
+#include <linux/lockless_cache.h>
+
+#ifdef CONFIG_SLUB
+#include <linux/slub_def.h>
+#elif CONFIG_SLAB
+#include <linux/slab_def.h>
+#else
+#include <linux/slob_def.h>
+#endif
+
+static struct kmem_cache *global_lockless_cache;
+static struct kmem_cache *global_list_cache;
+
+/*
+ * What should to do if initialization fails?
+ */
+void lockless_cache_init(void)
+{
+	global_lockless_cache = kmem_cache_create("global_lockless_cache", sizeof(struct lockless_cache),
+			sizeof(struct lockless_cache), 0, NULL);
+
+	global_list_cache = kmem_cache_create("global_list_cache", sizeof(struct object_list),
+			sizeof(struct object_list), 0, NULL);
+
+}
+EXPORT_SYMBOL(lockless_cache_init);
+
+struct lockless_cache
+*lockless_cache_create(const char *name, unsigned int size, unsigned int align,
+		slab_flags_t flags, void (*ctor)(void *), unsigned int max, unsigned int slack)
+{
+	int cpu;
+	struct lockless_cache *cache;
+
+	cache = kmem_cache_alloc(global_lockless_cache, GFP_KERNEL || __GFP_ZERO);
+	if (!cache)
+		return NULL;
+
+	cache->cache = kmem_cache_create(name, size, align, 0, ctor);
+	if (!cache->cache)
+		goto destroy_cache;
+
+	cache->freelist = alloc_percpu(struct freelist);
+	if (!cache->freelist)
+		goto destroy_cache;
+
+	cache->max = max;
+	cache->slack = slack;
+	cache->total_size = 0;
+
+	for_each_possible_cpu(cpu) {
+		struct freelist *freelist;
+		freelist = per_cpu_ptr(cache->freelist, cpu);
+		INIT_LIST_HEAD(&freelist->head->list);
+		freelist->size = 0;
+	}
+
+	return cache;
+
+destroy_cache:
+
+	lockless_cache_destroy(cache);
+	return cache;
+}
+EXPORT_SYMBOL(lockless_cache_create);
+
+void lockless_cache_destroy(struct lockless_cache *cache)
+{
+	int cpu;
+	struct object_list *elem;
+
+	for_each_possible_cpu(cpu) {
+		free_percpu(cache->freelist);
+		list_for_each_entry(elem, &cache->freelist->head->list, list) {
+			lockless_cache_free(cache, elem->object);
+			kmem_cache_free(global_list_cache, elem);
+		}
+	}
+
+	kmem_cache_destroy(cache->cache);
+}
+EXPORT_SYMBOL(lockless_cache_destroy);
+
+void *lockless_cache_alloc(struct lockless_cache *cache, gfp_t flags)
+{
+	struct freelist *freelist;
+	struct object_list *elem;
+
+	freelist = this_cpu_ptr(cache->freelist);
+
+	if (list_empty(&freelist->head->list)) {
+		elem = freelist->head;
+		list_del(&freelist->head->list);
+		cache->total_size--;
+		freelist->size--;
+		cache->cache->ctor(elem->object);
+	} else {
+		elem = kmem_cache_alloc(global_list_cache, flags);
+	}
+
+	return elem->object;
+}
+EXPORT_SYMBOL(lockless_cache_alloc);
+
+void lockless_cache_free(struct lockless_cache *cache, void *object)
+{
+	struct freelist *freelist;
+	struct object_list *elem;
+
+	elem = container_of(&object, struct object_list, object);
+	freelist = this_cpu_ptr(cache->freelist);
+	list_add(&freelist->head->list, &elem->list);
+	cache->total_size++;
+	freelist->size++;
+
+	/* return back to slab allocator */
+	if (freelist->size > cache->max) {
+		elem = list_last_entry(&freelist->head->list, struct object_list, list);
+		list_del(&elem->list);
+
+		kmem_cache_free(cache->cache, elem->object);
+		kmem_cache_free(global_list_cache, elem);
+		cache->total_size--;
+		freelist->size--;
+	}
+}
+EXPORT_SYMBOL(lockless_cache_free);
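For context, a caller of the interface declared in the header above would look roughly like this. The object type and the max/slack values are made up for illustration, and the RFC itself notes the implementation is only a proof of concept:

#include <linux/lockless_cache.h>
#include <linux/slab.h>

struct io_req {				/* made-up object type for the example */
	int tag;
};

static struct lockless_cache *req_cache;

static int req_cache_setup(void)
{
	/* Keep at most 32 objects per CPU; objects beyond that are returned to slab. */
	req_cache = lockless_cache_create("io_req", sizeof(struct io_req), 0,
					  0, NULL, 32, 8);
	return req_cache ? 0 : -ENOMEM;
}

static struct io_req *req_get(void)
{
	/* Fast path intended for contexts such as IO polling. */
	return lockless_cache_alloc(req_cache, GFP_ATOMIC);
}

static void req_put(struct io_req *req)
{
	lockless_cache_free(req_cache, req);
}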