diff mbox series

[RFC] Introducing lockless cache built on top of slab allocator

Message ID 20210919164239.49905-1-42.hyeyoo@gmail.com (mailing list archive)
State New
Headers show
Series [RFC] Introducing lockless cache built on top of slab allocator | expand

Commit Message

Hyeonggon Yoo Sept. 19, 2021, 4:42 p.m. UTC
This is just a simple proof of concept and not ready for submission yet.
There may be incorrect code (wrong gfp flags, wrong error handling, etc.);
it is only a proof of concept, and I would like your comments on it.

Recently the block layer implemented a per-cpu, lockless cache
on top of the slab allocator. It is useful for IO polling,
because IO polling runs with interrupts disabled.

Link: https://lwn.net/Articles/868070/
Link: https://www.spinics.net/lists/linux-block/msg71964.html

The block layer gained some IOPS increase after this.
(Note that Jens used SLUB for the performance measurements.)

This patch is a generalization of what was done in the block layer,
built on top of the slab allocator.

The lockless cache uses simple queuing to be more cache friendly,
and when the per-cpu freelist gets too big, it returns some
objects back to the slab allocator.

It seems the lockless cache could be used for the network layer's IO polling (NAPI) too.
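
For illustration, a caller on an IO-polling path could use the proposed API
roughly as follows. This is a hypothetical sketch; struct poll_req and the
64/16 numbers are made up for the example and are not part of the patch:

	struct poll_req {
		void *data;
	};

	static struct lockless_cache *poll_req_cache;

	static int poll_req_cache_init(void)
	{
		/* keep at most 64 objects per cpu, give 16 back when trimming */
		poll_req_cache = lockless_cache_create("poll_req", sizeof(struct poll_req),
						       0, 0, NULL, 64, 16);
		return poll_req_cache ? 0 : -ENOMEM;
	}

	/* called from the IO polling path, with interrupts disabled */
	static struct poll_req *poll_req_get(void)
	{
		return lockless_cache_alloc(poll_req_cache, GFP_ATOMIC);
	}

	static void poll_req_put(struct poll_req *req)
	{
		lockless_cache_free(poll_req_cache, req);
	}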

Any ideas/opinions on this?
---
 include/linux/lockless_cache.h |  31 ++++++++
 init/main.c                    |   2 +
 mm/Makefile                    |   2 +-
 mm/lockless_cache.c            | 132 +++++++++++++++++++++++++++++++++
 4 files changed, 166 insertions(+), 1 deletion(-)
 create mode 100644 include/linux/lockless_cache.h
 create mode 100644 mm/lockless_cache.c

Comments

Matthew Wilcox (Oracle) Sept. 19, 2021, 7:17 p.m. UTC | #1
On Sun, Sep 19, 2021 at 04:42:39PM +0000, Hyeonggon Yoo wrote:
> This is just a simple proof of concept and not ready for submission yet.
> There may be incorrect code (wrong gfp flags, wrong error handling, etc.);
> it is only a proof of concept, and I would like your comments on it.

Have you read:

https://www.usenix.org/legacy/event/usenix01/full_papers/bonwick/bonwick_html/

The relevant part of that paper is section 3, magazines.  We should have
low and high water marks for number of objects, and we should allocate
from / free to the slab allocator in batches.  Slab has bulk alloc/free
APIs already.
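
For reference, the existing bulk APIs are declared roughly as follows in
kernels of this era (kmem_cache_alloc_bulk() returns the number of objects
allocated, 0 on failure):

	int kmem_cache_alloc_bulk(struct kmem_cache *s, gfp_t flags, size_t size, void **p);
	void kmem_cache_free_bulk(struct kmem_cache *s, size_t size, void **p);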

I'd rather see this be part of the slab allocator than a separate API.
Hyeonggon Yoo Sept. 20, 2021, 1:09 a.m. UTC | #2
Hello Matthew, thanks for the comment! I appreciate it.

On Sun, Sep 19, 2021 at 08:17:44PM +0100, Matthew Wilcox wrote:
> On Sun, Sep 19, 2021 at 04:42:39PM +0000, Hyeonggon Yoo wrote:
> > This is just a simple proof of concept and not ready for submission yet.
> > There may be incorrect code (wrong gfp flags, wrong error handling, etc.);
> > it is only a proof of concept, and I would like your comments on it.
> 
> Have you read:
> 
> https://www.usenix.org/legacy/event/usenix01/full_papers/bonwick/bonwick_html/
> The relevant part of that paper is section 3, magazines.  We should have
> low and high water marks for number of objects

I hadn't read that before, but after reading it, it doesn't seem different from
SLAB's per-cpu queuing.
 
> and we should allocate
> from / free to the slab allocator in batches.  Slab has bulk alloc/free
> APIs already.
> 

There are kmem_cache_{alloc,free}_bulk() functions for bulk
allocation, but they are designed for large numbers of allocations
to reduce locking cost, not for per-cpu lockless allocation.

Yeah, we can implement a lockless cache using kmem_cache_{alloc,free}_bulk(),
but those functions alone are not enough.

> I'd rather see this be part of the slab allocator than a separate API.

And I disagree with this. In most situations we cannot
allocate without a lock; this is a special case for IO polling.

To make it part of the slab allocator, we would need to modify the existing
data structures. But making it part of the slab allocator would waste memory,
because most caches would not use this.
Matthew Wilcox (Oracle) Sept. 20, 2021, 1:53 a.m. UTC | #3
On Mon, Sep 20, 2021 at 01:09:38AM +0000, Hyeonggon Yoo wrote:
> Hello Matthew, thanks for the comment! I appreciate it.
> 
> On Sun, Sep 19, 2021 at 08:17:44PM +0100, Matthew Wilcox wrote:
> > On Sun, Sep 19, 2021 at 04:42:39PM +0000, Hyeonggon Yoo wrote:
> > > This is just a simple proof of concept and not ready for submission yet.
> > > There may be incorrect code (wrong gfp flags, wrong error handling, etc.);
> > > it is only a proof of concept, and I would like your comments on it.
> > 
> > Have you read:
> > 
> > https://www.usenix.org/legacy/event/usenix01/full_papers/bonwick/bonwick_html/
> > The relevant part of that paper is section 3, magazines.  We should have
> > low and high water marks for number of objects
> 
> I hadn't read that before, but after reading it, it doesn't seem different from
> SLAB's per-cpu queuing.
>  
> > and we should allocate
> > from / free to the slab allocator in batches.  Slab has bulk alloc/free
> > APIs already.
> > 
> 
> There are kmem_cache_{alloc,free}_bulk() functions for bulk
> allocation, but they are designed for large numbers of allocations
> to reduce locking cost, not for per-cpu lockless allocation.

What I'm saying is that rather than a linked list of objects, we should
have an array of, say, 15 pointers per CPU (and a count of how many
allocations we have).  If we are trying to allocate and have no objects,
call kmem_cache_alloc_bulk() for 8 objects.  If we are trying to free
and have 15 objects already, call kmem_cache_free_bulk() for the last
8 objects and set the number of allocated objects to 7.

(maybe 8 and 15 are the wrong numbers.  this is just an example)
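
A minimal sketch of the per-cpu array scheme described above, built on the
existing bulk APIs (the struct and function names and the 15/8 sizes are
illustrative only, matching the example numbers, not an existing kernel API):

	#include <linux/percpu.h>
	#include <linux/slab.h>

	#define OBJ_CACHE_MAX	15	/* per-cpu array capacity (example value) */
	#define OBJ_CACHE_BATCH	8	/* bulk refill/flush size (example value) */

	struct obj_cache {
		unsigned int nr;		/* objects currently cached on this cpu */
		void *objects[OBJ_CACHE_MAX];
	};

	/* callers are expected to stay on the cpu, e.g. IO polling with irqs off */
	static void *obj_cache_alloc(struct kmem_cache *s, struct obj_cache __percpu *pcp,
				     gfp_t flags)
	{
		struct obj_cache *c = this_cpu_ptr(pcp);

		if (!c->nr)
			c->nr = kmem_cache_alloc_bulk(s, flags, OBJ_CACHE_BATCH, c->objects);
		if (!c->nr)
			return NULL;
		return c->objects[--c->nr];
	}

	static void obj_cache_free(struct kmem_cache *s, struct obj_cache __percpu *pcp,
				   void *obj)
	{
		struct obj_cache *c = this_cpu_ptr(pcp);

		if (c->nr == OBJ_CACHE_MAX) {
			/* flush the last OBJ_CACHE_BATCH objects back to the slab allocator */
			c->nr -= OBJ_CACHE_BATCH;
			kmem_cache_free_bulk(s, OBJ_CACHE_BATCH, &c->objects[c->nr]);
		}
		c->objects[c->nr++] = obj;
	}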

> Yeah, we can implement a lockless cache using kmem_cache_{alloc,free}_bulk(),
> but those functions alone are not enough.
> 
> > I'd rather see this be part of the slab allocator than a separate API.
> 
> And I disagree with this. In most situations we cannot
> allocate without a lock; this is a special case for IO polling.
> 
> To make it part of the slab allocator, we would need to modify the existing
> data structures. But making it part of the slab allocator would waste memory,
> because most caches would not use this.

Oh, it would have to be an option.  Maybe as a new slab_flags_t flag.
Or maybe a kmem_cache_alloc_percpu_lockless().
Hyeonggon Yoo Sept. 20, 2021, 2:54 a.m. UTC | #4
On Mon, Sep 20, 2021 at 02:53:34AM +0100, Matthew Wilcox wrote:
> On Mon, Sep 20, 2021 at 01:09:38AM +0000, Hyeonggon Yoo wrote:
> > Hello Matthew, thanks for the comment! I appreciate it.
> > 
> > On Sun, Sep 19, 2021 at 08:17:44PM +0100, Matthew Wilcox wrote:
> > > On Sun, Sep 19, 2021 at 04:42:39PM +0000, Hyeonggon Yoo wrote:
> > > > This is just a simple proof of concept and not ready for submission yet.
> > > > There may be incorrect code (wrong gfp flags, wrong error handling, etc.);
> > > > it is only a proof of concept, and I would like your comments on it.
> > > 
> > > Have you read:
> > > 
> > > https://www.usenix.org/legacy/event/usenix01/full_papers/bonwick/bonwick_html/
> > > The relevant part of that paper is section 3, magazines.  We should have
> > > low and high water marks for number of objects
> > 
> > I hadn't read that before, but after reading it, it doesn't seem different from
> > SLAB's per-cpu queuing.
> >  
> > > and we should allocate
> > > from / free to the slab allocator in batches.  Slab has bulk alloc/free
> > > APIs already.
> > > 
> > 
> > There are kmem_cache_{alloc,free}_bulk() functions for bulk
> > allocation, but they are designed for large numbers of allocations
> > to reduce locking cost, not for per-cpu lockless allocation.
> 
> What I'm saying is that rather than a linked list of objects, we should
> have an array of, say, 15 pointers per CPU (and a count of how many
> allocations we have).  If we are trying to allocate and have no objects,
> call kmem_cache_alloc_bulk() for 8 objects.  If we are trying to free
> and have 15 objects already, call kmem_cache_free_bulk() for the last
> 8 objects and set the number of allocated objects to 7.
> 
> (maybe 8 and 15 are the wrong numbers.  this is just an example)
>

Ah, okay. It seems better to use an array. Using a cache for the list
nodes is an unnecessary cost; an array is simpler.

> > Yeah, we can implement a lockless cache using kmem_cache_{alloc,free}_bulk(),
> > but those functions alone are not enough.
> > 
> > > I'd rather see this be part of the slab allocator than a separate API.
> > 
> > And I disagree with this. In most situations we cannot
> > allocate without a lock; this is a special case for IO polling.
> > 
> > To make it part of the slab allocator, we would need to modify the existing
> > data structures. But making it part of the slab allocator would waste memory,
> > because most caches would not use this.
> 
> Oh, it would have to be an option.  Maybe as a new slab_flags_t flag.
> Or maybe a kmem_cache_alloc_percpu_lockless().

Oh, now I get what you mean. That is a good improvement!

For example,
there would be a slab_flags_t flag like SLAB_LOCKLESS,
and a cache created with the SLAB_LOCKLESS flag could allocate
using either kmem_cache_alloc() or kmem_cache_alloc_percpu_lockless(),
depending on the situation? (I suggest kmem_cache_alloc_lockless() as a better name.)

That seems MUCH better (because it avoids duplicating a cache).
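
To make the idea concrete, a user of that opt-in flag might look roughly like
this (SLAB_LOCKLESS and kmem_cache_alloc_lockless() are hypothetical names
from this discussion, not an existing API):

	struct kmem_cache *cache;
	struct foo *obj;

	/* opt in at cache creation time (hypothetical flag) */
	cache = kmem_cache_create("foo_cache", sizeof(struct foo), 0,
				  SLAB_LOCKLESS, NULL);

	/* regular allocation still works as usual */
	obj = kmem_cache_alloc(cache, GFP_KERNEL);

	/* lockless fast path for contexts like IO polling (hypothetical function) */
	obj = kmem_cache_alloc_lockless(cache, GFP_ATOMIC);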

I'll send RFC v2 soon.
Thank you so much, Matthew.

If I have misunderstood anything, please let me know.

Thanks,
Hyeonggon Yoo
Vlastimil Babka Sept. 20, 2021, 9:07 a.m. UTC | #5
On 9/20/21 03:53, Matthew Wilcox wrote:
> On Mon, Sep 20, 2021 at 01:09:38AM +0000, Hyeonggon Yoo wrote:
>> Hello Matthew, thanks for the comment! I appreciate it.
>> Yeah, we can implement a lockless cache using kmem_cache_{alloc,free}_bulk(),
>> but those functions alone are not enough.
>> 
>> > I'd rather see this be part of the slab allocator than a separate API.
>> 
>> And I disagree with this. In most situations we cannot
>> allocate without a lock; this is a special case for IO polling.
>> 
>> To make it part of the slab allocator, we would need to modify the existing
>> data structures. But making it part of the slab allocator would waste memory,
>> because most caches would not use this.
> 
> Oh, it would have to be an option.  Maybe as a new slab_flags_t flag.
> Or maybe a kmem_cache_alloc_percpu_lockless().

I've recently found out that similar attempts (introducing queueing to SLUB)
were made around 2010. See e.g. [1], but there will be other threads to
search at lore too. I haven't checked yet why it wasn't ultimately merged; I
guess Christoph and David could remember (this was before my time).

I guess making it opt-in only for caches where a performance improvement was
measured would make it easier to add, as for some caches it would mean no
improvement, but increased memory usage. But of course it makes the API
harder to use.

I'd be careful about the name "lockless", as that's ambiguous. Is it "mostly
lockless" therefore fast, but if the cache is empty, it will still take
locks as part of refill? Or is it lockless always, therefore useful in
contexts that can take no locks, but then the caller has to have fallbacks
in case the cache is empty and nothing is allocated?
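
In the latter interpretation, every caller would need its own fallback,
roughly like this (kmem_cache_alloc_lockless() is the hypothetical name
discussed above):

	obj = kmem_cache_alloc_lockless(cache, GFP_ATOMIC);
	if (!obj) {
		/*
		 * The per-cpu cache was empty and no refill was allowed here:
		 * the caller must fall back, e.g. defer the work or use a
		 * preallocated emergency object.
		 */
	}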

[1] https://lore.kernel.org/linux-mm/20100804024531.914852850@linux.com/T/#u
Hyeonggon Yoo Sept. 20, 2021, 11:55 a.m. UTC | #6
On Mon, Sep 20, 2021 at 11:07:36AM +0200, Vlastimil Babka wrote:
> On 9/20/21 03:53, Matthew Wilcox wrote:
> > On Mon, Sep 20, 2021 at 01:09:38AM +0000, Hyeonggon Yoo wrote:
> >> Hello Matthew, thanks for the comment! I appreciate it.
> >> Yeah, we can implement a lockless cache using kmem_cache_{alloc,free}_bulk(),
> >> but those functions alone are not enough.
> >> 
> >> > I'd rather see this be part of the slab allocator than a separate API.
> >> 
> >> And I disagree with this. In most situations we cannot
> >> allocate without a lock; this is a special case for IO polling.
> >> 
> >> To make it part of the slab allocator, we would need to modify the existing
> >> data structures. But making it part of the slab allocator would waste memory,
> >> because most caches would not use this.
> > 
> > Oh, it would have to be an option.  Maybe as a new slab_flags_t flag.
> > Or maybe a kmem_cache_alloc_percpu_lockless().
> 
> I've recently found out that similar attempts (introducing queueing to SLUB)
> were made around 2010. See e.g. [1], but there will be other threads to
> search at lore too. I haven't checked yet why it wasn't ultimately merged;
> I guess Christoph and David could remember (this was before my time).

There was an attempt at SLUB with queueing, as you said.
I searched a bit and found [2] and [3].

- SLUB with queueing (V2) beats SLAB netperf TCP_RR, 2010-07
[2] https://lore.kernel.org/lkml/alpine.DEB.2.00.1007121010420.14328@router.home/T/#m5a31c7caa28b93a00de3af6d547b79273449f5ba

- The Unified slab allocator (V4), 2010-10
[3] https://linux-mm.kvack.narkive.com/e595iCuz/unifiedv4-00-16-the-unified-slab-allocator-v4#post47

Looking at [3], there was still some regression when comparing "SLUB with queueing"
with SLAB. And I couldn't find a patch series after [3] yet. I'll add a link
if I find one.

> I guess making it opt-in only for caches where a performance improvement was
> measured would make it easier to add, as for some caches it would mean no
> improvement, but increased memory usage. But of course it makes the API
> harder to use.

Do you mean "lockless cache" it should be separate from slab because some caches
doesn't benefit at all?

> I'd be careful about the name "lockless", as that's ambiguous. Is it "mostly
> lockless" therefore fast, but if the cache is empty, it will still take
> locks as part of refill?

It is actually "mostly lockless" so it is ambiguous.
Can you suggest a name? like try_lockless or anything?

> Or is it lockless always, therefore useful in
> contexts that can take no locks, but then the caller has to have fallbacks
> in case the cache is empty and nothing is allocated?
> 
> [1] https://lore.kernel.org/linux-mm/20100804024531.914852850@linux.com/T/#u
Vlastimil Babka Sept. 20, 2021, 12:02 p.m. UTC | #7
On 9/20/21 13:55, Hyeonggon Yoo wrote:
> On Mon, Sep 20, 2021 at 11:07:36AM +0200, Vlastimil Babka wrote:
>> I guess making it opt-in only for caches where a performance improvement was
>> measured would make it easier to add, as for some caches it would mean no
>> improvement, but increased memory usage. But of course it makes the API
>> harder to use.
> 
> Do you mean the "lockless cache" should be separate from slab because some caches
> don't benefit at all?

I meant it seems to be a valid approach to have a special kmem_cache flag
and allocation function variants, as you discussed. That covers the "some
caches don't benefit at all" case while being an integral part of the allocator,
so others don't have to build ad-hoc solutions on top of it, and possibly it
can also be more optimized given access to the SLUB internals.

>> I'd be careful about the name "lockless", as that's ambiguous. Is it "mostly
>> lockless" therefore fast, but if the cache is empty, it will still take
>> locks as part of refill?
> 
> It is actually "mostly lockless" so it is ambiguous.
> Can you suggest a name? like try_lockless or anything?

"cached" instead of "lockless" ?

>> Or is it lockless always, therefore useful in
>> contexts that can take no locks, but then the caller has to have fallbacks
>> in case the cache is empty and nothing is allocated?
>> 
>> [1] https://lore.kernel.org/linux-mm/20100804024531.914852850@linux.com/T/#u
>
John Garry Sept. 20, 2021, 2:41 p.m. UTC | #8
On 20/09/2021 02:53, Matthew Wilcox wrote:
> On Mon, Sep 20, 2021 at 01:09:38AM +0000, Hyeonggon Yoo wrote:
>> Hello Matthew, thanks for the comment! I appreciate it.
>>
>> On Sun, Sep 19, 2021 at 08:17:44PM +0100, Matthew Wilcox wrote:
>>> On Sun, Sep 19, 2021 at 04:42:39PM +0000, Hyeonggon Yoo wrote:
>>>> This is just a simple proof of concept and not ready for submission yet.
>>>> There may be incorrect code (wrong gfp flags, wrong error handling, etc.);
>>>> it is only a proof of concept, and I would like your comments on it.
>>>
>>> Have you read:
>>>
>>> https://www.usenix.org/legacy/event/usenix01/full_papers/bonwick/bonwick_html/
>>> The relevant part of that paper is section 3, magazines.  We should have
>>> low and high water marks for number of objects
>>

In case it's unknown, JFYI there is an implementation of this in
drivers/iommu/iova.c.

Thanks,
John

>> I hadn't read that before, but after reading it, it doesn't seem different from
>> SLAB's per-cpu queuing.
>>   
>>> and we should allocate
>>> from / free to the slab allocator in batches.  Slab has bulk alloc/free
>>> APIs already.
>>>
Hyeonggon Yoo Sept. 20, 2021, 3:50 p.m. UTC | #9
On Mon, Sep 20, 2021 at 03:41:16PM +0100, John Garry wrote:
> On 20/09/2021 02:53, Matthew Wilcox wrote:
> > On Mon, Sep 20, 2021 at 01:09:38AM +0000, Hyeonggon Yoo wrote:
> > > Hello Matthew, thanks for the comment! I appreciate it.
> > > 
> > > On Sun, Sep 19, 2021 at 08:17:44PM +0100, Matthew Wilcox wrote:
> > > > On Sun, Sep 19, 2021 at 04:42:39PM +0000, Hyeonggon Yoo wrote:
> > > > > This is just a simple proof of concept and not ready for submission yet.
> > > > > There may be incorrect code (wrong gfp flags, wrong error handling, etc.);
> > > > > it is only a proof of concept, and I would like your comments on it.
> > > > 
> > > > Have you read:
> > > > 
> > > > https://www.usenix.org/legacy/event/usenix01/full_papers/bonwick/bonwick_html/
> > > > The relevant part of that paper is section 3, magazines.  We should have
> > > > low and high water marks for number of objects
> > > 
> 
> In case it's unknown, JFYI there is an implementation of this in
> drivers/iommu/iova.c.

Thanks for the good information. I'll take a look at it!

Thanks,
Hyeonggon

> Thanks,
> John
> 
> > > I hadn't read that before, but after reading it, it doesn't seem different from
> > > SLAB's per-cpu queuing.
> > > > and we should allocate
> > > > from / free to the slab allocator in batches.  Slab has bulk alloc/free
> > > > APIs already.
> > > >
Hyeonggon Yoo Sept. 20, 2021, 3:55 p.m. UTC | #10
On Mon, Sep 20, 2021 at 02:02:19PM +0200, Vlastimil Babka wrote:
> On 9/20/21 13:55, Hyeonggon Yoo wrote:
> > On Mon, Sep 20, 2021 at 11:07:36AM +0200, Vlastimil Babka wrote:
> >> I guess making it opt-in only for caches where a performance improvement was
> >> measured would make it easier to add, as for some caches it would mean no
> >> improvement, but increased memory usage. But of course it makes the API
> >> harder to use.
> > 
> > Do you mean the "lockless cache" should be separate from slab because some caches
> > don't benefit at all?
> 
> I meant it seems to be a valid approach to have a special kmem_cache flag
> and allocation function variants, as you discussed. That covers the "some
> caches don't benefit at all" case while being an integral part of the allocator,
> so others don't have to build ad-hoc solutions on top of it, and possibly it
> can also be more optimized given access to the SLUB internals.

Okay! I sent RFC v2; please check how it looks to you:
	https://lore.kernel.org/linux-mm/20210920154816.31832-1-42.hyeyoo@gmail.com/T/#u

> >> I'd be careful about the name "lockless", as that's ambiguous. Is it "mostly
> >> lockless" therefore fast, but if the cache is empty, it will still take
> >> locks as part of refill?
> > 
> > It is actually "mostly lockless" so it is ambiguous.
> > Can you suggest a name? like try_lockless or anything?
> 
> "cached" instead of "lockless" ?
>

I added kmem_cache_alloc_cached() and kmem_cache_free_cached() in v2.

Thanks for your opinion, Vlastimil.
Hyeonggon.

> >> Or is it lockless always, therefore useful in
> >> contexts that can take no locks, but then the caller has to have fallbacks
> >> in case the cache is empty and nothing is allocated?
> >> 
> >> [1] https://lore.kernel.org/linux-mm/20100804024531.914852850@linux.com/T/#u
> >
diff mbox series

Patch

diff --git a/include/linux/lockless_cache.h b/include/linux/lockless_cache.h
new file mode 100644
index 000000000000..e64b85e869f3
--- /dev/null
+++ b/include/linux/lockless_cache.h
@@ -0,0 +1,31 @@ 
+#ifndef _LINUX_LOCKLESS_CACHE_H
+#define _LINUX_LOCKLESS_CACHE_H
+
+#include <linux/gfp.h>
+#include <linux/list.h>
+#include <linux/slab.h>
+
+/* a freed object queued on a per-cpu freelist */
+struct object_list {
+	void *object;
+	struct list_head list;
+};
+
+struct freelist {
+	struct list_head head;	/* list of cached objects on this cpu */
+	int size;		/* number of objects on this cpu's list */
+};
+
+struct lockless_cache {
+	struct kmem_cache *cache;
+	struct freelist __percpu *freelist;
+
+	int total_size;
+	unsigned int max; /* maximum size for each percpu freelist */
+	unsigned int slack; /* number of objects returning to slab when freelist is too big (> max) */
+};
+
+void lockless_cache_init(void);
+struct lockless_cache
+*lockless_cache_create(const char *name, unsigned int size, unsigned int align,
+		slab_flags_t flags, void (*ctor)(void *), unsigned int max,
+		unsigned int slack);
+
+void lockless_cache_destroy(struct lockless_cache *cache);
+void *lockless_cache_alloc(struct lockless_cache *cache, gfp_t flags);
+void lockless_cache_free(struct lockless_cache *cache, void *object);
+
+#endif /* _LINUX_LOCKLESS_CACHE_H */
diff --git a/init/main.c b/init/main.c
index 3f7216934441..c18d6421cb65 100644
--- a/init/main.c
+++ b/init/main.c
@@ -79,6 +79,7 @@ 
 #include <linux/async.h>
 #include <linux/shmem_fs.h>
 #include <linux/slab.h>
+#include <linux/lockless_cache.h>
 #include <linux/perf_event.h>
 #include <linux/ptrace.h>
 #include <linux/pti.h>
@@ -848,6 +849,7 @@  static void __init mm_init(void)
 	/* page_owner must be initialized after buddy is ready */
 	page_ext_init_flatmem_late();
 	kmem_cache_init();
+	lockless_cache_init();
 	kmemleak_init();
 	pgtable_init();
 	debug_objects_mem_init();
diff --git a/mm/Makefile b/mm/Makefile
index fc60a40ce954..d6c3a89ed548 100644
--- a/mm/Makefile
+++ b/mm/Makefile
@@ -52,7 +52,7 @@  obj-y			:= filemap.o mempool.o oom_kill.o fadvise.o \
 			   mm_init.o percpu.o slab_common.o \
 			   compaction.o vmacache.o \
 			   interval_tree.o list_lru.o workingset.o \
-			   debug.o gup.o mmap_lock.o $(mmu-y)
+			   debug.o gup.o mmap_lock.o lockless_cache.o $(mmu-y)
 
 # Give 'page_alloc' its own module-parameter namespace
 page-alloc-y := page_alloc.o
diff --git a/mm/lockless_cache.c b/mm/lockless_cache.c
new file mode 100644
index 000000000000..05b8cdb672ff
--- /dev/null
+++ b/mm/lockless_cache.c
@@ -0,0 +1,132 @@ 
+#include <linux/kernel.h>
+#include <linux/mm.h>
+#include <linux/slab.h>
+#include <linux/list.h>
+#include <linux/percpu-defs.h>
+#include <linux/lockless_cache.h>
+
+#ifdef CONFIG_SLUB
+#include <linux/slub_def.h>
+#elif defined(CONFIG_SLAB)
+#include <linux/slab_def.h>
+#endif
+
+static struct kmem_cache *global_lockless_cache;
+static struct kmem_cache *global_list_cache;
+
+/*
+ * What should we do if initialization fails?
+ */
+void lockless_cache_init(void)
+{
+	/* caches for the lockless_cache descriptors and the freelist nodes */
+	global_lockless_cache = kmem_cache_create("global_lockless_cache",
+			sizeof(struct lockless_cache), 0, 0, NULL);
+
+	global_list_cache = kmem_cache_create("global_list_cache",
+			sizeof(struct object_list), 0, 0, NULL);
+}
+EXPORT_SYMBOL(lockless_cache_init);
+
+struct lockless_cache
+*lockless_cache_create(const char *name, unsigned int size, unsigned int align,
+		slab_flags_t flags, void (*ctor)(void *), unsigned int max, unsigned int slack)
+{
+	int cpu;
+	struct lockless_cache *cache;
+
+	cache = kmem_cache_alloc(global_lockless_cache, GFP_KERNEL | __GFP_ZERO);
+	if (!cache)
+		return NULL;
+
+	cache->cache = kmem_cache_create(name, size, align, flags, ctor);
+	if (!cache->cache)
+		goto destroy_cache;
+
+	cache->freelist = alloc_percpu(struct freelist);
+	if (!cache->freelist)
+		goto destroy_cache;
+
+	cache->max = max;
+	cache->slack = slack;
+	cache->total_size = 0;
+
+	for_each_possible_cpu(cpu) {
+		struct freelist *freelist;
+
+		freelist = per_cpu_ptr(cache->freelist, cpu);
+		INIT_LIST_HEAD(&freelist->head);
+		freelist->size = 0;
+	}
+
+	return cache;
+
+destroy_cache:
+	lockless_cache_destroy(cache);
+	return NULL;
+}
+EXPORT_SYMBOL(lockless_cache_create);
+
+void lockless_cache_destroy(struct lockless_cache *cache)
+{
+	int cpu;
+	struct object_list *elem, *tmp;
+
+	if (cache->freelist) {
+		/* drain every cpu's freelist back to the slab allocator */
+		for_each_possible_cpu(cpu) {
+			struct freelist *freelist = per_cpu_ptr(cache->freelist, cpu);
+
+			list_for_each_entry_safe(elem, tmp, &freelist->head, list) {
+				list_del(&elem->list);
+				kmem_cache_free(cache->cache, elem->object);
+				kmem_cache_free(global_list_cache, elem);
+			}
+		}
+		free_percpu(cache->freelist);
+	}
+
+	kmem_cache_destroy(cache->cache);
+	kmem_cache_free(global_lockless_cache, cache);
+}
+EXPORT_SYMBOL(lockless_cache_destroy);
+
+void *lockless_cache_alloc(struct lockless_cache *cache, gfp_t flags)
+{
+	struct freelist *freelist;
+	struct object_list *elem;
+	void *object;
+
+	freelist = this_cpu_ptr(cache->freelist);
+
+	if (!list_empty(&freelist->head)) {
+		/* fast path: take a cached object from this cpu's freelist */
+		elem = list_first_entry(&freelist->head, struct object_list, list);
+		list_del(&elem->list);
+		cache->total_size--;
+		freelist->size--;
+		object = elem->object;
+		kmem_cache_free(global_list_cache, elem);
+		if (cache->cache->ctor)
+			cache->cache->ctor(object);
+		return object;
+	}
+
+	/* slow path: the freelist is empty, fall back to the slab allocator */
+	return kmem_cache_alloc(cache->cache, flags);
+}
+EXPORT_SYMBOL(lockless_cache_alloc);
+
+void lockless_cache_free(struct lockless_cache *cache, void *object)
+{
+	struct freelist *freelist;
+	struct object_list *elem;
+
+	/* wrap the object in a freelist node and queue it on this cpu */
+	elem = kmem_cache_alloc(global_list_cache, GFP_ATOMIC);
+	if (!elem) {
+		/* cannot queue it, give the object back to the slab allocator */
+		kmem_cache_free(cache->cache, object);
+		return;
+	}
+
+	elem->object = object;
+	freelist = this_cpu_ptr(cache->freelist);
+	list_add(&elem->list, &freelist->head);
+	cache->total_size++;
+	freelist->size++;
+
+	/* the freelist got too big, return the oldest object to the slab allocator */
+	if (freelist->size > cache->max) {
+		elem = list_last_entry(&freelist->head, struct object_list, list);
+		list_del(&elem->list);
+
+		kmem_cache_free(cache->cache, elem->object);
+		kmem_cache_free(global_list_cache, elem);
+		cache->total_size--;
+		freelist->size--;
+	}
+}
+EXPORT_SYMBOL(lockless_cache_free);