
[1/2] objpool: enable inlining objpool_push() and objpool_pop() operations

Message ID 20240424215214.3956041-2-andrii@kernel.org (mailing list archive)
State Accepted
Commit a3b00f10da808bd4a354f890b551cba471082d0e
Series Objpool performance improvements

Commit Message

Andrii Nakryiko April 24, 2024, 9:52 p.m. UTC
objpool_push() and objpool_pop() are very performance-critical functions
and can be called very frequently in the kretprobe triggering path.

As such, it makes sense to allow the compiler to inline them completely to
eliminate function call overhead. Luckily, their logic is quite well
isolated and doesn't have any sprawling dependencies.

This patch moves both objpool_push() and objpool_pop() into
include/linux/objpool.h and marks them as static inline functions,
enabling inlining. To avoid anyone using the internal helpers
(objpool_try_get_slot, objpool_try_add_slot), they are renamed with
leading underscores.

We used the kretprobe microbenchmark from BPF selftests (the bench trig-kprobe
and trig-kprobe-multi benchmarks) running no-op BPF kretprobe/kretprobe.multi
programs in a tight loop to evaluate the effect. BPF's own overhead in
this case is minimal and it mostly stresses the rest of the in-kernel
kretprobe infrastructure. Results are in millions of calls per
second. This is not super scientific, but it shows the trend nevertheless.

BEFORE
======
kretprobe      :    9.794 ± 0.086M/s
kretprobe-multi:   10.219 ± 0.032M/s

AFTER
=====
kretprobe      :    9.937 ± 0.174M/s (+1.5%)
kretprobe-multi:   10.440 ± 0.108M/s (+2.2%)

Cc: Matt (Qiang) Wu <wuqiang.matt@bytedance.com>
Signed-off-by: Andrii Nakryiko <andrii@kernel.org>
---
 include/linux/objpool.h | 101 +++++++++++++++++++++++++++++++++++++++-
 lib/objpool.c           | 100 ---------------------------------------
 2 files changed, 99 insertions(+), 102 deletions(-)
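
For quick reference before the discussion below, the shape of the change for
objpool_push() (copied from the diff at the bottom of this page):

/* before: declaration only, the body lives out-of-line in lib/objpool.c */
int objpool_push(void *obj, struct objpool_head *pool);

/* after: the body sits in include/linux/objpool.h, so callers can inline it */
static inline int objpool_push(void *obj, struct objpool_head *pool)
{
	unsigned long flags;
	int rc;

	/* disable local irq to avoid preemption & interruption */
	raw_local_irq_save(flags);
	rc = __objpool_try_add_slot(obj, pool, raw_smp_processor_id());
	raw_local_irq_restore(flags);

	return rc;
}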

Comments

Vlastimil Babka May 7, 2024, 1:55 p.m. UTC | #1
On 4/24/24 11:52 PM, Andrii Nakryiko wrote:
> objpool_push() and objpool_pop() are very performance-critical functions
> and can be called very frequently in kretprobe triggering path.
> 
> As such, it makes sense to allow compiler to inline them completely to
> eliminate function calls overhead. Luckily, their logic is quite well
> isolated and doesn't have any sprawling dependencies.
> 
> This patch moves both objpool_push() and objpool_pop() into
> include/linux/objpool.h and marks them as static inline functions,
> enabling inlining. To avoid anyone using internal helpers
> (objpool_try_get_slot, objpool_try_add_slot), rename them to use leading
> underscores.
> 
> We used kretprobe microbenchmark from BPF selftests (bench trig-kprobe
> and trig-kprobe-multi benchmarks) running no-op BPF kretprobe/kretprobe.multi
> programs in a tight loop to evaluate the effect. BPF own overhead in
> this case is minimal and it mostly stresses the rest of in-kernel
> kretprobe infrastructure overhead. Results are in millions of calls per
> second. This is not super scientific, but shows the trend nevertheless.
> 
> BEFORE
> ======
> kretprobe      :    9.794 ± 0.086M/s
> kretprobe-multi:   10.219 ± 0.032M/s
> 
> AFTER
> =====
> kretprobe      :    9.937 ± 0.174M/s (+1.5%)
> kretprobe-multi:   10.440 ± 0.108M/s (+2.2%)
> 
> Cc: Matt (Qiang) Wu <wuqiang.matt@bytedance.com>
> Signed-off-by: Andrii Nakryiko <andrii@kernel.org>

Hello,

this question is not specific to your patch, but since it's a recent thread,
I'll ask it here instead of digging up the original objpool patches.

I'm trying to understand how objpool works and if it could be integrated
into SLUB, for the LSF/MM discussion next week:

https://lore.kernel.org/all/b929d5fb-8e88-4f23-8ec7-6bdaf61f84f9@suse.cz/

> +/* adding object to slot, abort if the slot was already full */

I don't see any actual abort in the code (neither in this code nor in the
deleted code - it's the same code, just moved for inlining purposes).

> +static inline int
> +__objpool_try_add_slot(void *obj, struct objpool_head *pool, int cpu)
> +{
> +	struct objpool_slot *slot = pool->cpu_slots[cpu];
> +	uint32_t head, tail;
> +
> +	/* loading tail and head as a local snapshot, tail first */
> +	tail = READ_ONCE(slot->tail);
> +
> +	do {
> +		head = READ_ONCE(slot->head);
> +		/* fault caught: something must be wrong */
> +		WARN_ON_ONCE(tail - head > pool->nr_objs);

So this will only WARN if we go over the capacity, but continue and
overwrite a pointer that was already there, effectively leaking said object, no?

> +	} while (!try_cmpxchg_acquire(&slot->tail, &tail, tail + 1));
> +
> +	/* now the tail position is reserved for the given obj */
> +	WRITE_ONCE(slot->entries[tail & slot->mask], obj);
> +	/* update sequence to make this obj available for pop() */
> +	smp_store_release(&slot->last, tail + 1);
> +
> +	return 0;
> +}
>  
>  /**
>   * objpool_push() - reclaim the object and return back to objpool
> @@ -134,7 +219,19 @@ void *objpool_pop(struct objpool_head *pool);
>   * return: 0 or error code (it fails only when user tries to push
>   * the same object multiple times or wrong "objects" into objpool)
>   */
> -int objpool_push(void *obj, struct objpool_head *pool);
> +static inline int objpool_push(void *obj, struct objpool_head *pool)
> +{
> +	unsigned long flags;
> +	int rc;
> +
> +	/* disable local irq to avoid preemption & interruption */
> +	raw_local_irq_save(flags);
> +	rc = __objpool_try_add_slot(obj, pool, raw_smp_processor_id());

And IIUC, we could in theory objpool_pop() on one cpu, then later another
cpu might do objpool_push() and cause the latter cpu's pool to go over
capacity? Are there some implicit requirements on objpool users to take care
of having a matched cpu for pop and push? Are the current objpool users
obeying this requirement? (I can see the selftests do, not sure about the
actual users.)
Or am I missing something? Thanks.
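
For concreteness, the pattern I have in mind (an illustrative sketch only,
not taken from any existing objpool user; the two helpers are made up):

/* runs on one cpu, e.g. cpu 0 */
static void *take_object(struct objpool_head *pool)
{
	return objpool_pop(pool);	/* may drain any cpu's slot, not just the local one */
}

/* runs later on a different cpu, e.g. cpu 1 */
static void give_back_object(struct objpool_head *pool, void *obj)
{
	objpool_push(obj, pool);	/* always lands in the current cpu's local slot */
}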

> +	raw_local_irq_restore(flags);
> +
> +	return rc;
> +}
> +
>  
>  /**
>   * objpool_drop() - discard the object and deref objpool
> diff --git a/lib/objpool.c b/lib/objpool.c
> index cfdc02420884..f696308fc026 100644
> --- a/lib/objpool.c
> +++ b/lib/objpool.c
> @@ -152,106 +152,6 @@ int objpool_init(struct objpool_head *pool, int nr_objs, int object_size,
>  }
>  EXPORT_SYMBOL_GPL(objpool_init);
>  
> -/* adding object to slot, abort if the slot was already full */
> -static inline int
> -objpool_try_add_slot(void *obj, struct objpool_head *pool, int cpu)
> -{
> -	struct objpool_slot *slot = pool->cpu_slots[cpu];
> -	uint32_t head, tail;
> -
> -	/* loading tail and head as a local snapshot, tail first */
> -	tail = READ_ONCE(slot->tail);
> -
> -	do {
> -		head = READ_ONCE(slot->head);
> -		/* fault caught: something must be wrong */
> -		WARN_ON_ONCE(tail - head > pool->nr_objs);
> -	} while (!try_cmpxchg_acquire(&slot->tail, &tail, tail + 1));
> -
> -	/* now the tail position is reserved for the given obj */
> -	WRITE_ONCE(slot->entries[tail & slot->mask], obj);
> -	/* update sequence to make this obj available for pop() */
> -	smp_store_release(&slot->last, tail + 1);
> -
> -	return 0;
> -}
> -
> -/* reclaim an object to object pool */
> -int objpool_push(void *obj, struct objpool_head *pool)
> -{
> -	unsigned long flags;
> -	int rc;
> -
> -	/* disable local irq to avoid preemption & interruption */
> -	raw_local_irq_save(flags);
> -	rc = objpool_try_add_slot(obj, pool, raw_smp_processor_id());
> -	raw_local_irq_restore(flags);
> -
> -	return rc;
> -}
> -EXPORT_SYMBOL_GPL(objpool_push);
> -
> -/* try to retrieve object from slot */
> -static inline void *objpool_try_get_slot(struct objpool_head *pool, int cpu)
> -{
> -	struct objpool_slot *slot = pool->cpu_slots[cpu];
> -	/* load head snapshot, other cpus may change it */
> -	uint32_t head = smp_load_acquire(&slot->head);
> -
> -	while (head != READ_ONCE(slot->last)) {
> -		void *obj;
> -
> -		/*
> -		 * data visibility of 'last' and 'head' could be out of
> -		 * order since memory updating of 'last' and 'head' are
> -		 * performed in push() and pop() independently
> -		 *
> -		 * before any retrieving attempts, pop() must guarantee
> -		 * 'last' is behind 'head', that is to say, there must
> -		 * be available objects in slot, which could be ensured
> -		 * by condition 'last != head && last - head <= nr_objs'
> -		 * that is equivalent to 'last - head - 1 < nr_objs' as
> -		 * 'last' and 'head' are both unsigned int32
> -		 */
> -		if (READ_ONCE(slot->last) - head - 1 >= pool->nr_objs) {
> -			head = READ_ONCE(slot->head);
> -			continue;
> -		}
> -
> -		/* obj must be retrieved before moving forward head */
> -		obj = READ_ONCE(slot->entries[head & slot->mask]);
> -
> -		/* move head forward to mark it's consumption */
> -		if (try_cmpxchg_release(&slot->head, &head, head + 1))
> -			return obj;
> -	}
> -
> -	return NULL;
> -}
> -
> -/* allocate an object from object pool */
> -void *objpool_pop(struct objpool_head *pool)
> -{
> -	void *obj = NULL;
> -	unsigned long flags;
> -	int i, cpu;
> -
> -	/* disable local irq to avoid preemption & interruption */
> -	raw_local_irq_save(flags);
> -
> -	cpu = raw_smp_processor_id();
> -	for (i = 0; i < num_possible_cpus(); i++) {
> -		obj = objpool_try_get_slot(pool, cpu);
> -		if (obj)
> -			break;
> -		cpu = cpumask_next_wrap(cpu, cpu_possible_mask, -1, 1);
> -	}
> -	raw_local_irq_restore(flags);
> -
> -	return obj;
> -}
> -EXPORT_SYMBOL_GPL(objpool_pop);
> -
>  /* release whole objpool forcely */
>  void objpool_free(struct objpool_head *pool)
>  {
wuqiang.matt May 10, 2024, 7:59 a.m. UTC | #2
On 2024/5/7 21:55, Vlastimil Babka wrote:
> On 4/24/24 11:52 PM, Andrii Nakryiko wrote:
>> objpool_push() and objpool_pop() are very performance-critical functions
>> and can be called very frequently in kretprobe triggering path.
>>
>> As such, it makes sense to allow compiler to inline them completely to
>> eliminate function calls overhead. Luckily, their logic is quite well
>> isolated and doesn't have any sprawling dependencies.
>>
>> This patch moves both objpool_push() and objpool_pop() into
>> include/linux/objpool.h and marks them as static inline functions,
>> enabling inlining. To avoid anyone using internal helpers
>> (objpool_try_get_slot, objpool_try_add_slot), rename them to use leading
>> underscores.
>>
>> We used kretprobe microbenchmark from BPF selftests (bench trig-kprobe
>> and trig-kprobe-multi benchmarks) running no-op BPF kretprobe/kretprobe.multi
>> programs in a tight loop to evaluate the effect. BPF own overhead in
>> this case is minimal and it mostly stresses the rest of in-kernel
>> kretprobe infrastructure overhead. Results are in millions of calls per
>> second. This is not super scientific, but shows the trend nevertheless.
>>
>> BEFORE
>> ======
>> kretprobe      :    9.794 ± 0.086M/s
>> kretprobe-multi:   10.219 ± 0.032M/s
>>
>> AFTER
>> =====
>> kretprobe      :    9.937 ± 0.174M/s (+1.5%)
>> kretprobe-multi:   10.440 ± 0.108M/s (+2.2%)
>>
>> Cc: Matt (Qiang) Wu <wuqiang.matt@bytedance.com>
>> Signed-off-by: Andrii Nakryiko <andrii@kernel.org>
> 
> Hello,
> 
Hi Vlastimil,

> this question is not specific to your patch, but since it's a recent thread,
> I'll ask it here instead of digging up the original objpool patches.
> 
> I'm trying to understand how objpool works and if it could be integrated
> into SLUB, for the LSF/MM discussion next week:
> 
> https://lore.kernel.org/all/b929d5fb-8e88-4f23-8ec7-6bdaf61f84f9@suse.cz/
> 
>> +/* adding object to slot, abort if the slot was already full */
> 
> I don't see any actual abort in the code (not in this code nor in the
> deleted code - it's the same code, just moved for inlining purposes).
> 
>> +static inline int
>> +__objpool_try_add_slot(void *obj, struct objpool_head *pool, int cpu)
>> +{
>> +	struct objpool_slot *slot = pool->cpu_slots[cpu];
>> +	uint32_t head, tail;
>> +
>> +	/* loading tail and head as a local snapshot, tail first */
>> +	tail = READ_ONCE(slot->tail);
>> +
>> +	do {
>> +		head = READ_ONCE(slot->head);
>> +		/* fault caught: something must be wrong */
>> +		WARN_ON_ONCE(tail - head > pool->nr_objs);
> 
> So this will only WARN if we go over the capacity, but continue and
> overwrite a pointer that was already there, effectively leaking said object, no?

Yes, the WARN is only there to catch errors on the caller side (for
example, trying to push the same object multiple times).

> 
>> +	} while (!try_cmpxchg_acquire(&slot->tail, &tail, tail + 1));
>> +
>> +	/* now the tail position is reserved for the given obj */
>> +	WRITE_ONCE(slot->entries[tail & slot->mask], obj);
>> +	/* update sequence to make this obj available for pop() */
>> +	smp_store_release(&slot->last, tail + 1);
>> +
>> +	return 0;
>> +}
>>   
>>   /**
>>    * objpool_push() - reclaim the object and return back to objpool
>> @@ -134,7 +219,19 @@ void *objpool_pop(struct objpool_head *pool);
>>    * return: 0 or error code (it fails only when user tries to push
>>    * the same object multiple times or wrong "objects" into objpool)
>>    */
>> -int objpool_push(void *obj, struct objpool_head *pool);
>> +static inline int objpool_push(void *obj, struct objpool_head *pool)
>> +{
>> +	unsigned long flags;
>> +	int rc;
>> +
>> +	/* disable local irq to avoid preemption & interruption */
>> +	raw_local_irq_save(flags);
>> +	rc = __objpool_try_add_slot(obj, pool, raw_smp_processor_id());
> 
> And IIUC, we could in theory objpool_pop() on one cpu, then later another
> cpu might do objpool_push() and cause the latter cpu's pool to go over
> capacity? Is there some implicit requirements of objpool users to take care
> of having matched cpu for pop and push? Are the current objpool users
> obeying this requirement? (I can see the selftests do, not sure about the
> actual users).
> Or am I missing something? Thanks.

The objects are all pre-allocated along with the creation of the new objpool
and the total number of objects never exceeds the capacity on the local node.
So objpool_push() will always find an available slot in the ring-array to
insert the given object back into. objpool_pop() will try looping over all
the percpu slots until an object is found or the whole objpool is empty.

Currently kretprobe is the only actual use case of objpool.
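
For reference, a generic user would look roughly like the sketch below (the
objinit callback signature, the GFP flags and the NULL release callback are
my assumptions for illustration; only objpool_init/objpool_pop/objpool_push/
objpool_free themselves appear elsewhere in this thread):

#include <linux/objpool.h>
#include <linux/gfp.h>

struct my_obj {
	unsigned long payload;
};

/* assumed callback shape: invoked once per pre-allocated object */
static int my_obj_init(void *obj, void *context)
{
	struct my_obj *o = obj;

	o->payload = 0;
	return 0;
}

static int objpool_demo(void)
{
	struct objpool_head pool;
	struct my_obj *o;
	int err;

	/* all 64 objects are pre-allocated here, nothing is allocated later */
	err = objpool_init(&pool, 64, sizeof(struct my_obj), GFP_KERNEL,
			   NULL, my_obj_init, NULL);
	if (err)
		return err;

	o = objpool_pop(&pool);		/* never allocates; NULL when the pool is empty */
	if (o) {
		o->payload++;
		objpool_push(o, &pool);	/* may legitimately run on a different cpu */
	}

	objpool_free(&pool);		/* forced release of the whole pool */
	return 0;
}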

I'm testing an updated objpool in our HIDS project for critical paths,
which is widely deployed on servers inside my company. The new version
eliminates the raw_local_irq_save and raw_local_irq_restore pair in
objpool_push and gains up to a 5% performance boost.

> 
>> +	raw_local_irq_restore(flags);
>> +
>> +	return rc;
>> +}
>> +
>>   
>>   /**
>>    * objpool_drop() - discard the object and deref objpool
>> diff --git a/lib/objpool.c b/lib/objpool.c
>> index cfdc02420884..f696308fc026 100644
>> --- a/lib/objpool.c
>> +++ b/lib/objpool.c
>> @@ -152,106 +152,6 @@ int objpool_init(struct objpool_head *pool, int nr_objs, int object_size,
>>   }
>>   EXPORT_SYMBOL_GPL(objpool_init);
>>   
>> -/* adding object to slot, abort if the slot was already full */
>> -static inline int
>> -objpool_try_add_slot(void *obj, struct objpool_head *pool, int cpu)
>> -{
>> -	struct objpool_slot *slot = pool->cpu_slots[cpu];
>> -	uint32_t head, tail;
>> -
>> -	/* loading tail and head as a local snapshot, tail first */
>> -	tail = READ_ONCE(slot->tail);
>> -
>> -	do {
>> -		head = READ_ONCE(slot->head);
>> -		/* fault caught: something must be wrong */
>> -		WARN_ON_ONCE(tail - head > pool->nr_objs);
>> -	} while (!try_cmpxchg_acquire(&slot->tail, &tail, tail + 1));
>> -
>> -	/* now the tail position is reserved for the given obj */
>> -	WRITE_ONCE(slot->entries[tail & slot->mask], obj);
>> -	/* update sequence to make this obj available for pop() */
>> -	smp_store_release(&slot->last, tail + 1);
>> -
>> -	return 0;
>> -}
>> -
>> -/* reclaim an object to object pool */
>> -int objpool_push(void *obj, struct objpool_head *pool)
>> -{
>> -	unsigned long flags;
>> -	int rc;
>> -
>> -	/* disable local irq to avoid preemption & interruption */
>> -	raw_local_irq_save(flags);
>> -	rc = objpool_try_add_slot(obj, pool, raw_smp_processor_id());
>> -	raw_local_irq_restore(flags);
>> -
>> -	return rc;
>> -}
>> -EXPORT_SYMBOL_GPL(objpool_push);
>> -
>> -/* try to retrieve object from slot */
>> -static inline void *objpool_try_get_slot(struct objpool_head *pool, int cpu)
>> -{
>> -	struct objpool_slot *slot = pool->cpu_slots[cpu];
>> -	/* load head snapshot, other cpus may change it */
>> -	uint32_t head = smp_load_acquire(&slot->head);
>> -
>> -	while (head != READ_ONCE(slot->last)) {
>> -		void *obj;
>> -
>> -		/*
>> -		 * data visibility of 'last' and 'head' could be out of
>> -		 * order since memory updating of 'last' and 'head' are
>> -		 * performed in push() and pop() independently
>> -		 *
>> -		 * before any retrieving attempts, pop() must guarantee
>> -		 * 'last' is behind 'head', that is to say, there must
>> -		 * be available objects in slot, which could be ensured
>> -		 * by condition 'last != head && last - head <= nr_objs'
>> -		 * that is equivalent to 'last - head - 1 < nr_objs' as
>> -		 * 'last' and 'head' are both unsigned int32
>> -		 */
>> -		if (READ_ONCE(slot->last) - head - 1 >= pool->nr_objs) {
>> -			head = READ_ONCE(slot->head);
>> -			continue;
>> -		}
>> -
>> -		/* obj must be retrieved before moving forward head */
>> -		obj = READ_ONCE(slot->entries[head & slot->mask]);
>> -
>> -		/* move head forward to mark it's consumption */
>> -		if (try_cmpxchg_release(&slot->head, &head, head + 1))
>> -			return obj;
>> -	}
>> -
>> -	return NULL;
>> -}
>> -
>> -/* allocate an object from object pool */
>> -void *objpool_pop(struct objpool_head *pool)
>> -{
>> -	void *obj = NULL;
>> -	unsigned long flags;
>> -	int i, cpu;
>> -
>> -	/* disable local irq to avoid preemption & interruption */
>> -	raw_local_irq_save(flags);
>> -
>> -	cpu = raw_smp_processor_id();
>> -	for (i = 0; i < num_possible_cpus(); i++) {
>> -		obj = objpool_try_get_slot(pool, cpu);
>> -		if (obj)
>> -			break;
>> -		cpu = cpumask_next_wrap(cpu, cpu_possible_mask, -1, 1);
>> -	}
>> -	raw_local_irq_restore(flags);
>> -
>> -	return obj;
>> -}
>> -EXPORT_SYMBOL_GPL(objpool_pop);
>> -
>>   /* release whole objpool forcely */
>>   void objpool_free(struct objpool_head *pool)
>>   {
> 

Regards,
Matt Wu
Vlastimil Babka May 10, 2024, 8:20 a.m. UTC | #3
On 5/10/24 9:59 AM, wuqiang.matt wrote:
> On 2024/5/7 21:55, Vlastimil Babka wrote:
 >>
>>> +	} while (!try_cmpxchg_acquire(&slot->tail, &tail, tail + 1));
>>> +
>>> +	/* now the tail position is reserved for the given obj */
>>> +	WRITE_ONCE(slot->entries[tail & slot->mask], obj);
>>> +	/* update sequence to make this obj available for pop() */
>>> +	smp_store_release(&slot->last, tail + 1);
>>> +
>>> +	return 0;
>>> +}
>>>   
>>>   /**
>>>    * objpool_push() - reclaim the object and return back to objpool
>>> @@ -134,7 +219,19 @@ void *objpool_pop(struct objpool_head *pool);
>>>    * return: 0 or error code (it fails only when user tries to push
>>>    * the same object multiple times or wrong "objects" into objpool)
>>>    */
>>> -int objpool_push(void *obj, struct objpool_head *pool);
>>> +static inline int objpool_push(void *obj, struct objpool_head *pool)
>>> +{
>>> +	unsigned long flags;
>>> +	int rc;
>>> +
>>> +	/* disable local irq to avoid preemption & interruption */
>>> +	raw_local_irq_save(flags);
>>> +	rc = __objpool_try_add_slot(obj, pool, raw_smp_processor_id());
>> 
>> And IIUC, we could in theory objpool_pop() on one cpu, then later another
>> cpu might do objpool_push() and cause the latter cpu's pool to go over
>> capacity? Is there some implicit requirements of objpool users to take care
>> of having matched cpu for pop and push? Are the current objpool users
>> obeying this requirement? (I can see the selftests do, not sure about the
>> actual users).
>> Or am I missing something? Thanks.
> 
> The objects are all pre-allocated along with creation of the new objpool
> and the total number of objects never exceeds the capacity on local node.

Aha, I see, the capacity of entries is enough to hold the objects from all
nodes in the most unfortunate case where they all end up freed from a single cpu.

> So objpool_push() would always find an available slot from the ring-array
> for the given object to insert back. objpool_pop() would try looping all
> the percpu slots until an object is found or whole objpool is empty.

So it's correct, but it seems rather wasteful to have the whole capacity for
entries replicated on every cpu? It does make objpool_push() simple and
fast, but as you say, objpool_pop() still has to search potentially all
non-local percpu slots, with irqs disabled, which is far from ideal.
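
Just to put a rough number on it (illustrative figures, assuming each per-cpu
ring holds all nr_objs pointer-sized entries): with nr_objs = 4096 and 64
possible cpus, the replicated entries arrays alone come to about
64 * 4096 * 8 bytes, i.e. ~2 MiB, versus ~32 KiB for a single shared ring of
the same capacity (the objects themselves are of course not duplicated).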

And the "abort if the slot was already full" comment for
objpool_try_add_slot() still seems misleading? Maybe that was your initial
idea but it changed later?

> Currently kretprobe is the only actual usecase of objpool.
> 
> I'm testing an updated objpool in our HIDS project for critical pathes,
> which is widely deployed on servers inside my company. The new version
> eliminates the raw_local_irq_save and raw_local_irq_restore pair of
> objpool_push and gains up to 5% of performance boost.

Mind Cc'ing me and linux-mm once you post that?

Thanks,
Vlastimil
wuqiang.matt May 10, 2024, 9:15 a.m. UTC | #4
On 2024/5/10 16:20, Vlastimil Babka wrote:
> On 5/10/24 9:59 AM, wuqiang.matt wrote:
>> On 2024/5/7 21:55, Vlastimil Babka wrote:
>   >>
>>>> +	} while (!try_cmpxchg_acquire(&slot->tail, &tail, tail + 1));
>>>> +
>>>> +	/* now the tail position is reserved for the given obj */
>>>> +	WRITE_ONCE(slot->entries[tail & slot->mask], obj);
>>>> +	/* update sequence to make this obj available for pop() */
>>>> +	smp_store_release(&slot->last, tail + 1);
>>>> +
>>>> +	return 0;
>>>> +}
>>>>    
>>>>    /**
>>>>     * objpool_push() - reclaim the object and return back to objpool
>>>> @@ -134,7 +219,19 @@ void *objpool_pop(struct objpool_head *pool);
>>>>     * return: 0 or error code (it fails only when user tries to push
>>>>     * the same object multiple times or wrong "objects" into objpool)
>>>>     */
>>>> -int objpool_push(void *obj, struct objpool_head *pool);
>>>> +static inline int objpool_push(void *obj, struct objpool_head *pool)
>>>> +{
>>>> +	unsigned long flags;
>>>> +	int rc;
>>>> +
>>>> +	/* disable local irq to avoid preemption & interruption */
>>>> +	raw_local_irq_save(flags);
>>>> +	rc = __objpool_try_add_slot(obj, pool, raw_smp_processor_id());
>>>
>>> And IIUC, we could in theory objpool_pop() on one cpu, then later another
>>> cpu might do objpool_push() and cause the latter cpu's pool to go over
>>> capacity? Is there some implicit requirements of objpool users to take care
>>> of having matched cpu for pop and push? Are the current objpool users
>>> obeying this requirement? (I can see the selftests do, not sure about the
>>> actual users).
>>> Or am I missing something? Thanks.
>>
>> The objects are all pre-allocated along with creation of the new objpool
>> and the total number of objects never exceeds the capacity on local node.
> 
> Aha, I see, the capacity of entries is enough to hold objects from all nodes
> in the most unfortunate case they all end up freed from a single cpu.
> 
>> So objpool_push() would always find an available slot from the ring-array
>> for the given object to insert back. objpool_pop() would try looping all
>> the percpu slots until an object is found or whole objpool is empty.
> 
> So it's correct, but seems rather wasteful to have the whole capacity for
> entries replicated on every cpu? It does make objpool_push() simple and
> fast, but as you say, objpool_pop() still has to search potentially all
> non-local percpu slots, with disabled irqs, which is far from ideal.

Yes, it's a trade-off between performance and memory usage: a slight
increase in memory consumption for a significant improvement in performance.

The reason for disabling local irqs is that objpool uses a 32-bit sequence
number as the state description of each element. In extreme cases it could
overflow and come back around to the same value. A 64-bit value would
eliminate the collision but seems too heavy.
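
A tiny userspace sketch of the 32-bit wrap arithmetic involved (just for
illustration, not kernel code):

#include <stdint.h>
#include <stdio.h>

int main(void)
{
	/* consumer index just below the 32-bit wrap, producer index past it */
	uint32_t head = 0xfffffffeu;
	uint32_t tail = head + 3;	/* wraps around to 1 */

	/*
	 * unsigned subtraction still yields the in-flight count, so checks
	 * like 'tail - head > nr_objs' keep working across the wrap; trouble
	 * only starts if a stale snapshot and the live counter drift apart
	 * by a full 2^32 (e.g. across a very long preemption), which is part
	 * of why the fast paths run with local irqs disabled
	 */
	printf("tail=%u head=%u in-flight=%u\n", tail, head, tail - head);
	return 0;
}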

> And the "abort if the slot was already full" comment for
> objpool_try_add_slot() seems still misleading? Maybe that was your initial
> idea but changed later?

Right, the comment was just left unchanged across iterations. The original
implementation kept each percpu ring-array very compact, and objpool_push
would try looping over all cpu nodes to return the given object to the
objpool.

Actually, my new update removes objpool_try_add_slot and integrates its
functionality into objpool_push. I'll submit the new patch when I finish
the verification.
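
Purely as a mechanical illustration of that folding (this is not the
forthcoming patch, which also reworks the irq handling), the merged push
path would look something like:

static inline int objpool_push(void *obj, struct objpool_head *pool)
{
	struct objpool_slot *slot;
	uint32_t head, tail;
	unsigned long flags;

	/* disable local irq to avoid preemption & interruption */
	raw_local_irq_save(flags);
	slot = pool->cpu_slots[raw_smp_processor_id()];

	/* loading tail and head as a local snapshot, tail first */
	tail = READ_ONCE(slot->tail);
	do {
		head = READ_ONCE(slot->head);
		/* fault caught: something must be wrong */
		WARN_ON_ONCE(tail - head > pool->nr_objs);
	} while (!try_cmpxchg_acquire(&slot->tail, &tail, tail + 1));

	/* now the tail position is reserved for the given obj */
	WRITE_ONCE(slot->entries[tail & slot->mask], obj);
	/* update sequence to make this obj available for pop() */
	smp_store_release(&slot->last, tail + 1);

	raw_local_irq_restore(flags);
	return 0;
}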

> 
>> Currently kretprobe is the only actual usecase of objpool.
>>
>> I'm testing an updated objpool in our HIDS project for critical pathes,
>> which is widely deployed on servers inside my company. The new version
>> eliminates the raw_local_irq_save and raw_local_irq_restore pair of
>> objpool_push and gains up to 5% of performance boost.
> 
> Mind Ccing me and linux-mm once you are posting that?

Sure, I'll make sure to let you know.

> Thanks,
> Vlastimil
> 

Regards,
Matt Wu

Patch

diff --git a/include/linux/objpool.h b/include/linux/objpool.h
index 15aff4a17f0c..d8b1f7b91128 100644
--- a/include/linux/objpool.h
+++ b/include/linux/objpool.h
@@ -5,6 +5,10 @@ 
 
 #include <linux/types.h>
 #include <linux/refcount.h>
+#include <linux/atomic.h>
+#include <linux/cpumask.h>
+#include <linux/irqflags.h>
+#include <linux/smp.h>
 
 /*
  * objpool: ring-array based lockless MPMC queue
@@ -118,13 +122,94 @@  int objpool_init(struct objpool_head *pool, int nr_objs, int object_size,
 		 gfp_t gfp, void *context, objpool_init_obj_cb objinit,
 		 objpool_fini_cb release);
 
+/* try to retrieve object from slot */
+static inline void *__objpool_try_get_slot(struct objpool_head *pool, int cpu)
+{
+	struct objpool_slot *slot = pool->cpu_slots[cpu];
+	/* load head snapshot, other cpus may change it */
+	uint32_t head = smp_load_acquire(&slot->head);
+
+	while (head != READ_ONCE(slot->last)) {
+		void *obj;
+
+		/*
+		 * data visibility of 'last' and 'head' could be out of
+		 * order since memory updating of 'last' and 'head' are
+		 * performed in push() and pop() independently
+		 *
+		 * before any retrieving attempts, pop() must guarantee
+		 * 'last' is behind 'head', that is to say, there must
+		 * be available objects in slot, which could be ensured
+		 * by condition 'last != head && last - head <= nr_objs'
+		 * that is equivalent to 'last - head - 1 < nr_objs' as
+		 * 'last' and 'head' are both unsigned int32
+		 */
+		if (READ_ONCE(slot->last) - head - 1 >= pool->nr_objs) {
+			head = READ_ONCE(slot->head);
+			continue;
+		}
+
+		/* obj must be retrieved before moving forward head */
+		obj = READ_ONCE(slot->entries[head & slot->mask]);
+
+		/* move head forward to mark it's consumption */
+		if (try_cmpxchg_release(&slot->head, &head, head + 1))
+			return obj;
+	}
+
+	return NULL;
+}
+
 /**
  * objpool_pop() - allocate an object from objpool
  * @pool: object pool
  *
  * return value: object ptr or NULL if failed
  */
-void *objpool_pop(struct objpool_head *pool);
+static inline void *objpool_pop(struct objpool_head *pool)
+{
+	void *obj = NULL;
+	unsigned long flags;
+	int i, cpu;
+
+	/* disable local irq to avoid preemption & interruption */
+	raw_local_irq_save(flags);
+
+	cpu = raw_smp_processor_id();
+	for (i = 0; i < num_possible_cpus(); i++) {
+		obj = __objpool_try_get_slot(pool, cpu);
+		if (obj)
+			break;
+		cpu = cpumask_next_wrap(cpu, cpu_possible_mask, -1, 1);
+	}
+	raw_local_irq_restore(flags);
+
+	return obj;
+}
+
+/* adding object to slot, abort if the slot was already full */
+static inline int
+__objpool_try_add_slot(void *obj, struct objpool_head *pool, int cpu)
+{
+	struct objpool_slot *slot = pool->cpu_slots[cpu];
+	uint32_t head, tail;
+
+	/* loading tail and head as a local snapshot, tail first */
+	tail = READ_ONCE(slot->tail);
+
+	do {
+		head = READ_ONCE(slot->head);
+		/* fault caught: something must be wrong */
+		WARN_ON_ONCE(tail - head > pool->nr_objs);
+	} while (!try_cmpxchg_acquire(&slot->tail, &tail, tail + 1));
+
+	/* now the tail position is reserved for the given obj */
+	WRITE_ONCE(slot->entries[tail & slot->mask], obj);
+	/* update sequence to make this obj available for pop() */
+	smp_store_release(&slot->last, tail + 1);
+
+	return 0;
+}
 
 /**
  * objpool_push() - reclaim the object and return back to objpool
@@ -134,7 +219,19 @@  void *objpool_pop(struct objpool_head *pool);
  * return: 0 or error code (it fails only when user tries to push
  * the same object multiple times or wrong "objects" into objpool)
  */
-int objpool_push(void *obj, struct objpool_head *pool);
+static inline int objpool_push(void *obj, struct objpool_head *pool)
+{
+	unsigned long flags;
+	int rc;
+
+	/* disable local irq to avoid preemption & interruption */
+	raw_local_irq_save(flags);
+	rc = __objpool_try_add_slot(obj, pool, raw_smp_processor_id());
+	raw_local_irq_restore(flags);
+
+	return rc;
+}
+
 
 /**
  * objpool_drop() - discard the object and deref objpool
diff --git a/lib/objpool.c b/lib/objpool.c
index cfdc02420884..f696308fc026 100644
--- a/lib/objpool.c
+++ b/lib/objpool.c
@@ -152,106 +152,6 @@  int objpool_init(struct objpool_head *pool, int nr_objs, int object_size,
 }
 EXPORT_SYMBOL_GPL(objpool_init);
 
-/* adding object to slot, abort if the slot was already full */
-static inline int
-objpool_try_add_slot(void *obj, struct objpool_head *pool, int cpu)
-{
-	struct objpool_slot *slot = pool->cpu_slots[cpu];
-	uint32_t head, tail;
-
-	/* loading tail and head as a local snapshot, tail first */
-	tail = READ_ONCE(slot->tail);
-
-	do {
-		head = READ_ONCE(slot->head);
-		/* fault caught: something must be wrong */
-		WARN_ON_ONCE(tail - head > pool->nr_objs);
-	} while (!try_cmpxchg_acquire(&slot->tail, &tail, tail + 1));
-
-	/* now the tail position is reserved for the given obj */
-	WRITE_ONCE(slot->entries[tail & slot->mask], obj);
-	/* update sequence to make this obj available for pop() */
-	smp_store_release(&slot->last, tail + 1);
-
-	return 0;
-}
-
-/* reclaim an object to object pool */
-int objpool_push(void *obj, struct objpool_head *pool)
-{
-	unsigned long flags;
-	int rc;
-
-	/* disable local irq to avoid preemption & interruption */
-	raw_local_irq_save(flags);
-	rc = objpool_try_add_slot(obj, pool, raw_smp_processor_id());
-	raw_local_irq_restore(flags);
-
-	return rc;
-}
-EXPORT_SYMBOL_GPL(objpool_push);
-
-/* try to retrieve object from slot */
-static inline void *objpool_try_get_slot(struct objpool_head *pool, int cpu)
-{
-	struct objpool_slot *slot = pool->cpu_slots[cpu];
-	/* load head snapshot, other cpus may change it */
-	uint32_t head = smp_load_acquire(&slot->head);
-
-	while (head != READ_ONCE(slot->last)) {
-		void *obj;
-
-		/*
-		 * data visibility of 'last' and 'head' could be out of
-		 * order since memory updating of 'last' and 'head' are
-		 * performed in push() and pop() independently
-		 *
-		 * before any retrieving attempts, pop() must guarantee
-		 * 'last' is behind 'head', that is to say, there must
-		 * be available objects in slot, which could be ensured
-		 * by condition 'last != head && last - head <= nr_objs'
-		 * that is equivalent to 'last - head - 1 < nr_objs' as
-		 * 'last' and 'head' are both unsigned int32
-		 */
-		if (READ_ONCE(slot->last) - head - 1 >= pool->nr_objs) {
-			head = READ_ONCE(slot->head);
-			continue;
-		}
-
-		/* obj must be retrieved before moving forward head */
-		obj = READ_ONCE(slot->entries[head & slot->mask]);
-
-		/* move head forward to mark it's consumption */
-		if (try_cmpxchg_release(&slot->head, &head, head + 1))
-			return obj;
-	}
-
-	return NULL;
-}
-
-/* allocate an object from object pool */
-void *objpool_pop(struct objpool_head *pool)
-{
-	void *obj = NULL;
-	unsigned long flags;
-	int i, cpu;
-
-	/* disable local irq to avoid preemption & interruption */
-	raw_local_irq_save(flags);
-
-	cpu = raw_smp_processor_id();
-	for (i = 0; i < num_possible_cpus(); i++) {
-		obj = objpool_try_get_slot(pool, cpu);
-		if (obj)
-			break;
-		cpu = cpumask_next_wrap(cpu, cpu_possible_mask, -1, 1);
-	}
-	raw_local_irq_restore(flags);
-
-	return obj;
-}
-EXPORT_SYMBOL_GPL(objpool_pop);
-
 /* release whole objpool forcely */
 void objpool_free(struct objpool_head *pool)
 {