[bpf-next,0/3] Introduce local_storage exclusive caching

Message ID: 20220420002143.1096548-1-davemarchevsky@fb.com

Message

Dave Marchevsky April 20, 2022, 12:21 a.m. UTC
Currently, each local_storage type (sk, inode, task) has a 16-entry
cache for local_storage data associated with a particular map. A
local_storage map is assigned a fixed cache_idx when it is allocated.
When looking in a local_storage for data associated with a map, the
cache entry at cache_idx is the only place the map can appear in the
cache. If the map's data is not in the cache, it is placed there after
a search through the cache hlist. When there are >16 local_storage maps
allocated for a local_storage type, multiple maps share the same
cache_idx and thus may knock each other out of the cache.
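
For reference, the lookup path described above looks roughly like the
following (a simplified sketch of bpf_local_storage_lookup() in
kernel/bpf/bpf_local_storage.c; locking and some RCU details are
elided):

  struct bpf_local_storage_data *
  lookup(struct bpf_local_storage *local_storage,
         struct bpf_local_storage_map *smap)
  {
          struct bpf_local_storage_data *sdata;
          struct bpf_local_storage_elem *selem;

          /* Fast path: the map's fixed cache slot. Another map sharing
           * this cache_idx may have evicted us, so check ownership.
           */
          sdata = rcu_dereference(local_storage->cache[smap->cache_idx]);
          if (sdata && rcu_access_pointer(sdata->smap) == smap)
                  return sdata;

          /* Slow path: walk the hlist of all elems in this storage */
          hlist_for_each_entry_rcu(selem, &local_storage->list, snode) {
                  if (rcu_access_pointer(SDATA(selem)->smap) != smap)
                          continue;
                  /* Cache it for next time, possibly evicting whichever
                   * map currently holds the slot (the real code takes
                   * local_storage->lock around this).
                   */
                  rcu_assign_pointer(local_storage->cache[smap->cache_idx],
                                     SDATA(selem));
                  return SDATA(selem);
          }

          return NULL;
  }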

BPF programs that use local_storage may require fast and consistent
local storage access. For example, a BPF program using task local
storage to make scheduling decisions could not tolerate a long hlist
search for its local_storage, as this would negatively affect cycles
available to applications. Providing a mechanism for such a program to
guarantee that its local_storage data will always be in the cache would
ensure fast access.

This series introduces a BPF_LOCAL_STORAGE_FORCE_CACHE flag that can be
set on sk, inode, and task local_storage maps via map_extras. When a map
with the FORCE_CACHE flag set is allocated, it is assigned an 'exclusive'
cache slot from which it can't be evicted until the map is freed.

If there are no slots available to exclusively claim, the allocation
fails. BPF programs are expected to use BPF_LOCAL_STORAGE_FORCE_CACHE
only if their data _must_ be in cache.
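
For illustration, a map opting into exclusive caching might be declared
like this (a hypothetical libbpf-style definition, assuming the flag
lands in the uapi as proposed and is passed via the existing map_extra
field; 'sched_data' and 'sched_storage' are made-up names, and the usual
vmlinux.h / <bpf/bpf_helpers.h> includes are assumed):

  /* per-task payload a scheduling-related prog might keep */
  struct sched_data {
          __u64 last_runtime_ns;
  };

  struct {
          __uint(type, BPF_MAP_TYPE_TASK_STORAGE);
          __uint(map_flags, BPF_F_NO_PREALLOC);
          __type(key, int);
          __type(value, struct sched_data);
          /* claim an exclusive cache slot at map allocation time;
           * map creation fails if all 16 slots are already claimed
           */
          __uint(map_extra, BPF_LOCAL_STORAGE_FORCE_CACHE);
  } sched_storage SEC(".maps");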

The existing cache slots are used - as opposed to a separate cache - as
exclusive caching is not expected to be used by the majority of
local_storage BPF programs, so it is better to avoid adding a separate
cache that would bloat memory and go unused.

Patches:
* Patch 1 implements kernel-side changes to support the feature
* Patch 2 adds selftests validating functionality
* Patch 3 is a one-line #define dedupe

Dave Marchevsky (3):
  bpf: Introduce local_storage exclusive caching option
  selftests/bpf: Add local_storage exclusive cache test
  bpf: Remove duplicate define in bpf_local_storage.h

 include/linux/bpf_local_storage.h             |   8 +-
 include/uapi/linux/bpf.h                      |  14 +++
 kernel/bpf/bpf_inode_storage.c                |  16 ++-
 kernel/bpf/bpf_local_storage.c                |  42 ++++++--
 kernel/bpf/bpf_task_storage.c                 |  16 ++-
 kernel/bpf/syscall.c                          |   7 +-
 net/core/bpf_sk_storage.c                     |  15 ++-
 .../test_local_storage_excl_cache.c           |  52 +++++++++
 .../bpf/progs/local_storage_excl_cache.c      | 100 ++++++++++++++++++
 .../bpf/progs/local_storage_excl_cache_fail.c |  36 +++++++
 10 files changed, 283 insertions(+), 23 deletions(-)
 create mode 100644 tools/testing/selftests/bpf/prog_tests/test_local_storage_excl_cache.c
 create mode 100644 tools/testing/selftests/bpf/progs/local_storage_excl_cache.c
 create mode 100644 tools/testing/selftests/bpf/progs/local_storage_excl_cache_fail.c

Comments

Alexei Starovoitov April 22, 2022, 1:40 a.m. UTC | #1
On Tue, Apr 19, 2022 at 05:21:40PM -0700, Dave Marchevsky wrote:
> [...]
> 
> This series introduces a BPF_LOCAL_STORAGE_FORCE_CACHE flag that can be
> set on sk, inode, and task local_storage maps via map_extras. When a map
> with the FORCE_CACHE flag set is allocated, it is assigned an 'exclusive'
> cache slot from which it can't be evicted until the map is freed.
> 
> If there are no slots available to exclusively claim, the allocation
> fails. BPF programs are expected to use BPF_LOCAL_STORAGE_FORCE_CACHE
> only if their data _must_ be in cache.

I'm afraid a new uapi flag doesn't solve this problem.
Sooner or later every bpf program would deem itself "important" and
performance critical. All of them will be using the FORCE_CACHE flag
and we will be back to the same situation.

Also, please share performance data that shows more than 16 programs
using local storage at the same time, where the existing simple cache
replacement logic is not enough.
For any kind of link list walking to become an issue there have to be
at least 17 progs, and two progs would have to pick up the same
cache_idx and run interleaved to evict each other.
It feels like an unlikely scenario, so real data would be good to see.
If it really is an issue we might need different caching logic.
For example, instead of a single link list per local storage we might
have 16 link lists, with cache_idx pointing to a slot.
If an elem is not 1st in its slot it will be 2nd in a much shorter
link list. With 16 slots the link lists will have at most 2 elements
until 32 bpf progs are using local storage.
We could also get rid of the cache and replace it with a mini hash
table of N elements, where map_id would be the index into the hash
table.
All sorts of other algorithms are possible.
In any case the bpf user shouldn't be telling the kernel about the
"importance" of its program. If a program is indeed executing a lot,
the kernel should be caching/accelerating it where it can.
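
Concretely, the 16-list idea might look something like the following
(an illustrative sketch only, not code from this series; names and
layout are made up):

  #define CACHE_SLOTS 16 /* BPF_LOCAL_STORAGE_CACHE_SIZE today */

  struct bpf_local_storage {
          /* one short hlist per cache slot instead of a single
           * long hlist plus a separate cache array
           */
          struct hlist_head lists[CACHE_SLOTS];
          /* ... */
  };

  static struct bpf_local_storage_data *
  lookup(struct bpf_local_storage *storage,
         struct bpf_local_storage_map *smap)
  {
          struct bpf_local_storage_elem *selem;

          /* Walk only the short list selected by this map's cache_idx;
           * with <= 32 maps in use each list has at most 2 elems.
           */
          hlist_for_each_entry_rcu(selem,
                                   &storage->lists[smap->cache_idx],
                                   snode)
                  if (rcu_access_pointer(SDATA(selem)->smap) == smap)
                          return SDATA(selem);
          return NULL;
  }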
Dave Marchevsky April 22, 2022, 4:05 a.m. UTC | #2
On 4/21/22 9:40 PM, Alexei Starovoitov wrote:   
> On Tue, Apr 19, 2022 at 05:21:40PM -0700, Dave Marchevsky wrote:
>> [...]
> 
> I'm afraid a new uapi flag doesn't solve this problem.
> Sooner or later every bpf program would deem itself "important" and
> performance critical. All of them will be using the FORCE_CACHE flag
> and we will be back to the same situation.

In this scenario, if 16 maps had been loaded w/ the FORCE_CACHE flag and
a 17th tried to load, it would fail, so programs depending on that map
would fail to load.
Patch 2 adds a selftest 'local_storage_excl_cache_fail' demonstrating this.

> [...]
> In any case the bpf user shouldn't be telling the kernel about the
> "importance" of its program. If a program is indeed executing a lot,
> the kernel should be caching/accelerating it where it can.

It's worth noting that this is a map-level setting, not prog-level. Telling the
kernel about the importance of data feels more palatable to me. Sort of like
mmap's MAP_LOCKED, but for the local_storage cache.

Going back to the motivating example - using data in task local_storage to make
scheduling decisions - the desire is for the task local_storage access to be
like "accessing a task_struct member" vs "doing a search for the right data to
access (w/ some caching to try to avoid the search)".

Re: performance data, would adding a benchmark in selftests/bpf/benchs work?
Alexei Starovoitov April 22, 2022, 10:07 p.m. UTC | #3
On Fri, Apr 22, 2022 at 12:05:23AM -0400, Dave Marchevsky wrote:
> On 4/21/22 9:40 PM, Alexei Starovoitov wrote:   
> > On Tue, Apr 19, 2022 at 05:21:40PM -0700, Dave Marchevsky wrote:
> >> [...]
> > 
> > I'm afraid a new uapi flag doesn't solve this problem.
> > Sooner or later every bpf program would deem itself "important" and
> > performance critical. All of them will be using the FORCE_CACHE flag
> > and we will be back to the same situation.
> 
> In this scenario, if 16 maps had been loaded w/ the FORCE_CACHE flag and
> a 17th tried to load, it would fail, so programs depending on that map
> would fail to load.

Ahh. I missed that the cache_idx is assigned at map creation time.

> Patch 2 adds a selftest 'local_storage_excl_cache_fail' demonstrating this.
> 
> > [...]
> > All sorts of other algorithms are possible.
> > In any case the bpf user shouldn't be telling the kernel about the
> > "importance" of its program. If a program is indeed executing a lot,
> > the kernel should be caching/accelerating it where it can.
> 
> It's worth noting that this is a map-level setting, not prog-level. Telling the
> kernel about the importance of data feels more palatable to me. Sort of like
> mmap's MAP_LOCKED, but for the local_storage cache.

For mmap it's an operational difference. Not just performance.

> Going back to the motivating example - using data in task local_storage to make
> scheduling decisions - the desire is for the task local_storage access to be
> like "accessing a task_struct member" vs "doing a search for the right data to
> access (w/ some caching to try to avoid the search)".

Exactly. The motivation is performance. Let's try to achieve good performance
without uapi flags.
Think from the user's pov: they have to pick between FORCE_CACHE == good
performance and no flag == bad(?) performance.
Why would anyone pick something that has worse performance?
Yosry Ahmed April 23, 2022, 9:43 a.m. UTC | #4
On Thu, Apr 21, 2022 at 6:47 PM Alexei Starovoitov
<alexei.starovoitov@gmail.com> wrote:
>
> On Tue, Apr 19, 2022 at 05:21:40PM -0700, Dave Marchevsky wrote:
> > [...]
>
> [...]
> If it really is an issue we might need different caching logic.
> For example, instead of a single link list per local storage we might
> have 16 link lists, with cache_idx pointing to a slot.
> If an elem is not 1st in its slot it will be 2nd in a much shorter
> link list. With 16 slots the link lists will have at most 2 elements
> until 32 bpf progs are using local storage.
> We could also get rid of the cache and replace it with a mini hash
> table of N elements, where map_id would be the index into the hash
> table.

This is a tangent to this discussion, but I was actually wondering why
the elements in bpf_local_storage are stored in a list rather than a
hashtable. Is there a specific reason for this?

> All sorts of other algorithms are possible.
> In any case the bpf user shouldn't be telling the kernel about the
> "importance" of its program. If a program is indeed executing a lot,
> the kernel should be caching/accelerating it where it can.