diff mbox series

[bpf-next,v5,1/3] bpf: introduce timeout hash map

Message ID 20210122205415.113822-2-xiyou.wangcong@gmail.com (mailing list archive)
State Changes Requested
Delegated to: BPF
Headers show
Series bpf: introduce timeout hash map | expand

Checks

Context Check Description
netdev/cover_letter success Link
netdev/fixes_present success Link
netdev/patch_count success Link
netdev/tree_selection success Clearly marked for bpf-next
netdev/subject_prefix success Link
netdev/cc_maintainers warning 5 maintainers not CCed: quentin@isovalent.com songliubraving@fb.com kpsingh@kernel.org john.fastabend@gmail.com yhs@fb.com
netdev/source_inline success Was 0 now: 0
netdev/verify_signedoff success Link
netdev/module_param success Was 0 now: 0
netdev/build_32bit fail Errors and warnings before: 12238 this patch: 12241
netdev/kdoc success Errors and warnings before: 0 this patch: 0
netdev/verify_fixes success Link
netdev/checkpatch warning CHECK: Alignment should match open parenthesis WARNING: line length of 86 exceeds 80 columns
netdev/build_allmodconfig_warn fail Errors and warnings before: 12886 this patch: 12889
netdev/header_inline success Link
netdev/stable success Stable not CCed

Commit Message

Cong Wang Jan. 22, 2021, 8:54 p.m. UTC
From: Cong Wang <cong.wang@bytedance.com>

This borrows the idea from conntrack and will be used for conntrack in
ebpf too. Each element in a timeout map has a user-specified timeout
in msecs, after it expires it will be automatically removed from the
map. Cilium already does the same thing, it uses a regular map or LRU
map to track connections and has its own GC in user-space. This does
not scale well when we have millions of connections, as each removal
needs a syscall. Even if we could batch the operations, it still needs
to copy a lot of data between kernel and user space.

There are two cases to consider here:

1. When the timeout map is idle, i.e. no one updates or accesses it,
   we rely on the delayed work to scan the whole hash table and remove
   these expired. The delayed work is scheduled every 1 sec when idle,
   which is also what conntrack uses. It is fine to scan the whole
   table as we do not actually remove elements during this scan,
   instead we simply queue them to the lockless list and defer all the
   removals to the next schedule.

2. When the timeout map is actively accessed, we could reach expired
   elements before the idle work automatically scans them, we can
   simply skip them and schedule the delayed work immediately to do
   the actual removals. We have to avoid taking locks on fast path.

The timeout of an element can be set or updated via bpf_map_update_elem()
and we reuse the upper 32-bit of the 64-bit flag for the timeout value,
as there are only a few bits are used currently. Note, a zero timeout
means to expire immediately.

To avoid adding memory overhead to regular map, we have to reuse some
field in struct htab_elem, that is, lru_node. Otherwise we would have
to rewrite a lot of code.

For now, batch ops is not supported, we can add it later if needed.

Cc: Andrii Nakryiko <andrii.nakryiko@gmail.com>
Cc: Alexei Starovoitov <ast@kernel.org>
Cc: Daniel Borkmann <daniel@iogearbox.net>
Cc: Dongdong Wang <wangdongdong.6@bytedance.com>
Signed-off-by: Cong Wang <cong.wang@bytedance.com>
---
 include/linux/bpf_types.h      |   1 +
 include/uapi/linux/bpf.h       |   5 +-
 kernel/bpf/hashtab.c           | 274 +++++++++++++++++++++++++++++++--
 kernel/bpf/syscall.c           |   3 +-
 tools/include/uapi/linux/bpf.h |   1 +
 5 files changed, 269 insertions(+), 15 deletions(-)

Comments

Daniel Borkmann Jan. 26, 2021, 10:04 p.m. UTC | #1
On 1/22/21 9:54 PM, Cong Wang wrote:
> From: Cong Wang <cong.wang@bytedance.com>
> 
> This borrows the idea from conntrack and will be used for conntrack in
> ebpf too. Each element in a timeout map has a user-specified timeout
> in msecs, after it expires it will be automatically removed from the
> map. Cilium already does the same thing, it uses a regular map or LRU
> map to track connections and has its own GC in user-space. This does
> not scale well when we have millions of connections, as each removal
> needs a syscall. Even if we could batch the operations, it still needs
> to copy a lot of data between kernel and user space.
> 
> There are two cases to consider here:
> 
> 1. When the timeout map is idle, i.e. no one updates or accesses it,
>     we rely on the delayed work to scan the whole hash table and remove
>     these expired. The delayed work is scheduled every 1 sec when idle,
>     which is also what conntrack uses. It is fine to scan the whole
>     table as we do not actually remove elements during this scan,
>     instead we simply queue them to the lockless list and defer all the
>     removals to the next schedule.
> 
> 2. When the timeout map is actively accessed, we could reach expired
>     elements before the idle work automatically scans them, we can
>     simply skip them and schedule the delayed work immediately to do
>     the actual removals. We have to avoid taking locks on fast path.
> 
> The timeout of an element can be set or updated via bpf_map_update_elem()
> and we reuse the upper 32-bit of the 64-bit flag for the timeout value,
> as there are only a few bits are used currently. Note, a zero timeout
> means to expire immediately.
> 
> To avoid adding memory overhead to regular map, we have to reuse some
> field in struct htab_elem, that is, lru_node. Otherwise we would have
> to rewrite a lot of code.
> 
> For now, batch ops is not supported, we can add it later if needed.

Back in earlier conversation [0], I mentioned also LRU map flavors and to look
into adding a flag, so we wouldn't need new BPF_MAP_TYPE_TIMEOUT_HASH/*LRU types
that replicate existing types once again just with the timeout in addition, so
UAPI wise new map type is not great.

Given you mention Cilium above, only for kernels where there is no LRU hash map,
that is < 4.10, we rely on plain hash, everything else LRU + prealloc to mitigate
ddos by refusing to add new entries when full whereas less active ones will be
purged instead. Timeout /only/ for plain hash is less useful overall, did you
sketch a more generic approach in the meantime that would work for all the htab/lru
flavors (and ideally as non-delayed_work based)?

   [0] https://lore.kernel.org/bpf/20201214201118.148126-1-xiyou.wangcong@gmail.com/

[...]
> @@ -1012,6 +1081,8 @@ static int htab_map_update_elem(struct bpf_map *map, void *key, void *value,
>   			copy_map_value_locked(map,
>   					      l_old->key + round_up(key_size, 8),
>   					      value, false);
> +			if (timeout_map)
> +				l_old->expires = msecs_to_expire(timeout);
>   			return 0;
>   		}
>   		/* fall through, grab the bucket lock and lookup again.
> @@ -1020,6 +1091,7 @@ static int htab_map_update_elem(struct bpf_map *map, void *key, void *value,
>   		 */
>   	}
>   
> +again:
>   	ret = htab_lock_bucket(htab, b, hash, &flags);
>   	if (ret)
>   		return ret;
> @@ -1040,26 +1112,41 @@ static int htab_map_update_elem(struct bpf_map *map, void *key, void *value,
>   		copy_map_value_locked(map,
>   				      l_old->key + round_up(key_size, 8),
>   				      value, false);
> +		if (timeout_map)
> +			l_old->expires = msecs_to_expire(timeout);
>   		ret = 0;
>   		goto err;
>   	}
>   
>   	l_new = alloc_htab_elem(htab, key, value, key_size, hash, false, false,
> -				l_old);
> +				timeout_map, l_old);
>   	if (IS_ERR(l_new)) {
> -		/* all pre-allocated elements are in use or memory exhausted */
>   		ret = PTR_ERR(l_new);
> +		if (ret == -EAGAIN) {
> +			htab_unlock_bucket(htab, b, hash, flags);
> +			htab_gc_elem(htab, l_old);
> +			mod_delayed_work(system_unbound_wq, &htab->gc_work, 0);
> +			goto again;

Also this one looks rather worrying, so the BPF prog is stalled here, loop-waiting
in (e.g. XDP) hot path for system_unbound_wq to kick in to make forward progress?

> +		}
> +		/* all pre-allocated elements are in use or memory exhausted */
>   		goto err;
>   	}
>   
> +	if (timeout_map)
> +		l_new->expires = msecs_to_expire(timeout);
> +
Cong Wang Jan. 27, 2021, 6:59 a.m. UTC | #2
On Tue, Jan 26, 2021 at 2:04 PM Daniel Borkmann <daniel@iogearbox.net> wrote:
>
> On 1/22/21 9:54 PM, Cong Wang wrote:
> > From: Cong Wang <cong.wang@bytedance.com>
> >
> > This borrows the idea from conntrack and will be used for conntrack in
> > ebpf too. Each element in a timeout map has a user-specified timeout
> > in msecs, after it expires it will be automatically removed from the
> > map. Cilium already does the same thing, it uses a regular map or LRU
> > map to track connections and has its own GC in user-space. This does
> > not scale well when we have millions of connections, as each removal
> > needs a syscall. Even if we could batch the operations, it still needs
> > to copy a lot of data between kernel and user space.
> >
> > There are two cases to consider here:
> >
> > 1. When the timeout map is idle, i.e. no one updates or accesses it,
> >     we rely on the delayed work to scan the whole hash table and remove
> >     these expired. The delayed work is scheduled every 1 sec when idle,
> >     which is also what conntrack uses. It is fine to scan the whole
> >     table as we do not actually remove elements during this scan,
> >     instead we simply queue them to the lockless list and defer all the
> >     removals to the next schedule.
> >
> > 2. When the timeout map is actively accessed, we could reach expired
> >     elements before the idle work automatically scans them, we can
> >     simply skip them and schedule the delayed work immediately to do
> >     the actual removals. We have to avoid taking locks on fast path.
> >
> > The timeout of an element can be set or updated via bpf_map_update_elem()
> > and we reuse the upper 32-bit of the 64-bit flag for the timeout value,
> > as there are only a few bits are used currently. Note, a zero timeout
> > means to expire immediately.
> >
> > To avoid adding memory overhead to regular map, we have to reuse some
> > field in struct htab_elem, that is, lru_node. Otherwise we would have
> > to rewrite a lot of code.
> >
> > For now, batch ops is not supported, we can add it later if needed.
>
> Back in earlier conversation [0], I mentioned also LRU map flavors and to look
> into adding a flag, so we wouldn't need new BPF_MAP_TYPE_TIMEOUT_HASH/*LRU types
> that replicate existing types once again just with the timeout in addition, so
> UAPI wise new map type is not great.

Interestingly, Jamal also brought this flag up to me.

The reason why I don't use a flag is that I don't see any other maps need a
timeout. Take the LRU map you mentioned as an example, it evicts elements
based on size, not based on time, I can't think of a case where we need both
time and size based eviction. Another example is array, there is no way to
delete an element from an array, so we can't really expire an element.

Or do you have any use case for a non-regular hash map with timeout?

>
> Given you mention Cilium above, only for kernels where there is no LRU hash map,
> that is < 4.10, we rely on plain hash, everything else LRU + prealloc to mitigate
> ddos by refusing to add new entries when full whereas less active ones will be
> purged instead. Timeout /only/ for plain hash is less useful overall, did you

The difference between LRU and a timeout map is whether we should
drop the least recently used one too when it is full. For conntrack, this is not
a good idea, because when we have a burst of connections, the least recently
used might be just 1-s old, so evicting it out is not a good idea.
With a timeout
map, we just drop all new connection when it is full and wait for some connect
to timeout naturally, aligned with the definition of conntrack.

So it should be a good replacement for LRU map too in terms of conntrack.

> sketch a more generic approach in the meantime that would work for all the htab/lru
> flavors (and ideally as non-delayed_work based)?

If you mean LRU maps may need timeout too, like I explained above, I can't think
of any such use cases.

>
>    [0] https://lore.kernel.org/bpf/20201214201118.148126-1-xiyou.wangcong@gmail.com/
>
> [...]
> > @@ -1012,6 +1081,8 @@ static int htab_map_update_elem(struct bpf_map *map, void *key, void *value,
> >                       copy_map_value_locked(map,
> >                                             l_old->key + round_up(key_size, 8),
> >                                             value, false);
> > +                     if (timeout_map)
> > +                             l_old->expires = msecs_to_expire(timeout);
> >                       return 0;
> >               }
> >               /* fall through, grab the bucket lock and lookup again.
> > @@ -1020,6 +1091,7 @@ static int htab_map_update_elem(struct bpf_map *map, void *key, void *value,
> >                */
> >       }
> >
> > +again:
> >       ret = htab_lock_bucket(htab, b, hash, &flags);
> >       if (ret)
> >               return ret;
> > @@ -1040,26 +1112,41 @@ static int htab_map_update_elem(struct bpf_map *map, void *key, void *value,
> >               copy_map_value_locked(map,
> >                                     l_old->key + round_up(key_size, 8),
> >                                     value, false);
> > +             if (timeout_map)
> > +                     l_old->expires = msecs_to_expire(timeout);
> >               ret = 0;
> >               goto err;
> >       }
> >
> >       l_new = alloc_htab_elem(htab, key, value, key_size, hash, false, false,
> > -                             l_old);
> > +                             timeout_map, l_old);
> >       if (IS_ERR(l_new)) {
> > -             /* all pre-allocated elements are in use or memory exhausted */
> >               ret = PTR_ERR(l_new);
> > +             if (ret == -EAGAIN) {
> > +                     htab_unlock_bucket(htab, b, hash, flags);
> > +                     htab_gc_elem(htab, l_old);
> > +                     mod_delayed_work(system_unbound_wq, &htab->gc_work, 0);
> > +                     goto again;
>
> Also this one looks rather worrying, so the BPF prog is stalled here, loop-waiting
> in (e.g. XDP) hot path for system_unbound_wq to kick in to make forward progress?

In this case, the old one is scheduled for removal in GC, we just wait for GC
to finally remove it. It won't stall unless GC itself or the worker scheduler is
wrong, both of which should be kernel bugs.

If we don't do this, users would get a -E2BIG when it is not too big. I don't
know a better way to handle this sad situation, maybe returning -EBUSY
to users and let them call again?

Thanks.
Alexei Starovoitov Jan. 27, 2021, 6 p.m. UTC | #3
On Tue, Jan 26, 2021 at 11:00 PM Cong Wang <xiyou.wangcong@gmail.com> wrote:
> > >               ret = PTR_ERR(l_new);
> > > +             if (ret == -EAGAIN) {
> > > +                     htab_unlock_bucket(htab, b, hash, flags);
> > > +                     htab_gc_elem(htab, l_old);
> > > +                     mod_delayed_work(system_unbound_wq, &htab->gc_work, 0);
> > > +                     goto again;
> >
> > Also this one looks rather worrying, so the BPF prog is stalled here, loop-waiting
> > in (e.g. XDP) hot path for system_unbound_wq to kick in to make forward progress?
>
> In this case, the old one is scheduled for removal in GC, we just wait for GC
> to finally remove it. It won't stall unless GC itself or the worker scheduler is
> wrong, both of which should be kernel bugs.
>
> If we don't do this, users would get a -E2BIG when it is not too big. I don't
> know a better way to handle this sad situation, maybe returning -EBUSY
> to users and let them call again?

I think using wq for timers is a non-starter.
Tying a hash/lru map with a timer is not a good idea either.

I think timers have to be done as independent objects similar to
how the kernel uses them.
Then there will be no question whether lru or hash map needs it.
The bpf prog author will be able to use timers with either.
The prog will be able to use timers without hash maps too.

I'm proposing a timer map where each object will go through
bpf_timer_setup(timer, callback, flags);
where "callback" is a bpf subprogram.
Corresponding bpf_del_timer and bpf_mod_timer would work the same way
they are in the kernel.
The tricky part is kernel style of using from_timer() to access the
object with additional info.
I think bpf timer map can model it the same way.
At map creation time the value_size will specify the amount of extra
bytes necessary.
Another alternative is to pass an extra data argument to a callback.
Daniel Borkmann Jan. 27, 2021, 10:48 p.m. UTC | #4
On 1/27/21 7:00 PM, Alexei Starovoitov wrote:
> On Tue, Jan 26, 2021 at 11:00 PM Cong Wang <xiyou.wangcong@gmail.com> wrote:
>>>>                ret = PTR_ERR(l_new);
>>>> +             if (ret == -EAGAIN) {
>>>> +                     htab_unlock_bucket(htab, b, hash, flags);
>>>> +                     htab_gc_elem(htab, l_old);
>>>> +                     mod_delayed_work(system_unbound_wq, &htab->gc_work, 0);
>>>> +                     goto again;
>>>
>>> Also this one looks rather worrying, so the BPF prog is stalled here, loop-waiting
>>> in (e.g. XDP) hot path for system_unbound_wq to kick in to make forward progress?
>>
>> In this case, the old one is scheduled for removal in GC, we just wait for GC
>> to finally remove it. It won't stall unless GC itself or the worker scheduler is
>> wrong, both of which should be kernel bugs.
>>
>> If we don't do this, users would get a -E2BIG when it is not too big. I don't
>> know a better way to handle this sad situation, maybe returning -EBUSY
>> to users and let them call again?
> 
> I think using wq for timers is a non-starter.
> Tying a hash/lru map with a timer is not a good idea either.

Thinking some more, given we have jiffies64 helper and atomic ops for BPF by now,
we would technically only need the ability to delete entries via bpf iter progs
(d6c4503cc296 ("bpf: Implement bpf iterator for hash maps")) which could then be
kicked off from user space at e.g. dynamic intervals which would be the equivalent
for the wq in here. That patch could then be implemented this way. I presume
the ability to delete map entries from bpf iter progs would be generic and useful
enough anyway.

> I think timers have to be done as independent objects similar to
> how the kernel uses them.
> Then there will be no question whether lru or hash map needs it.
> The bpf prog author will be able to use timers with either.
> The prog will be able to use timers without hash maps too.
> 
> I'm proposing a timer map where each object will go through
> bpf_timer_setup(timer, callback, flags);
> where "callback" is a bpf subprogram.
> Corresponding bpf_del_timer and bpf_mod_timer would work the same way
> they are in the kernel.
> The tricky part is kernel style of using from_timer() to access the
> object with additional info.

Would this mean N timer objs for N map elems? I presume not given this could be
racy and would have huge extra mem overhead. Either way, timer obj could work, but
then at the same time you could probably also solve it with the above; it's not
like you need the timer to kick in at some /exact/ time, but rather at some point
to clean up stale entries before the map gets full and worst case refuses updates
for new entries. (In the ideal case though we wouldn't need the extra effort to
search deeply for elements w/o penalizing the fast-path lookup costs too much when
walking the bucket.)

> I think bpf timer map can model it the same way.
> At map creation time the value_size will specify the amount of extra
> bytes necessary.
> Another alternative is to pass an extra data argument to a callback.
Alexei Starovoitov Jan. 28, 2021, 2:45 a.m. UTC | #5
On Wed, Jan 27, 2021 at 2:48 PM Daniel Borkmann <daniel@iogearbox.net> wrote:
>
> On 1/27/21 7:00 PM, Alexei Starovoitov wrote:
> > On Tue, Jan 26, 2021 at 11:00 PM Cong Wang <xiyou.wangcong@gmail.com> wrote:
> >>>>                ret = PTR_ERR(l_new);
> >>>> +             if (ret == -EAGAIN) {
> >>>> +                     htab_unlock_bucket(htab, b, hash, flags);
> >>>> +                     htab_gc_elem(htab, l_old);
> >>>> +                     mod_delayed_work(system_unbound_wq, &htab->gc_work, 0);
> >>>> +                     goto again;
> >>>
> >>> Also this one looks rather worrying, so the BPF prog is stalled here, loop-waiting
> >>> in (e.g. XDP) hot path for system_unbound_wq to kick in to make forward progress?
> >>
> >> In this case, the old one is scheduled for removal in GC, we just wait for GC
> >> to finally remove it. It won't stall unless GC itself or the worker scheduler is
> >> wrong, both of which should be kernel bugs.
> >>
> >> If we don't do this, users would get a -E2BIG when it is not too big. I don't
> >> know a better way to handle this sad situation, maybe returning -EBUSY
> >> to users and let them call again?
> >
> > I think using wq for timers is a non-starter.
> > Tying a hash/lru map with a timer is not a good idea either.
>
> Thinking some more, given we have jiffies64 helper and atomic ops for BPF by now,
> we would technically only need the ability to delete entries via bpf iter progs
> (d6c4503cc296 ("bpf: Implement bpf iterator for hash maps")) which could then be
> kicked off from user space at e.g. dynamic intervals which would be the equivalent
> for the wq in here. That patch could then be implemented this way. I presume
> the ability to delete map entries from bpf iter progs would be generic and useful
> enough anyway.

I think bpf_iter for maps doesn't hold bucket lock anymore, so delete of the
same element should be allowed already. Unless I've mistaken wip patches
with landed ones.
Soon there will be support for bpf_map_for_each_elem() iterator
running from the bpf program itself, so even more ways to do GC like work.

> > I think timers have to be done as independent objects similar to
> > how the kernel uses them.
> > Then there will be no question whether lru or hash map needs it.
> > The bpf prog author will be able to use timers with either.
> > The prog will be able to use timers without hash maps too.
> >
> > I'm proposing a timer map where each object will go through
> > bpf_timer_setup(timer, callback, flags);
> > where "callback" is a bpf subprogram.
> > Corresponding bpf_del_timer and bpf_mod_timer would work the same way
> > they are in the kernel.
> > The tricky part is kernel style of using from_timer() to access the
> > object with additional info.
>
> Would this mean N timer objs for N map elems? I presume not given this could be
> racy and would have huge extra mem overhead.

hmm. Not sure I got the point about overhead.
sizeof(struct timer_list) == 40 bytes.
Not a lot of overhead even if there is a timer object per flow.
When bpf_map_for_each_elem() lands the bpf prog could have one
timer that will kick one a second (or whatever GC period the prog author wants)
and does bpf_map_for_each_elem() once callback fires to delete elems
older than whatever interval the prog needs.
imo that would be a true generic way to combine
bpf_maps, bpf_iters and bpf_timers.

> Either way, timer obj could work, but
> then at the same time you could probably also solve it with the above; it's not
> like you need the timer to kick in at some /exact/ time, but rather at some point
> to clean up stale entries before the map gets full and worst case refuses updates
> for new entries. (In the ideal case though we wouldn't need the extra effort to
> search deeply for elements w/o penalizing the fast-path lookup costs too much when
> walking the bucket.)

Right. I wasn't suggesting to mess with the timer object at every lookup.
My understanding that mod_timer() isn't that fast to call a million
times a second.
bpf progs would have to be smart.

Instead of timer objects as a timer map (a collection of timers that
bpf progs can manipulate)
we can introduce them similar to "struct bpf_spin_lock".
Then "struct bpf_timer foo;" field can be embedded inside any map value
(both hash and array). The verifier work would be a bit trickier than
support for bpf_spin_lock,
but certainly doable.

My main point was that bpf should grow with small composable
building blocks instead of single purpose constructs.
I view hashmap+GC as an example of something that should be split
into smaller blocks.
Cong Wang Jan. 28, 2021, 6:28 a.m. UTC | #6
On Wed, Jan 27, 2021 at 10:00 AM Alexei Starovoitov
<alexei.starovoitov@gmail.com> wrote:
>
> On Tue, Jan 26, 2021 at 11:00 PM Cong Wang <xiyou.wangcong@gmail.com> wrote:
> > > >               ret = PTR_ERR(l_new);
> > > > +             if (ret == -EAGAIN) {
> > > > +                     htab_unlock_bucket(htab, b, hash, flags);
> > > > +                     htab_gc_elem(htab, l_old);
> > > > +                     mod_delayed_work(system_unbound_wq, &htab->gc_work, 0);
> > > > +                     goto again;
> > >
> > > Also this one looks rather worrying, so the BPF prog is stalled here, loop-waiting
> > > in (e.g. XDP) hot path for system_unbound_wq to kick in to make forward progress?
> >
> > In this case, the old one is scheduled for removal in GC, we just wait for GC
> > to finally remove it. It won't stall unless GC itself or the worker scheduler is
> > wrong, both of which should be kernel bugs.
> >
> > If we don't do this, users would get a -E2BIG when it is not too big. I don't
> > know a better way to handle this sad situation, maybe returning -EBUSY
> > to users and let them call again?
>
> I think using wq for timers is a non-starter.
> Tying a hash/lru map with a timer is not a good idea either.

Both xt_hashlimit and nf_conntrack_core use delayed/deferrable
works, probably since their beginnings. They seem to have started
well. ;)

>
> I think timers have to be done as independent objects similar to
> how the kernel uses them.
> Then there will be no question whether lru or hash map needs it.

Yeah, this probably could make the code easier, but when we have
millions of entries in a map, millions of timers would certainly bring
a lot of CPU overhead (timer interrupt storm?).


> The bpf prog author will be able to use timers with either.
> The prog will be able to use timers without hash maps too.
>
> I'm proposing a timer map where each object will go through
> bpf_timer_setup(timer, callback, flags);
> where "callback" is a bpf subprogram.
> Corresponding bpf_del_timer and bpf_mod_timer would work the same way
> they are in the kernel.
> The tricky part is kernel style of using from_timer() to access the
> object with additional info.
> I think bpf timer map can model it the same way.
> At map creation time the value_size will specify the amount of extra
> bytes necessary.
> Another alternative is to pass an extra data argument to a callback.

Hmm, this idea is very interesting. I still think arming a timer,
whether a kernel timer or a bpf timer, for each entry is overkill,
but we can arm one for each map, something like:

bpf_timer_run(interval, bpf_prog, &any_map);

so we run 'bpf_prog' on any map every 'interval', but the 'bpf_prog'
would have to iterate the whole map during each interval to delete
the expired ones. This is probably doable: the timestamps can be
stored either as a part of key or value, and bpf_jiffies64() is already
available, users would have to discard expired ones after lookup
when they are faster than the timer GC.

Let me take a deeper look tomorrow.

Thanks.
Alexei Starovoitov Jan. 29, 2021, 2:54 a.m. UTC | #7
On Wed, Jan 27, 2021 at 10:28:15PM -0800, Cong Wang wrote:
> On Wed, Jan 27, 2021 at 10:00 AM Alexei Starovoitov
> <alexei.starovoitov@gmail.com> wrote:
> >
> > On Tue, Jan 26, 2021 at 11:00 PM Cong Wang <xiyou.wangcong@gmail.com> wrote:
> > > > >               ret = PTR_ERR(l_new);
> > > > > +             if (ret == -EAGAIN) {
> > > > > +                     htab_unlock_bucket(htab, b, hash, flags);
> > > > > +                     htab_gc_elem(htab, l_old);
> > > > > +                     mod_delayed_work(system_unbound_wq, &htab->gc_work, 0);
> > > > > +                     goto again;
> > > >
> > > > Also this one looks rather worrying, so the BPF prog is stalled here, loop-waiting
> > > > in (e.g. XDP) hot path for system_unbound_wq to kick in to make forward progress?
> > >
> > > In this case, the old one is scheduled for removal in GC, we just wait for GC
> > > to finally remove it. It won't stall unless GC itself or the worker scheduler is
> > > wrong, both of which should be kernel bugs.
> > >
> > > If we don't do this, users would get a -E2BIG when it is not too big. I don't
> > > know a better way to handle this sad situation, maybe returning -EBUSY
> > > to users and let them call again?
> >
> > I think using wq for timers is a non-starter.
> > Tying a hash/lru map with a timer is not a good idea either.
> 
> Both xt_hashlimit and nf_conntrack_core use delayed/deferrable
> works, probably since their beginnings. They seem to have started
> well. ;)

That code was written when network speed was in Mbits and DDoS abbreviation
wasn't invented. Things are different now.

> > I'm proposing a timer map where each object will go through
> > bpf_timer_setup(timer, callback, flags);
> > where "callback" is a bpf subprogram.
> > Corresponding bpf_del_timer and bpf_mod_timer would work the same way
> > they are in the kernel.
> > The tricky part is kernel style of using from_timer() to access the
> > object with additional info.
> > I think bpf timer map can model it the same way.
> > At map creation time the value_size will specify the amount of extra
> > bytes necessary.
> > Another alternative is to pass an extra data argument to a callback.
> 
> Hmm, this idea is very interesting. I still think arming a timer,
> whether a kernel timer or a bpf timer, for each entry is overkill,
> but we can arm one for each map, something like:
> 
> bpf_timer_run(interval, bpf_prog, &any_map);
> 
> so we run 'bpf_prog' on any map every 'interval', but the 'bpf_prog'
> would have to iterate the whole map during each interval to delete
> the expired ones. This is probably doable: the timestamps can be
> stored either as a part of key or value, and bpf_jiffies64() is already
> available, users would have to discard expired ones after lookup
> when they are faster than the timer GC.

I meant it would look like:

noinline per_elem_callback(map, key, value, ...)
{
  if (value->foo > ...)
    bpf_delete_map_elem(map, key);
}

noinline timer_callback(timer, ctx)
{
  map = ctx->map;
  bpf_for_each_map_elem(map, per_elem_callback, ...);
}

int main_bpf_prog(skb)
{
  bpf_timer_setup(my_timer, timer_callback, ...);
  bpf_mod_timer(my_timer, HZ);
}

The bpf_for_each_map_elem() work is already in progress. Expect patches to hit
mailing list soon.
If you can work on patches for bpf_timer_*() it would be awesome.
Cong Wang Jan. 29, 2021, 5:57 a.m. UTC | #8
On Thu, Jan 28, 2021 at 6:54 PM Alexei Starovoitov
<alexei.starovoitov@gmail.com> wrote:
>
> I meant it would look like:
>
> noinline per_elem_callback(map, key, value, ...)
> {
>   if (value->foo > ...)
>     bpf_delete_map_elem(map, key);
> }
>
> noinline timer_callback(timer, ctx)
> {
>   map = ctx->map;
>   bpf_for_each_map_elem(map, per_elem_callback, ...);
> }
>
> int main_bpf_prog(skb)
> {
>   bpf_timer_setup(my_timer, timer_callback, ...);
>   bpf_mod_timer(my_timer, HZ);
> }
>
> The bpf_for_each_map_elem() work is already in progress. Expect patches to hit
> mailing list soon.

We don't want a per-element timer, we want a per-map timer but that
requires a way to iterate the whole map. If you or other can provide
bpf_for_each_map_elem(), we can certainly build our timeout map
on top of it.

> If you can work on patches for bpf_timer_*() it would be awesome.

Yeah, I will work on this, not only for timeout map, but also possibly for
the ebpf qdisc I plan to add soon.

Thanks.
Yonghong Song Jan. 29, 2021, 6:21 a.m. UTC | #9
On 1/28/21 9:57 PM, Cong Wang wrote:
> On Thu, Jan 28, 2021 at 6:54 PM Alexei Starovoitov
> <alexei.starovoitov@gmail.com> wrote:
>>
>> I meant it would look like:
>>
>> noinline per_elem_callback(map, key, value, ...)
>> {
>>    if (value->foo > ...)
>>      bpf_delete_map_elem(map, key);
>> }
>>
>> noinline timer_callback(timer, ctx)
>> {
>>    map = ctx->map;
>>    bpf_for_each_map_elem(map, per_elem_callback, ...);
>> }
>>
>> int main_bpf_prog(skb)
>> {
>>    bpf_timer_setup(my_timer, timer_callback, ...);
>>    bpf_mod_timer(my_timer, HZ);
>> }
>>
>> The bpf_for_each_map_elem() work is already in progress. Expect patches to hit
>> mailing list soon.
> 
> We don't want a per-element timer, we want a per-map timer but that
> requires a way to iterate the whole map. If you or other can provide
> bpf_for_each_map_elem(), we can certainly build our timeout map
> on top of it.

I am working on this. Still need a few weeks to post RFC. Will share
as soon as it is in reasonable shape. Thanks!

> 
>> If you can work on patches for bpf_timer_*() it would be awesome.
> 
> Yeah, I will work on this, not only for timeout map, but also possibly for
> the ebpf qdisc I plan to add soon.
> 
> Thanks.
>
Alexei Starovoitov Jan. 30, 2021, 3:14 a.m. UTC | #10
On Fri, Jan 29, 2021 at 6:14 AM Jamal Hadi Salim <jhs@mojatatu.com> wrote:
>
> On 2021-01-29 9:06 a.m., Jamal Hadi Salim wrote:
>
> > Which leads to:
> > Why not extend the general feature so one can register for optional
> > callbacks not just for expire but also add/del/update on specific
> > entries or table?
> > add/del/update could be sourced from other kernel programs or user space
> > and the callback would be invoked before an entry is added/deleted etc.
> > (just like it is here for expiry).
>
> Sorry - shouldve read the rest of the thread:
> Agree with Cong that you want per-map but there are use cases where you
> want it per entry (eg the add/del/update case).

That was my point as well.
bpf_timer api should be generic, so that users can do both.
The program could use bpf_timer one for each flow and bpf_timer for each map.
And timers without maps.
Jamal Hadi Salim Jan. 31, 2021, 8:35 p.m. UTC | #11
On 2021-01-29 10:14 p.m., Alexei Starovoitov wrote:
> On Fri, Jan 29, 2021 at 6:14 AM Jamal Hadi Salim <jhs@mojatatu.com> wrote:
>>
>> On 2021-01-29 9:06 a.m., Jamal Hadi Salim wrote:
>>
>>> Which leads to:
>>> Why not extend the general feature so one can register for optional
>>> callbacks not just for expire but also add/del/update on specific
>>> entries or table?
>>> add/del/update could be sourced from other kernel programs or user space
>>> and the callback would be invoked before an entry is added/deleted etc.
>>> (just like it is here for expiry).
>>
>> Sorry - shouldve read the rest of the thread:
>> Agree with Cong that you want per-map but there are use cases where you
>> want it per entry (eg the add/del/update case).
> 
> That was my point as well.
> bpf_timer api should be generic, so that users can do both.
> The program could use bpf_timer one for each flow and bpf_timer for each map.
> And timers without maps.

I like it. Sensible to also have callback invocations for map
changes i.e entry create/update/delete (maybe map create/destroy).

cheers,
jamal
diff mbox series

Patch

diff --git a/include/linux/bpf_types.h b/include/linux/bpf_types.h
index 99f7fd657d87..00a3b17b6af2 100644
--- a/include/linux/bpf_types.h
+++ b/include/linux/bpf_types.h
@@ -125,6 +125,7 @@  BPF_MAP_TYPE(BPF_MAP_TYPE_STACK, stack_map_ops)
 BPF_MAP_TYPE(BPF_MAP_TYPE_STRUCT_OPS, bpf_struct_ops_map_ops)
 #endif
 BPF_MAP_TYPE(BPF_MAP_TYPE_RINGBUF, ringbuf_map_ops)
+BPF_MAP_TYPE(BPF_MAP_TYPE_TIMEOUT_HASH, htab_timeout_map_ops)
 
 BPF_LINK_TYPE(BPF_LINK_TYPE_RAW_TRACEPOINT, raw_tracepoint)
 BPF_LINK_TYPE(BPF_LINK_TYPE_TRACING, tracing)
diff --git a/include/uapi/linux/bpf.h b/include/uapi/linux/bpf.h
index c001766adcbc..9c9d8c194b39 100644
--- a/include/uapi/linux/bpf.h
+++ b/include/uapi/linux/bpf.h
@@ -164,6 +164,7 @@  enum bpf_map_type {
 	BPF_MAP_TYPE_RINGBUF,
 	BPF_MAP_TYPE_INODE_STORAGE,
 	BPF_MAP_TYPE_TASK_STORAGE,
+	BPF_MAP_TYPE_TIMEOUT_HASH,
 };
 
 /* Note that tracing related programs such as
@@ -399,7 +400,9 @@  enum bpf_link_type {
  */
 #define BPF_PSEUDO_CALL		1
 
-/* flags for BPF_MAP_UPDATE_ELEM command */
+/* flags for BPF_MAP_UPDATE_ELEM command, upper 32 bits are timeout for
+ * BPF_MAP_TYPE_TIMEOUT_HASH (in milliseconds).
+ */
 enum {
 	BPF_ANY		= 0, /* create new element or update existing */
 	BPF_NOEXIST	= 1, /* create new element if it didn't exist */
diff --git a/kernel/bpf/hashtab.c b/kernel/bpf/hashtab.c
index c1ac7f964bc9..90a899d1706f 100644
--- a/kernel/bpf/hashtab.c
+++ b/kernel/bpf/hashtab.c
@@ -8,6 +8,8 @@ 
 #include <linux/filter.h>
 #include <linux/rculist_nulls.h>
 #include <linux/random.h>
+#include <linux/llist.h>
+#include <linux/workqueue.h>
 #include <uapi/linux/btf.h>
 #include <linux/rcupdate_trace.h>
 #include "percpu_freelist.h"
@@ -104,6 +106,9 @@  struct bpf_htab {
 	u32 hashrnd;
 	struct lock_class_key lockdep_key;
 	int __percpu *map_locked[HASHTAB_MAP_LOCK_COUNT];
+	struct llist_head gc_list;
+	struct delayed_work gc_work;
+	u64 next_gc; /* in jiffies64 */
 };
 
 /* each htab element is struct htab_elem + key + value */
@@ -122,6 +127,11 @@  struct htab_elem {
 	union {
 		struct rcu_head rcu;
 		struct bpf_lru_node lru_node;
+		struct {
+			u64 expires; /* in jiffies64 */
+			struct llist_node gc_node;
+			atomic_t pending;
+		};
 	};
 	u32 hash;
 	char key[] __aligned(8);
@@ -429,8 +439,34 @@  static int htab_map_alloc_check(union bpf_attr *attr)
 	return 0;
 }
 
+static bool htab_elem_expired(struct htab_elem *e)
+{
+	return time_after_eq64(get_jiffies_64(), e->expires);
+}
+
+/* Schedule GC to remove an expired element, unless it is already pending. */
+static void htab_gc_elem(struct bpf_htab *htab, struct htab_elem *e)
+{
+	if (atomic_xchg(&e->pending, 1))
+		return;
+	llist_add(&e->gc_node, &htab->gc_list);
+	queue_delayed_work(system_unbound_wq, &htab->gc_work, 0);
+}
+
+/* GC an element if it has been expired, return whether the element is expired
+ * or not.
+ */
+static bool htab_expire_elem(struct bpf_htab *htab, struct htab_elem *e)
+{
+	if (!htab_elem_expired(e))
+		return false;
+	htab_gc_elem(htab, e);
+	return true;
+}
+
 static struct bpf_map *htab_map_alloc(union bpf_attr *attr)
 {
+	bool timeout_map = attr->map_type == BPF_MAP_TYPE_TIMEOUT_HASH;
 	bool percpu = (attr->map_type == BPF_MAP_TYPE_PERCPU_HASH ||
 		       attr->map_type == BPF_MAP_TYPE_LRU_PERCPU_HASH);
 	bool lru = (attr->map_type == BPF_MAP_TYPE_LRU_HASH ||
@@ -509,7 +545,7 @@  static struct bpf_map *htab_map_alloc(union bpf_attr *attr)
 		if (err)
 			goto free_map_locked;
 
-		if (!percpu && !lru) {
+		if (!percpu && !lru && !timeout_map) {
 			/* lru itself can remove the least used element, so
 			 * there is no need for an extra elem during map_update.
 			 */
@@ -730,6 +766,7 @@  static bool htab_lru_map_delete_node(void *arg, struct bpf_lru_node *node)
 static int htab_map_get_next_key(struct bpf_map *map, void *key, void *next_key)
 {
 	struct bpf_htab *htab = container_of(map, struct bpf_htab, map);
+	bool is_timeout = map->map_type == BPF_MAP_TYPE_TIMEOUT_HASH;
 	struct hlist_nulls_head *head;
 	struct htab_elem *l, *next_l;
 	u32 hash, key_size;
@@ -757,6 +794,8 @@  static int htab_map_get_next_key(struct bpf_map *map, void *key, void *next_key)
 				  struct htab_elem, hash_node);
 
 	if (next_l) {
+		if (is_timeout && htab_expire_elem(htab, next_l))
+			goto find_first_elem;
 		/* if next elem in this hash list is non-zero, just return it */
 		memcpy(next_key, next_l->key, key_size);
 		return 0;
@@ -775,6 +814,8 @@  static int htab_map_get_next_key(struct bpf_map *map, void *key, void *next_key)
 		next_l = hlist_nulls_entry_safe(rcu_dereference_raw(hlist_nulls_first_rcu(head)),
 					  struct htab_elem, hash_node);
 		if (next_l) {
+			if (is_timeout && htab_expire_elem(htab, next_l))
+				continue;
 			/* if it's not empty, just return it */
 			memcpy(next_key, next_l->key, key_size);
 			return 0;
@@ -877,6 +918,7 @@  static bool fd_htab_map_needs_adjust(const struct bpf_htab *htab)
 static struct htab_elem *alloc_htab_elem(struct bpf_htab *htab, void *key,
 					 void *value, u32 key_size, u32 hash,
 					 bool percpu, bool onallcpus,
+					 bool timeout_map,
 					 struct htab_elem *old_elem)
 {
 	u32 size = htab->map.value_size;
@@ -885,9 +927,11 @@  static struct htab_elem *alloc_htab_elem(struct bpf_htab *htab, void *key,
 	void __percpu *pptr;
 
 	if (prealloc) {
-		if (old_elem) {
+		if (old_elem && !timeout_map) {
 			/* if we're updating the existing element,
-			 * use per-cpu extra elems to avoid freelist_pop/push
+			 * use per-cpu extra elems to avoid freelist_pop/push.
+			 * The timeout map is special, we can not simply put it
+			 * back for recycle without going through GC.
 			 */
 			pl_new = this_cpu_ptr(htab->extra_elems);
 			l_new = *pl_new;
@@ -897,8 +941,12 @@  static struct htab_elem *alloc_htab_elem(struct bpf_htab *htab, void *key,
 			struct pcpu_freelist_node *l;
 
 			l = __pcpu_freelist_pop(&htab->freelist);
-			if (!l)
-				return ERR_PTR(-E2BIG);
+			if (!l) {
+				if (timeout_map && old_elem)
+					return ERR_PTR(-EAGAIN);
+				else
+					return ERR_PTR(-E2BIG);
+			}
 			l_new = container_of(l, struct htab_elem, fnode);
 		}
 	} else {
@@ -952,6 +1000,8 @@  static struct htab_elem *alloc_htab_elem(struct bpf_htab *htab, void *key,
 			       value);
 	}
 
+	if (timeout_map)
+		atomic_set(&l_new->pending, 0);
 	l_new->hash = hash;
 	return l_new;
 dec_count:
@@ -973,18 +1023,37 @@  static int check_flags(struct bpf_htab *htab, struct htab_elem *l_old,
 	return 0;
 }
 
+static u64 msecs_to_expire(u32 ms)
+{
+	u64 tmp = ms * NSEC_PER_MSEC;
+
+	return nsecs_to_jiffies64(tmp) + get_jiffies_64();
+}
+
+static u32 fetch_timeout(u64 *map_flags)
+{
+	u32 timeout = (*map_flags) >> 32;
+
+	*map_flags = (*map_flags) & 0xffffffff;
+	return timeout;
+}
+
 /* Called from syscall or from eBPF program */
 static int htab_map_update_elem(struct bpf_map *map, void *key, void *value,
 				u64 map_flags)
 {
 	struct bpf_htab *htab = container_of(map, struct bpf_htab, map);
+	bool timeout_map = map->map_type == BPF_MAP_TYPE_TIMEOUT_HASH;
 	struct htab_elem *l_new = NULL, *l_old;
 	struct hlist_nulls_head *head;
 	unsigned long flags;
 	struct bucket *b;
 	u32 key_size, hash;
+	u32 timeout;
 	int ret;
 
+	timeout = fetch_timeout(&map_flags);
+
 	if (unlikely((map_flags & ~BPF_F_LOCK) > BPF_EXIST))
 		/* unknown flags */
 		return -EINVAL;
@@ -1012,6 +1081,8 @@  static int htab_map_update_elem(struct bpf_map *map, void *key, void *value,
 			copy_map_value_locked(map,
 					      l_old->key + round_up(key_size, 8),
 					      value, false);
+			if (timeout_map)
+				l_old->expires = msecs_to_expire(timeout);
 			return 0;
 		}
 		/* fall through, grab the bucket lock and lookup again.
@@ -1020,6 +1091,7 @@  static int htab_map_update_elem(struct bpf_map *map, void *key, void *value,
 		 */
 	}
 
+again:
 	ret = htab_lock_bucket(htab, b, hash, &flags);
 	if (ret)
 		return ret;
@@ -1040,26 +1112,41 @@  static int htab_map_update_elem(struct bpf_map *map, void *key, void *value,
 		copy_map_value_locked(map,
 				      l_old->key + round_up(key_size, 8),
 				      value, false);
+		if (timeout_map)
+			l_old->expires = msecs_to_expire(timeout);
 		ret = 0;
 		goto err;
 	}
 
 	l_new = alloc_htab_elem(htab, key, value, key_size, hash, false, false,
-				l_old);
+				timeout_map, l_old);
 	if (IS_ERR(l_new)) {
-		/* all pre-allocated elements are in use or memory exhausted */
 		ret = PTR_ERR(l_new);
+		if (ret == -EAGAIN) {
+			htab_unlock_bucket(htab, b, hash, flags);
+			htab_gc_elem(htab, l_old);
+			mod_delayed_work(system_unbound_wq, &htab->gc_work, 0);
+			goto again;
+		}
+		/* all pre-allocated elements are in use or memory exhausted */
 		goto err;
 	}
 
+	if (timeout_map)
+		l_new->expires = msecs_to_expire(timeout);
+
 	/* add new element to the head of the list, so that
 	 * concurrent search will find it before old elem
 	 */
 	hlist_nulls_add_head_rcu(&l_new->hash_node, head);
 	if (l_old) {
-		hlist_nulls_del_rcu(&l_old->hash_node);
-		if (!htab_is_prealloc(htab))
-			free_htab_elem(htab, l_old);
+		if (timeout_map) {
+			htab_gc_elem(htab, l_old);
+		} else {
+			hlist_nulls_del_rcu(&l_old->hash_node);
+			if (!htab_is_prealloc(htab))
+				free_htab_elem(htab, l_old);
+		}
 	}
 	ret = 0;
 err:
@@ -1173,7 +1260,7 @@  static int __htab_percpu_map_update_elem(struct bpf_map *map, void *key,
 				value, onallcpus);
 	} else {
 		l_new = alloc_htab_elem(htab, key, value, key_size,
-					hash, true, onallcpus, NULL);
+					hash, true, onallcpus, false, NULL);
 		if (IS_ERR(l_new)) {
 			ret = PTR_ERR(l_new);
 			goto err;
@@ -1269,6 +1356,7 @@  static int htab_lru_percpu_map_update_elem(struct bpf_map *map, void *key,
 static int htab_map_delete_elem(struct bpf_map *map, void *key)
 {
 	struct bpf_htab *htab = container_of(map, struct bpf_htab, map);
+	bool is_timeout = map->map_type == BPF_MAP_TYPE_TIMEOUT_HASH;
 	struct hlist_nulls_head *head;
 	struct bucket *b;
 	struct htab_elem *l;
@@ -1291,8 +1379,14 @@  static int htab_map_delete_elem(struct bpf_map *map, void *key)
 	l = lookup_elem_raw(head, hash, key, key_size);
 
 	if (l) {
-		hlist_nulls_del_rcu(&l->hash_node);
-		free_htab_elem(htab, l);
+		if (is_timeout) {
+			if (htab_elem_expired(l))
+				ret = -ENOENT;
+			htab_gc_elem(htab, l);
+		} else {
+			hlist_nulls_del_rcu(&l->hash_node);
+			free_htab_elem(htab, l);
+		}
 	} else {
 		ret = -ENOENT;
 	}
@@ -2178,3 +2272,157 @@  const struct bpf_map_ops htab_of_maps_map_ops = {
 	.map_btf_name = "bpf_htab",
 	.map_btf_id = &htab_of_maps_map_btf_id,
 };
+
+#define HTAB_GC_INTERVAL HZ
+
+static void htab_gc(struct work_struct *work)
+{
+	struct htab_elem *e, *tmp;
+	struct llist_node *lhead;
+	struct bpf_htab *htab;
+	int i, count;
+
+	htab = container_of(work, struct bpf_htab, gc_work.work);
+	lhead = llist_del_all(&htab->gc_list);
+
+	llist_for_each_entry_safe(e, tmp, lhead, gc_node) {
+		unsigned long flags;
+		struct bucket *b;
+		u32 hash;
+
+		hash = e->hash;
+		b = __select_bucket(htab, hash);
+		if (htab_lock_bucket(htab, b, hash, &flags))
+			continue;
+		hlist_nulls_del_rcu(&e->hash_node);
+		atomic_set(&e->pending, 0);
+		free_htab_elem(htab, e);
+		htab_unlock_bucket(htab, b, hash, flags);
+
+		cond_resched();
+	}
+
+	if (time_after_eq64(htab->next_gc, get_jiffies_64())) {
+		queue_delayed_work(system_unbound_wq, &htab->gc_work,
+				   htab->next_gc - get_jiffies_64());
+		return;
+	}
+
+	for (count = 0, i = 0; i < htab->n_buckets; i++) {
+		struct hlist_nulls_head *head;
+		struct hlist_nulls_node *n;
+		struct htab_elem *l;
+
+		rcu_read_lock();
+		head = select_bucket(htab, i);
+		hlist_nulls_for_each_entry_rcu(l, n, head, hash_node)
+			if (htab_expire_elem(htab, l))
+				count++;
+		rcu_read_unlock();
+
+		cond_resched();
+	}
+	htab->next_gc = get_jiffies_64() + HTAB_GC_INTERVAL;
+
+	queue_delayed_work(system_unbound_wq, &htab->gc_work,
+			   count ? 0 : HTAB_GC_INTERVAL);
+}
+
+static void *__htab_timeout_map_lookup_elem(struct bpf_map *map, void *key)
+{
+	struct bpf_htab *htab = container_of(map, struct bpf_htab, map);
+	struct htab_elem *l;
+
+	l = __htab_map_lookup_elem(map, key);
+	if (l && htab_expire_elem(htab, l))
+		l = NULL;
+
+	return l;
+}
+
+static void *htab_timeout_map_lookup_elem(struct bpf_map *map, void *key)
+{
+	struct htab_elem *l = __htab_timeout_map_lookup_elem(map, key);
+
+	if (l)
+		return l->key + round_up(map->key_size, 8);
+	return NULL;
+}
+
+static int htab_timeout_map_gen_lookup(struct bpf_map *map, struct bpf_insn *insn_buf)
+{
+	struct bpf_insn *insn = insn_buf;
+	const int ret = BPF_REG_0;
+
+	BUILD_BUG_ON(!__same_type(&__htab_timeout_map_lookup_elem,
+		     (void *(*)(struct bpf_map *map, void *key))NULL));
+	*insn++ = BPF_EMIT_CALL(BPF_CAST_CALL(__htab_timeout_map_lookup_elem));
+	*insn++ = BPF_JMP_IMM(BPF_JEQ, ret, 0, 1);
+	*insn++ = BPF_ALU64_IMM(BPF_ADD, ret,
+				offsetof(struct htab_elem, key) +
+				round_up(map->key_size, 8));
+	return insn - insn_buf;
+}
+
+static void htab_timeout_map_seq_show_elem(struct bpf_map *map, void *key,
+					   struct seq_file *m)
+{
+	void *value;
+
+	rcu_read_lock();
+
+	value = htab_timeout_map_lookup_elem(map, key);
+	if (!value) {
+		rcu_read_unlock();
+		return;
+	}
+
+	btf_type_seq_show(map->btf, map->btf_key_type_id, key, m);
+	seq_puts(m, ": ");
+	btf_type_seq_show(map->btf, map->btf_value_type_id, value, m);
+	seq_puts(m, "\n");
+
+	rcu_read_unlock();
+}
+
+static struct bpf_map *htab_timeout_map_alloc(union bpf_attr *attr)
+{
+	struct bpf_map *map = htab_map_alloc(attr);
+	struct bpf_htab *htab;
+
+	if (!IS_ERR(map)) {
+		htab = container_of(map, struct bpf_htab, map);
+		init_llist_head(&htab->gc_list);
+		INIT_DEFERRABLE_WORK(&htab->gc_work, htab_gc);
+		htab->next_gc = get_jiffies_64() + HTAB_GC_INTERVAL;
+		queue_delayed_work(system_unbound_wq, &htab->gc_work,
+				   HTAB_GC_INTERVAL);
+	}
+
+	return map;
+}
+
+static void htab_timeout_map_free(struct bpf_map *map)
+{
+	struct bpf_htab *htab = container_of(map, struct bpf_htab, map);
+
+	cancel_delayed_work_sync(&htab->gc_work);
+	htab_map_free(map);
+}
+
+static int htab_timeout_map_btf_id;
+const struct bpf_map_ops htab_timeout_map_ops = {
+	.map_meta_equal = bpf_map_meta_equal,
+	.map_alloc_check = htab_map_alloc_check,
+	.map_alloc = htab_timeout_map_alloc,
+	.map_free = htab_timeout_map_free,
+	.map_get_next_key = htab_map_get_next_key,
+	.map_lookup_elem = htab_timeout_map_lookup_elem,
+	.map_update_elem = htab_map_update_elem,
+	.map_delete_elem = htab_map_delete_elem,
+	.map_gen_lookup = htab_timeout_map_gen_lookup,
+	.map_seq_show_elem = htab_timeout_map_seq_show_elem,
+	.map_btf_name = "bpf_htab",
+	.map_btf_id = &htab_timeout_map_btf_id,
+	.iter_seq_info = &iter_seq_info,
+};
diff --git a/kernel/bpf/syscall.c b/kernel/bpf/syscall.c
index e5999d86c76e..f15b340d33ca 100644
--- a/kernel/bpf/syscall.c
+++ b/kernel/bpf/syscall.c
@@ -777,7 +777,8 @@  static int map_check_btf(struct bpf_map *map, const struct btf *btf,
 		    map->map_type != BPF_MAP_TYPE_CGROUP_STORAGE &&
 		    map->map_type != BPF_MAP_TYPE_SK_STORAGE &&
 		    map->map_type != BPF_MAP_TYPE_INODE_STORAGE &&
-		    map->map_type != BPF_MAP_TYPE_TASK_STORAGE)
+		    map->map_type != BPF_MAP_TYPE_TASK_STORAGE &&
+		    map->map_type != BPF_MAP_TYPE_TIMEOUT_HASH)
 			return -ENOTSUPP;
 		if (map->spin_lock_off + sizeof(struct bpf_spin_lock) >
 		    map->value_size) {
diff --git a/tools/include/uapi/linux/bpf.h b/tools/include/uapi/linux/bpf.h
index c001766adcbc..ac6ddfd7bddc 100644
--- a/tools/include/uapi/linux/bpf.h
+++ b/tools/include/uapi/linux/bpf.h
@@ -164,6 +164,7 @@  enum bpf_map_type {
 	BPF_MAP_TYPE_RINGBUF,
 	BPF_MAP_TYPE_INODE_STORAGE,
 	BPF_MAP_TYPE_TASK_STORAGE,
+	BPF_MAP_TYPE_TIMEOUT_HASH,
 };
 
 /* Note that tracing related programs such as