[RFC,bpf-next,0/3] Add support for qp-trie map

Message ID: 20220726130005.3102470-1-houtao1@huawei.com

Message

Hou Tao July 26, 2022, 1 p.m. UTC
Hi,

The initial motivation for the qp-trie map is to reduce memory usage for
string keys, especially those with large differences in length, as
discussed in [0]. And as a big-endian lexicographically ordered map, it
can also be used for any binary data with fixed or variable length.
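
For reference, a minimal sketch of what such a variable-length key could
look like from the user side (the field layout here is an illustrative
assumption, not the final UAPI of this series):

  #include <linux/types.h>

  /* illustrative only: 'len' counts the valid bytes in data[], and keys
   * are compared lexicographically on those bytes
   */
  struct bpf_qp_trie_key {
          __u32 len;
          __u8 data[];
  };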

Now the basic functionality of qp-trie is ready, so I am posting an RFC
version to get more feedback or suggestions about qp-trie, especially
feedback on the following questions:

(1) Application scenarios for qp-trie
Andrii proposed re-implementing lpm-trie on top of qp-trie. The advantage
would be faster lookup operations due to the lower tree depth of qp-trie.
Update performance might also be improved, although in Cilium there is a
big lock around lpm-trie updates [1]. Are there any other use cases for
qp-trie? Especially cases which need both ordering and memory efficiency,
or cases in which jhash() of htab creates too many collisions and qp-trie
lookup performs better than hash-table lookup, as shown below:

  Randomly-generated binary data with variable length (length range=[1, 256] entries=16K)

  htab lookup      (1  thread)    5.062 ± 0.004M/s (drops 0.002 ± 0.000M/s mem 8.125 MiB)
  htab lookup      (2  thread)   10.256 ± 0.017M/s (drops 0.006 ± 0.000M/s mem 8.114 MiB)
  htab lookup      (4  thread)   20.383 ± 0.006M/s (drops 0.009 ± 0.000M/s mem 8.117 MiB)
  htab lookup      (8  thread)   40.727 ± 0.093M/s (drops 0.010 ± 0.000M/s mem 8.123 MiB)
  htab lookup      (16 thread)   81.333 ± 0.311M/s (drops 0.020 ± 0.000M/s mem 8.122 MiB)
  
  qp-trie lookup   (1  thread)   10.161 ± 0.008M/s (drops 0.006 ± 0.000M/s mem 4.847 MiB)
  qp-trie lookup   (2  thread)   20.287 ± 0.024M/s (drops 0.007 ± 0.000M/s mem 4.828 MiB)
  qp-trie lookup   (4  thread)   40.784 ± 0.020M/s (drops 0.015 ± 0.000M/s mem 4.071 MiB)
  qp-trie lookup   (8  thread)   81.165 ± 0.013M/s (drops 0.040 ± 0.000M/s mem 4.045 MiB)
  qp-trie lookup   (16 thread)  159.955 ± 0.014M/s (drops 0.108 ± 0.000M/s mem 4.495 MiB)

  * Non-zero drops are due to duplicate keys among the generated keys.

(2) More fine-grained locking in qp-trie
Currently qp-trie is divided into 256 sub-trees by using the first
character of the key, and each sub-tree is protected by its own spinlock
(a rough sketch of this scheme follows the table below). From the data
below, although the update/delete speed of qp-trie is slower than that of
the hash table, it scales similarly with the number of threads. So maybe
256 locks are a good enough solution?

  Strings in /proc/kallsyms
  htab update      (1  thread)    2.850 ± 0.129M/s (drops 0.000 ± 0.000M/s mem 33.564 MiB)
  htab update      (2  thread)    4.363 ± 0.031M/s (drops 0.000 ± 0.000M/s mem 33.563 MiB)
  htab update      (4  thread)    6.306 ± 0.096M/s (drops 0.000 ± 0.000M/s mem 33.718 MiB)
  htab update      (8  thread)    6.611 ± 0.026M/s (drops 0.000 ± 0.000M/s mem 33.627 MiB)
  htab update      (16 thread)    6.390 ± 0.015M/s (drops 0.000 ± 0.000M/s mem 33.564 MiB)
  qp-trie update   (1  thread)    1.157 ± 0.099M/s (drops 0.000 ± 0.000M/s mem 18.333 MiB)
  qp-trie update   (2  thread)    1.920 ± 0.062M/s (drops 0.000 ± 0.000M/s mem 18.293 MiB)
  qp-trie update   (4  thread)    2.630 ± 0.050M/s (drops 0.000 ± 0.000M/s mem 18.472 MiB)
  qp-trie update   (8  thread)    3.171 ± 0.027M/s (drops 0.000 ± 0.000M/s mem 18.301 MiB)
  qp-trie update   (16 thread)    3.782 ± 0.036M/s (drops 0.000 ± 0.000M/s mem 19.040 MiB)
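
As a rough illustration of the locking scheme described above (a sketch
with assumed names, not the code from the patch):

  #include <linux/spinlock.h>

  /* 256 sub-trees selected by the first byte of the key; each sub-tree
   * has its own spinlock, so updates to keys with different leading
   * bytes do not contend on a single global lock.
   */
  struct qp_trie_subtree {
          spinlock_t lock;
          void __rcu *root;
  };

  struct qp_trie {
          struct qp_trie_subtree subtrees[256];
  };

  static struct qp_trie_subtree *qp_trie_subtree_of(struct qp_trie *trie,
                                                    const unsigned char *key)
  {
          return &trie->subtrees[key[0]];
  }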

(3) Improve memory efficiency further
When using the strings in the BTF string section as the data set for
qp-trie, the slab memory usage shown in the cgroup memory.stat file is
about 11MB for qp-trie and 15MB for the hash table, as shown below.
However, the theoretical memory usage for qp-trie is ~6.8MB (~4.9MB if the
"parent" & "rcu" fields are removed from qp_trie_branch), so the extra
memory usage (about 38% of the total) mainly comes from internal
fragmentation in slab (namely 2^n alignment for allocations) and the
overhead of kmem-cgroup accounting. We could reduce the internal
fragmentation by creating separate kmem_caches for qp_trie_branch nodes
with different numbers of children, but I am not sure whether it is
worthwhile.
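
A rough sketch of that per-fan-out kmem_cache idea (the node layout and
the 2..17 fan-out range here are assumptions for illustration, not taken
from the patch):

  #include <linux/kernel.h>
  #include <linux/slab.h>

  /* illustrative branch node: a bitmap plus a variable number of children */
  struct qp_trie_branch {
          unsigned int bitmap;
          void __rcu *children[];
  };

  /* one cache per possible number of children, so a branch allocation is
   * not rounded up to the next power of two by kmalloc()
   */
  static struct kmem_cache *qp_trie_branch_caches[18];

  static int qp_trie_create_branch_caches(void)
  {
          unsigned int i;

          for (i = 2; i <= 17; i++) {
                  size_t size = sizeof(struct qp_trie_branch) +
                                i * sizeof(void *);
                  char name[32];

                  snprintf(name, sizeof(name), "qp_trie_branch-%u", i);
                  qp_trie_branch_caches[i] = kmem_cache_create(name, size, 0,
                                                               SLAB_ACCOUNT,
                                                               NULL);
                  if (!qp_trie_branch_caches[i])
                          return -ENOMEM;
          }
          return 0;
  }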

And in order to avoid allocating an rcu_head for each leaf node, currently
only branch nodes are RCU-freed. So when replacing a leaf node, a new
branch node and a new leaf node are allocated instead of swapping in a new
leaf node and RCU-freeing the old one. I am also not sure whether that
trade-off is worthwhile.

  Strings in BTF string section (entries=115980):
  htab lookup      (1  thread)    9.889 ± 0.006M/s (drops 0.000 ± 0.000M/s mem 15.069 MiB)
  qp-trie lookup   (1  thread)    5.132 ± 0.002M/s (drops 0.000 ± 0.000M/s mem 10.721 MiB)

  All files under linux kernel source directory (entries=74359):
  htab lookup      (1  thread)    8.418 ± 0.077M/s (drops 0.000 ± 0.000M/s mem 14.207 MiB)
  qp-trie lookup   (1  thread)    4.966 ± 0.003M/s (drops 0.000 ± 0.000M/s mem 9.355 MiB)

  Domain names for Alexa top million web site (entries=1000000):
  htab lookup      (1  thread)    4.551 ± 0.043M/s (drops 0.000 ± 0.000M/s mem 190.761 MiB)
  qp-trie lookup   (1  thread)    2.804 ± 0.017M/s (drops 0.000 ± 0.000M/s mem 83.194 MiB)

Comments and suggestions are always welcome.

Regards,
Tao

[0]: https://lore.kernel.org/bpf/CAEf4Bzb7keBS8vXgV5JZzwgNGgMV0X3_guQ_m9JW3X6fJBDpPQ@mail.gmail.com/
[1]: https://github.com/cilium/cilium/blob/5145e31cd65db3361f6538d5f5f899440b769070/pkg/datapath/prefilter/prefilter.go#L123

Hou Tao (3):
  bpf: Add support for qp-trie map
  selftests/bpf: add a simple test for qp-trie
  selftests/bpf: add benchmark for qp-trie map

 include/linux/bpf_types.h                     |    1 +
 include/uapi/linux/bpf.h                      |    8 +
 kernel/bpf/Makefile                           |    1 +
 kernel/bpf/bpf_qp_trie.c                      | 1064 +++++++++++++++++
 tools/include/uapi/linux/bpf.h                |    8 +
 tools/testing/selftests/bpf/Makefile          |    5 +-
 tools/testing/selftests/bpf/bench.c           |   10 +
 .../selftests/bpf/benchs/bench_qp_trie.c      |  499 ++++++++
 .../selftests/bpf/benchs/run_bench_qp_trie.sh |   55 +
 .../selftests/bpf/prog_tests/str_key.c        |   69 ++
 .../selftests/bpf/progs/qp_trie_bench.c       |  218 ++++
 tools/testing/selftests/bpf/progs/str_key.c   |   85 ++
 12 files changed, 2022 insertions(+), 1 deletion(-)
 create mode 100644 kernel/bpf/bpf_qp_trie.c
 create mode 100644 tools/testing/selftests/bpf/benchs/bench_qp_trie.c
 create mode 100755 tools/testing/selftests/bpf/benchs/run_bench_qp_trie.sh
 create mode 100644 tools/testing/selftests/bpf/prog_tests/str_key.c
 create mode 100644 tools/testing/selftests/bpf/progs/qp_trie_bench.c
 create mode 100644 tools/testing/selftests/bpf/progs/str_key.c

Comments

Hou Tao Aug. 2, 2022, 6:20 a.m. UTC | #1
ping ?

On 7/26/2022 9:00 PM, Hou Tao wrote:
> [...]
Andrii Nakryiko Aug. 2, 2022, 10:38 p.m. UTC | #2
On Tue, Jul 26, 2022 at 5:42 AM Hou Tao <houtao1@huawei.com> wrote:
>
> Hi,
>

Hey, sorry I'm catching up on upstream and there are just too many
complicated patch sets coming in, so it takes time to get through
them.

I think you did a great job with this implementation, it's certainly
worth submitting as non-RFC for proper upstream review. I know that
some people didn't get your patches, they got into spam somehow. So I
think it would be great to just resubmit it as non-RFC so that it
appears in patchwork as a to-be-reviewed patch set and hopefully will
get a wider audience to review this.

I've tried to answer some questions below, but would definitely like
more people to chime in. I haven't gone through the implementation in
detail, but superficially it looks pretty clean and certainly ready
for proper non-RFC review.

One point about the user API: maybe use bpf_dynptr as the interface for
specifying a variable-sized lookup key instead of the hard-coded
bpf_qp_trie_key. Please check the recent work by Joanne on bpf_dynptr.

In short: looks great, I think it's certainly worth adding this as a BPF
map type. Please submit as non-RFC and go through a proper review
process. Looking forward to it (even if that means reviewing 1000+ lines
of dense algorithmic code :) ).

> The initial motivation for qp-trie map is to reduce memory usage for
> string keys special those with large differencies in length as
> discussed in [0]. And as a big-endian lexicographical ordered map, it
> can also be used for any binary data with fixed or variable length.
>
> Now the basic functionality of qp-trie is ready, so posting a RFC version
> to get more feedback or suggestions about qp-trie. Specially feedback
> about the following questions:
>
> (1) Application scenario for qp-trie
> Andrii had proposed to re-implement lpm-trie by using qp-trie. The
> advantage would be the speed up of lookup operations due to lower tree
> depth of qp-trie. Maybe the performance of update could also be improved
> although in cillium there is a big lock during lpm-trie update [1]. Is

Well, using the qp-trie approach as an internal implementation of lpm-trie
is probably a major win already, what's wrong with that?

> there any other use cases for qp-trie ? Specially those cases which need
> both ordering and memory efficiency or cases in which jhash() of htab
> creates too much collisions and qp-trie lookup performances better than
> hash-table lookup as shown below:

I'm thinking about qp-trie as a dynamically growable lookup table.
That's a pretty big use case already. There is an RB tree proposal
under review right now which would also satisfy the dynamically growable
criterion, but its focus is slightly different and it remains to be
seen how convenient it will be as a general-purpose resizable
alternative to the hashmap. But from the benchmarks I've found online, an
RB tree will definitely use more memory than a qp-trie.

The ordered property also seems very useful, but I don't yet have a
specific use case for it. But once we have a data structure like this in
BPF, I'm sure use cases will pop up.

>
>   Randomly-generated binary data with variable length (length range=[1, 256] entries=16K)
>
>   htab lookup      (1  thread)    5.062 ± 0.004M/s (drops 0.002 ± 0.000M/s mem 8.125 MiB)
>   htab lookup      (2  thread)   10.256 ± 0.017M/s (drops 0.006 ± 0.000M/s mem 8.114 MiB)
>   htab lookup      (4  thread)   20.383 ± 0.006M/s (drops 0.009 ± 0.000M/s mem 8.117 MiB)
>   htab lookup      (8  thread)   40.727 ± 0.093M/s (drops 0.010 ± 0.000M/s mem 8.123 MiB)
>   htab lookup      (16 thread)   81.333 ± 0.311M/s (drops 0.020 ± 0.000M/s mem 8.122 MiB)
>
>   qp-trie lookup   (1  thread)   10.161 ± 0.008M/s (drops 0.006 ± 0.000M/s mem 4.847 MiB)
>   qp-trie lookup   (2  thread)   20.287 ± 0.024M/s (drops 0.007 ± 0.000M/s mem 4.828 MiB)
>   qp-trie lookup   (4  thread)   40.784 ± 0.020M/s (drops 0.015 ± 0.000M/s mem 4.071 MiB)
>   qp-trie lookup   (8  thread)   81.165 ± 0.013M/s (drops 0.040 ± 0.000M/s mem 4.045 MiB)
>   qp-trie lookup   (16 thread)  159.955 ± 0.014M/s (drops 0.108 ± 0.000M/s mem 4.495 MiB)
>

Kind of surprised that qp-trie is twice as fast as the hashmap. Do you
have any idea why the hashmap is slower? Was the htab pre-allocated? What
was its max_entries?

>   * non-zero drops is due to duplicated keys in generated keys.
>
> (2) more fine-grained lock in qp-trie
> Now qp-trie is divided into 256 sub-trees by using the first character of

character -> byte, it's not always strings, right?

> key and one sub-tree is protected one spinlock. From the data below,
> although the update/delete speed of qp-trie is slower compare with hash
> table, but it scales similar with hash table. So maybe 256-locks is a
> good enough solution ?
>
>   Strings in /proc/kallsyms
>   htab update      (1  thread)    2.850 ± 0.129M/s (drops 0.000 ± 0.000M/s mem 33.564 MiB)
>   htab update      (2  thread)    4.363 ± 0.031M/s (drops 0.000 ± 0.000M/s mem 33.563 MiB)
>   htab update      (4  thread)    6.306 ± 0.096M/s (drops 0.000 ± 0.000M/s mem 33.718 MiB)
>   htab update      (8  thread)    6.611 ± 0.026M/s (drops 0.000 ± 0.000M/s mem 33.627 MiB)
>   htab update      (16 thread)    6.390 ± 0.015M/s (drops 0.000 ± 0.000M/s mem 33.564 MiB)
>   qp-trie update   (1  thread)    1.157 ± 0.099M/s (drops 0.000 ± 0.000M/s mem 18.333 MiB)
>   qp-trie update   (2  thread)    1.920 ± 0.062M/s (drops 0.000 ± 0.000M/s mem 18.293 MiB)
>   qp-trie update   (4  thread)    2.630 ± 0.050M/s (drops 0.000 ± 0.000M/s mem 18.472 MiB)
>   qp-trie update   (8  thread)    3.171 ± 0.027M/s (drops 0.000 ± 0.000M/s mem 18.301 MiB)
>   qp-trie update   (16 thread)    3.782 ± 0.036M/s (drops 0.000 ± 0.000M/s mem 19.040 MiB)


qp-trie being slower than htab matches my expectation and what I
observed when trying to use qp-trie with libbpf. But I think for a lot
of cases the memory vs CPU tradeoff, coupled with the ability to
dynamically grow the qp-trie, will make this data structure worthwhile.


>
> (3) Improve memory efficiency further
> When using strings in BTF string section as a data set for qp-trie, the
> slab memory usage showed in cgroup memory.stats file is about 11MB for
> qp-trie and 15MB for hash table as shown below. However the theoretical
> memory usage for qp-trie is ~6.8MB (is ~4.9MB if removing "parent" & "rcu"
> fields from qp_trie_branch) and the extra memory usage (about 38% of total
> usage) mainly comes from internal fragment in slab (namely 2^n alignment
> for allocation) and overhead in kmem-cgroup accounting. We can reduce the
> internal fragment by creating separated kmem_cache for qp_trie_branch with
> different child nodes, but not sure whether it is worthy or not.
>

Please CC Paul McKenney (paulmck@kernel.org) on your non-RFC patch
set; maybe he has some good ideas on how to avoid having an rcu_head in
each leaf node. Maybe some sort of per-CPU queue of to-be-RCU-freed
elements, so that we don't have to keep 16 bytes in each leaf and
branch node?

> And in order to prevent allocating a rcu_head for each leaf node, now only
> branch node is RCU-freed, so when replacing a leaf node, a new branch node
> and a new leaf node will be allocated instead of replacing the old leaf
> node and RCU-freed the old leaf node. Also not sure whether or not it is
> worthy.
>
>   Strings in BTF string section (entries=115980):
>   htab lookup      (1  thread)    9.889 ± 0.006M/s (drops 0.000 ± 0.000M/s mem 15.069 MiB)
>   qp-trie lookup   (1  thread)    5.132 ± 0.002M/s (drops 0.000 ± 0.000M/s mem 10.721 MiB)
>
>   All files under linux kernel source directory (entries=74359):
>   htab lookup      (1  thread)    8.418 ± 0.077M/s (drops 0.000 ± 0.000M/s mem 14.207 MiB)
>   qp-trie lookup   (1  thread)    4.966 ± 0.003M/s (drops 0.000 ± 0.000M/s mem 9.355 MiB)
>
>   Domain names for Alexa top million web site (entries=1000000):
>   htab lookup      (1  thread)    4.551 ± 0.043M/s (drops 0.000 ± 0.000M/s mem 190.761 MiB)
>   qp-trie lookup   (1  thread)    2.804 ± 0.017M/s (drops 0.000 ± 0.000M/s mem 83.194 MiB)
>
> Comments and suggestions are always welcome.
>
> Regards,
> Tao
>
> [0]: https://lore.kernel.org/bpf/CAEf4Bzb7keBS8vXgV5JZzwgNGgMV0X3_guQ_m9JW3X6fJBDpPQ@mail.gmail.com/
> [1]: https://github.com/cilium/cilium/blob/5145e31cd65db3361f6538d5f5f899440b769070/pkg/datapath/prefilter/prefilter.go#L123
>
> Hou Tao (3):
>   bpf: Add support for qp-trie map
>   selftests/bpf: add a simple test for qp-trie
>   selftests/bpf: add benchmark for qp-trie map
>

>  include/linux/bpf_types.h                     |    1 +
>  include/uapi/linux/bpf.h                      |    8 +
>  kernel/bpf/Makefile                           |    1 +
>  kernel/bpf/bpf_qp_trie.c                      | 1064 +++++++++++++++++
>  tools/include/uapi/linux/bpf.h                |    8 +
>  tools/testing/selftests/bpf/Makefile          |    5 +-
>  tools/testing/selftests/bpf/bench.c           |   10 +
>  .../selftests/bpf/benchs/bench_qp_trie.c      |  499 ++++++++
>  .../selftests/bpf/benchs/run_bench_qp_trie.sh |   55 +
>  .../selftests/bpf/prog_tests/str_key.c        |   69 ++
>  .../selftests/bpf/progs/qp_trie_bench.c       |  218 ++++
>  tools/testing/selftests/bpf/progs/str_key.c   |   85 ++
>  12 files changed, 2022 insertions(+), 1 deletion(-)
>  create mode 100644 kernel/bpf/bpf_qp_trie.c
>  create mode 100644 tools/testing/selftests/bpf/benchs/bench_qp_trie.c
>  create mode 100755 tools/testing/selftests/bpf/benchs/run_bench_qp_trie.sh
>  create mode 100644 tools/testing/selftests/bpf/prog_tests/str_key.c
>  create mode 100644 tools/testing/selftests/bpf/progs/qp_trie_bench.c
>  create mode 100644 tools/testing/selftests/bpf/progs/str_key.c
>
> --
> 2.29.2
>
Toke Høiland-Jørgensen Aug. 8, 2022, 5:54 p.m. UTC | #3
Andrii Nakryiko <andrii.nakryiko@gmail.com> writes:

> On Tue, Jul 26, 2022 at 5:42 AM Hou Tao <houtao1@huawei.com> wrote:
>>
>> Hi,
>>
>
> Hey, sorry I'm catching up on upstream and there is just too many
> complicated patch sets coming in, so it takes time to get through
> them.
>
> I think you did a great job with this implementation, it's certainly
> worth submitting as non-RFC for proper upstream review. I know that
> some people didn't get your patches, they got into spam somehow. So I
> think it would be great to just resubmit it as non-RFC so that it
> appears in patchworks as to-be-reviewew patches and hopefully will get
> a wider audience to review this.
>
> I've tried to answer some questions below, but would definitely like
> more people to chime in. I haven't went through implementation in
> details, but superficially it looks pretty clean and certainly ready
> for proper non-RFC review.
>
> One point about user API would be to maybe instead use bpf_dynptr as
> an interface for specifying variable-sized lookup key instead of
> hard-coded
>  bpf_qp_trie_key. Please check recent work by Joanne on bpf_dynptr.
>
> In short: looks great, I think it's certainly worth adding this as BPF
> map type. Please submit as non-RFC and go through a proper review
> process. Looking forward (even if that means reviewing 1000k lines of
> dense algorithmic code :) ).
>
>> The initial motivation for qp-trie map is to reduce memory usage for
>> string keys special those with large differencies in length as
>> discussed in [0]. And as a big-endian lexicographical ordered map, it
>> can also be used for any binary data with fixed or variable length.
>>
>> Now the basic functionality of qp-trie is ready, so posting a RFC version
>> to get more feedback or suggestions about qp-trie. Specially feedback
>> about the following questions:
>>
>> (1) Application scenario for qp-trie
>> Andrii had proposed to re-implement lpm-trie by using qp-trie. The
>> advantage would be the speed up of lookup operations due to lower tree
>> depth of qp-trie. Maybe the performance of update could also be improved
>> although in cillium there is a big lock during lpm-trie update [1]. Is
>
> Well, using qp-trie approach as an internal implementation of lpm-trie
> is probably a major win already, what's wrong with that?

+1, just improving the LPM trie map type is definitely worthwhile!

>> there any other use cases for qp-trie ? Specially those cases which need
>> both ordering and memory efficiency or cases in which jhash() of htab
>> creates too much collisions and qp-trie lookup performances better than
>> hash-table lookup as shown below:
>
> I'm thinking about qp-trie as dynamically growable lookup table.
> That's a pretty big use case already. There is an RB tree proposal
> under review right now which would also satisfy dynamically growable
> criteria, but its focus is slightly different and it remains to be
> seen how convenient it will be as general-purpose resizable
> alternative to hashmap. But from benchmarks I've found online, RB tree
> will definitely use more memory than qp-trie.
>
> Ordered property seems also very useful, but I don't yet have specific
> use case for that. But once we have data structure like this in BPF,
> I'm sure use cases will pop up.

From the various discussions of the packet queueing schemes (for both
XDP and sch_bpf), it seems some sort of ordered data structure is going
to be useful for that. Not sure this is a great fit for that use case,
but may be worth trying out. Looking harder at this is close to the top
of my list, so will play around with this series as well :)

-Toke
Hou Tao Aug. 9, 2022, 8:25 a.m. UTC | #4
Hi,

On 8/3/2022 6:38 AM, Andrii Nakryiko wrote:
> On Tue, Jul 26, 2022 at 5:42 AM Hou Tao <houtao1@huawei.com> wrote:
>> Hi,
>>
> Hey, sorry I'm catching up on upstream and there is just too many
> complicated patch sets coming in, so it takes time to get through
> them.
>
> I think you did a great job with this implementation, it's certainly
> worth submitting as non-RFC for proper upstream review. I know that
> some people didn't get your patches, they got into spam somehow. So I
> think it would be great to just resubmit it as non-RFC so that it
> appears in patchworks as to-be-reviewew patches and hopefully will get
> a wider audience to review this.
Thanks for your encouragement.
The spam problem is due to some misconfiguration in our company's email
servers, and it cannot be fixed soon, so I have switched to another email
account and hope that will fix it.

> I've tried to answer some questions below, but would definitely like
> more people to chime in. I haven't went through implementation in
> details, but superficially it looks pretty clean and certainly ready
> for proper non-RFC review.
>
> One point about user API would be to maybe instead use bpf_dynptr as
> an interface for specifying variable-sized lookup key instead of
> hard-coded
>  bpf_qp_trie_key. Please check recent work by Joanne on bpf_dynptr.
Will check how to achieve that. Does that mean we will have a real
variable-length key instead of a fixed-allocated-length key with a
variable used length?
> In short: looks great, I think it's certainly worth adding this as BPF
> map type. Please submit as non-RFC and go through a proper review
> process. Looking forward (even if that means reviewing 1000k lines of
> dense algorithmic code :) ).
Thanks. :)
>> The initial motivation for qp-trie map is to reduce memory usage for
>> string keys special those with large differencies in length as
>> discussed in [0]. And as a big-endian lexicographical ordered map, it
>> can also be used for any binary data with fixed or variable length.
>>
>> Now the basic functionality of qp-trie is ready, so posting a RFC version
>> to get more feedback or suggestions about qp-trie. Specially feedback
>> about the following questions:
>>
>> (1) Application scenario for qp-trie
>> Andrii had proposed to re-implement lpm-trie by using qp-trie. The
>> advantage would be the speed up of lookup operations due to lower tree
>> depth of qp-trie. Maybe the performance of update could also be improved
>> although in cillium there is a big lock during lpm-trie update [1]. Is
> Well, using qp-trie approach as an internal implementation of lpm-trie
> is probably a major win already, what's wrong with that?
I am OK with that. My concern is that you had mentioned qp-trie could be
used to remove the global lock in lpm-trie, but there are still subtree
locks in qp-trie, so I am not sure whether the scalability of qp-trie is
still a problem. I searched for users of lpm-trie on GitHub and only found
Cilium. In its source code [0], the update and deletion procedures for
lpm-trie are protected by a big lock, so maybe the global lock is not a
big issue right now.

[0]:
https://github.com/cilium/cilium/blob/5145e31cd65db3361f6538d5f5f899440b769070/pkg/datapath/prefilter/prefilter.go#L123
>> there any other use cases for qp-trie ? Specially those cases which need
>> both ordering and memory efficiency or cases in which jhash() of htab
>> creates too much collisions and qp-trie lookup performances better than
>> hash-table lookup as shown below:
> I'm thinking about qp-trie as dynamically growable lookup table.
> That's a pretty big use case already. There is an RB tree proposal
> under review right now which would also satisfy dynamically growable
> criteria, but its focus is slightly different and it remains to be
> seen how convenient it will be as general-purpose resizable
> alternative to hashmap. But from benchmarks I've found online, RB tree
> will definitely use more memory than qp-trie.
qp-trie is also RCU-reader safe, which is hard for an rb-tree to achieve.
I also hope qp-trie will have better lookup performance than an rb-tree.
> Ordered property seems also very useful, but I don't yet have specific
> use case for that. But once we have data structure like this in BPF,
> I'm sure use cases will pop up.
For an ordered map, next() and prev() helpers would be useful, but there
is no such support now. Embedding an rb_node in a program-defined
structure may make it easier to implement such helpers. Maybe we can also
add such helpers for qp-trie as a next step?

>>   Randomly-generated binary data with variable length (length range=[1, 256] entries=16K)
>>
>>   htab lookup      (1  thread)    5.062 ± 0.004M/s (drops 0.002 ± 0.000M/s mem 8.125 MiB)
>>   htab lookup      (2  thread)   10.256 ± 0.017M/s (drops 0.006 ± 0.000M/s mem 8.114 MiB)
>>   htab lookup      (4  thread)   20.383 ± 0.006M/s (drops 0.009 ± 0.000M/s mem 8.117 MiB)
>>   htab lookup      (8  thread)   40.727 ± 0.093M/s (drops 0.010 ± 0.000M/s mem 8.123 MiB)
>>   htab lookup      (16 thread)   81.333 ± 0.311M/s (drops 0.020 ± 0.000M/s mem 8.122 MiB)
>>
>>   qp-trie lookup   (1  thread)   10.161 ± 0.008M/s (drops 0.006 ± 0.000M/s mem 4.847 MiB)
>>   qp-trie lookup   (2  thread)   20.287 ± 0.024M/s (drops 0.007 ± 0.000M/s mem 4.828 MiB)
>>   qp-trie lookup   (4  thread)   40.784 ± 0.020M/s (drops 0.015 ± 0.000M/s mem 4.071 MiB)
>>   qp-trie lookup   (8  thread)   81.165 ± 0.013M/s (drops 0.040 ± 0.000M/s mem 4.045 MiB)
>>   qp-trie lookup   (16 thread)  159.955 ± 0.014M/s (drops 0.108 ± 0.000M/s mem 4.495 MiB)
>>
> Kind of surprised that qp-tire is twice as fast as hashmap. Do you
> have any idea why hashmap is slower? Was htab pre-allocated? What was
> it's max_entries?
The hash table is not pre-allocated, and pre-allocation doesn't help
lookup performance in this benchmark. max_entries is 16K, and if
max_entries is set to 4K or 128K, the performance win of qp-trie is
almost the same (worse when the data set is bigger).

The better performance is due to two reasons:
(1) The height of the qp-trie is low thanks to the randomly-generated data
The top-most branch nodes in the qp-trie are almost full and have 16 or 17
child nodes. When using /proc/kallsyms as the input data set, the top-most
branch nodes have only about 2~4 child nodes.

(2) Large difference between the used length and the full size of the key
The max key size is 256, but the range of valid data lengths is [1, 255]
and the unused part is zeroed. An htab lookup has to compare all 256 bytes
for each candidate during the bucket-list traversal, but qp-trie only
needs to compare the used length of the key.
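
For illustration, the benchmark key roughly looks like this (field names
are assumptions, just to show why the zeroed tail matters):

  #include <linux/types.h>

  /* the buffer is always 256 bytes; only the first 'len' bytes are valid
   * and the tail is zero-filled. htab lookup memcmp()s the whole 256-byte
   * key for every candidate in the bucket list, while the qp-trie walk
   * only touches the bytes that are actually used.
   */
  struct bench_key {
          __u32 len;      /* valid length, in [1, 255] */
          __u8 data[256];
  };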

>>   * non-zero drops is due to duplicated keys in generated keys.
>>
>> (2) more fine-grained lock in qp-trie
>> Now qp-trie is divided into 256 sub-trees by using the first character of
> character -> byte, it's not always strings, right?
Yes, qp-trie supports an arbitrary byte array as the key.
>> key and one sub-tree is protected one spinlock. From the data below,
>> although the update/delete speed of qp-trie is slower compare with hash
>> table, but it scales similar with hash table. So maybe 256-locks is a
>> good enough solution ?
>>
>>   Strings in /proc/kallsyms
>>   htab update      (1  thread)    2.850 ± 0.129M/s (drops 0.000 ± 0.000M/s mem 33.564 MiB)
>>   htab update      (2  thread)    4.363 ± 0.031M/s (drops 0.000 ± 0.000M/s mem 33.563 MiB)
>>   htab update      (4  thread)    6.306 ± 0.096M/s (drops 0.000 ± 0.000M/s mem 33.718 MiB)
>>   htab update      (8  thread)    6.611 ± 0.026M/s (drops 0.000 ± 0.000M/s mem 33.627 MiB)
>>   htab update      (16 thread)    6.390 ± 0.015M/s (drops 0.000 ± 0.000M/s mem 33.564 MiB)
>>   qp-trie update   (1  thread)    1.157 ± 0.099M/s (drops 0.000 ± 0.000M/s mem 18.333 MiB)
>>   qp-trie update   (2  thread)    1.920 ± 0.062M/s (drops 0.000 ± 0.000M/s mem 18.293 MiB)
>>   qp-trie update   (4  thread)    2.630 ± 0.050M/s (drops 0.000 ± 0.000M/s mem 18.472 MiB)
>>   qp-trie update   (8  thread)    3.171 ± 0.027M/s (drops 0.000 ± 0.000M/s mem 18.301 MiB)
>>   qp-trie update   (16 thread)    3.782 ± 0.036M/s (drops 0.000 ± 0.000M/s mem 19.040 MiB)
> qp-trie being slower than htab matches my expectation and what I
> observed when trying to use qp-trie with libbpf. But I think for a lot
> of cases memory vs CPU tradeoff, coupled with ability to dynamically
> grow qp-trie will make this data structure worthwhile.
>
>
>> (3) Improve memory efficiency further
>> When using strings in BTF string section as a data set for qp-trie, the
>> slab memory usage showed in cgroup memory.stats file is about 11MB for
>> qp-trie and 15MB for hash table as shown below. However the theoretical
>> memory usage for qp-trie is ~6.8MB (is ~4.9MB if removing "parent" & "rcu"
>> fields from qp_trie_branch) and the extra memory usage (about 38% of total
>> usage) mainly comes from internal fragment in slab (namely 2^n alignment
>> for allocation) and overhead in kmem-cgroup accounting. We can reduce the
>> internal fragment by creating separated kmem_cache for qp_trie_branch with
>> different child nodes, but not sure whether it is worthy or not.
>>
> Please CC Paul McKenney (paulmck@kernel.org) for your non-RFC patch
> set and maybe he has some good idea how to avoid having rcu_head in
> each leaf node. Maybe some sort of per-CPU queue of to-be-rcu-freed
> elements, so that we don't have to keep 16 bytes in each leaf and
> branch node?
Will cc Paul. I found the head-less kfree_rcu() which only needs a
pointer, but its downside is that the calling context needs to be
sleepable, so it doesn't fit. And it seems the BPF-specific memory
allocator from Alexei could also help to remove the rcu_head from bpf map
elements, right?
>> And in order to prevent allocating a rcu_head for each leaf node, now only
>> branch node is RCU-freed, so when replacing a leaf node, a new branch node
>> and a new leaf node will be allocated instead of replacing the old leaf
>> node and RCU-freed the old leaf node. Also not sure whether or not it is
>> worthy.
>>
>>   Strings in BTF string section (entries=115980):
>>   htab lookup      (1  thread)    9.889 ± 0.006M/s (drops 0.000 ± 0.000M/s mem 15.069 MiB)
>>   qp-trie lookup   (1  thread)    5.132 ± 0.002M/s (drops 0.000 ± 0.000M/s mem 10.721 MiB)
>>
>>   All files under linux kernel source directory (entries=74359):
>>   htab lookup      (1  thread)    8.418 ± 0.077M/s (drops 0.000 ± 0.000M/s mem 14.207 MiB)
>>   qp-trie lookup   (1  thread)    4.966 ± 0.003M/s (drops 0.000 ± 0.000M/s mem 9.355 MiB)
>>
>>   Domain names for Alexa top million web site (entries=1000000):
>>   htab lookup      (1  thread)    4.551 ± 0.043M/s (drops 0.000 ± 0.000M/s mem 190.761 MiB)
>>   qp-trie lookup   (1  thread)    2.804 ± 0.017M/s (drops 0.000 ± 0.000M/s mem 83.194 MiB)
>>
>> Comments and suggestions are always welcome.
>>
>> Regards,
>> Tao
>>
>> [0]: https://lore.kernel.org/bpf/CAEf4Bzb7keBS8vXgV5JZzwgNGgMV0X3_guQ_m9JW3X6fJBDpPQ@mail.gmail.com/
>> [1]: https://github.com/cilium/cilium/blob/5145e31cd65db3361f6538d5f5f899440b769070/pkg/datapath/prefilter/prefilter.go#L123
>>
>> Hou Tao (3):
>>   bpf: Add support for qp-trie map
>>   selftests/bpf: add a simple test for qp-trie
>>   selftests/bpf: add benchmark for qp-trie map
>>
>
>>  include/linux/bpf_types.h                     |    1 +
>>  include/uapi/linux/bpf.h                      |    8 +
>>  kernel/bpf/Makefile                           |    1 +
>>  kernel/bpf/bpf_qp_trie.c                      | 1064 +++++++++++++++++
>>  tools/include/uapi/linux/bpf.h                |    8 +
>>  tools/testing/selftests/bpf/Makefile          |    5 +-
>>  tools/testing/selftests/bpf/bench.c           |   10 +
>>  .../selftests/bpf/benchs/bench_qp_trie.c      |  499 ++++++++
>>  .../selftests/bpf/benchs/run_bench_qp_trie.sh |   55 +
>>  .../selftests/bpf/prog_tests/str_key.c        |   69 ++
>>  .../selftests/bpf/progs/qp_trie_bench.c       |  218 ++++
>>  tools/testing/selftests/bpf/progs/str_key.c   |   85 ++
>>  12 files changed, 2022 insertions(+), 1 deletion(-)
>>  create mode 100644 kernel/bpf/bpf_qp_trie.c
>>  create mode 100644 tools/testing/selftests/bpf/benchs/bench_qp_trie.c
>>  create mode 100755 tools/testing/selftests/bpf/benchs/run_bench_qp_trie.sh
>>  create mode 100644 tools/testing/selftests/bpf/prog_tests/str_key.c
>>  create mode 100644 tools/testing/selftests/bpf/progs/qp_trie_bench.c
>>  create mode 100644 tools/testing/selftests/bpf/progs/str_key.c
>>
>> --
>> 2.29.2
>>
> .
Andrii Nakryiko Aug. 16, 2022, 5:33 a.m. UTC | #5
On Tue, Aug 9, 2022 at 1:25 AM houtao <houtao@huaweicloud.com> wrote:
>
> Hi,
>
> On 8/3/2022 6:38 AM, Andrii Nakryiko wrote:
> > On Tue, Jul 26, 2022 at 5:42 AM Hou Tao <houtao1@huawei.com> wrote:
> >> Hi,
> >>
> > Hey, sorry I'm catching up on upstream and there is just too many
> > complicated patch sets coming in, so it takes time to get through
> > them.
> >
> > I think you did a great job with this implementation, it's certainly
> > worth submitting as non-RFC for proper upstream review. I know that
> > some people didn't get your patches, they got into spam somehow. So I
> > think it would be great to just resubmit it as non-RFC so that it
> > appears in patchworks as to-be-reviewew patches and hopefully will get
> > a wider audience to review this.
> Thanks for your encouragement.
> The spam email is due to some misconfiguration in email servers of our company,
> but it can not be fixed soon, so I change to use another email account and hope
> that will fix the spam problem.
>
> > I've tried to answer some questions below, but would definitely like
> > more people to chime in. I haven't went through implementation in
> > details, but superficially it looks pretty clean and certainly ready
> > for proper non-RFC review.
> >
> > One point about user API would be to maybe instead use bpf_dynptr as
> > an interface for specifying variable-sized lookup key instead of
> > hard-coded
> >  bpf_qp_trie_key. Please check recent work by Joanne on bpf_dynptr.
> Will check how to archive that. Does that means we will have real
> variable-length key instead of fixed-allocated-length key with
> variable-used-length ?

Yes, check bpf_dynptr patches. We are still working on expanding
bpf_dynptr use cases and API, but all the pieces needed for
variable-sized keys are already in place.
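
(For illustration only: bpf_dynptr_from_mem() is an existing helper, but a
map lookup that takes a dynptr key is just the idea being discussed here,
so the final step below is hypothetical.)

  #include <linux/bpf.h>
  #include <bpf/bpf_helpers.h>

  char buf[256];

  SEC("tp/syscalls/sys_enter_getpgid")
  int lookup_with_dynptr(void *ctx)
  {
          struct bpf_dynptr key;
          const __u32 used_len = 16;      /* bytes of buf that are valid */

          /* wrap only the used part of the buffer in a local dynptr */
          bpf_dynptr_from_mem(buf, used_len, 0, &key);

          /* hypothetical: a qp-trie lookup variant could consume &key */
          return 0;
  }

  char LICENSE[] SEC("license") = "GPL";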

> > In short: looks great, I think it's certainly worth adding this as BPF
> > map type. Please submit as non-RFC and go through a proper review
> > process. Looking forward (even if that means reviewing 1000k lines of
> > dense algorithmic code :) ).
> Thanks. :)
> >> The initial motivation for qp-trie map is to reduce memory usage for
> >> string keys special those with large differencies in length as
> >> discussed in [0]. And as a big-endian lexicographical ordered map, it
> >> can also be used for any binary data with fixed or variable length.
> >>
> >> Now the basic functionality of qp-trie is ready, so posting a RFC version
> >> to get more feedback or suggestions about qp-trie. Specially feedback
> >> about the following questions:
> >>
> >> (1) Application scenario for qp-trie
> >> Andrii had proposed to re-implement lpm-trie by using qp-trie. The
> >> advantage would be the speed up of lookup operations due to lower tree
> >> depth of qp-trie. Maybe the performance of update could also be improved
> >> although in cillium there is a big lock during lpm-trie update [1]. Is
> > Well, using qp-trie approach as an internal implementation of lpm-trie
> > is probably a major win already, what's wrong with that?
> I am OK with that. My concern is that you had mentioned that qp-trie could be
> used to remove the global lock in lpm-trie, but now there are still subtree
> locks in qp-trie, so I am not sure whether the scalability of qp-trie is still a
> problem. I had searched users of lpm-trie in github and only found Cilium. In
> its source code [0], the update and deletion procedure of lpm-trie are protected
> by a big lock, so may be the global lock is not a big issue right now.
>

I think typically users just stay away from lpm-trie due to its
slowness. Once this is addressed, we'll have more use cases, as the
semantics of this BPF map is useful in a bunch of practical use cases.
So yes, it's very much an issue, overall.

> [0]:
> https://github.com/cilium/cilium/blob/5145e31cd65db3361f6538d5f5f899440b769070/pkg/datapath/prefilter/prefilter.go#L123
> >> there any other use cases for qp-trie ? Specially those cases which need
> >> both ordering and memory efficiency or cases in which jhash() of htab
> >> creates too much collisions and qp-trie lookup performances better than
> >> hash-table lookup as shown below:
> > I'm thinking about qp-trie as dynamically growable lookup table.
> > That's a pretty big use case already. There is an RB tree proposal
> > under review right now which would also satisfy dynamically growable
> > criteria, but its focus is slightly different and it remains to be
> > seen how convenient it will be as general-purpose resizable
> > alternative to hashmap. But from benchmarks I've found online, RB tree
> > will definitely use more memory than qp-trie.
> qp-trie is also RCU-read safe which is hard for rb-tree to archive. I also hope
> qp-trie will have better lookup performance than rb-tree.
> > Ordered property seems also very useful, but I don't yet have specific
> > use case for that. But once we have data structure like this in BPF,
> > I'm sure use cases will pop up.
> For ordered map, next() and prev() helpers will be useful, but now these is no
> such support. Embedded rb_node in program-defined structure may make it easier
> to implement such helpers. Maybe we can also add such helpers for qp-trie in the
> next-step ?
>

See the bpf_for_each_map_elem() helper: we already have the ability to
iterate BPF map elements in *some* order, but for qp-trie those
elements would be guaranteed to be in sorted order.
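
A minimal sketch of that interface (using a plain hash map here since the
qp-trie UAPI is not final; with a qp-trie map the callback would simply
see keys in sorted order):

  #include <linux/bpf.h>
  #include <bpf/bpf_helpers.h>

  struct {
          __uint(type, BPF_MAP_TYPE_HASH);
          __uint(max_entries, 16384);
          __type(key, __u64);
          __type(value, __u64);
  } demo_map SEC(".maps");

  /* called once per element; returning 0 continues the iteration */
  static long count_elem(struct bpf_map *map, __u64 *key, __u64 *value,
                         long *cnt)
  {
          (*cnt)++;
          return 0;
  }

  SEC("tp/syscalls/sys_enter_getpgid")
  int iterate_map(void *ctx)
  {
          long cnt = 0;

          bpf_for_each_map_elem(&demo_map, count_elem, &cnt, 0);
          return 0;
  }

  char LICENSE[] SEC("license") = "GPL";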

> >>   Randomly-generated binary data with variable length (length range=[1, 256] entries=16K)
> >>
> >>   htab lookup      (1  thread)    5.062 ± 0.004M/s (drops 0.002 ± 0.000M/s mem 8.125 MiB)
> >>   htab lookup      (2  thread)   10.256 ± 0.017M/s (drops 0.006 ± 0.000M/s mem 8.114 MiB)
> >>   htab lookup      (4  thread)   20.383 ± 0.006M/s (drops 0.009 ± 0.000M/s mem 8.117 MiB)
> >>   htab lookup      (8  thread)   40.727 ± 0.093M/s (drops 0.010 ± 0.000M/s mem 8.123 MiB)
> >>   htab lookup      (16 thread)   81.333 ± 0.311M/s (drops 0.020 ± 0.000M/s mem 8.122 MiB)
> >>
> >>   qp-trie lookup   (1  thread)   10.161 ± 0.008M/s (drops 0.006 ± 0.000M/s mem 4.847 MiB)
> >>   qp-trie lookup   (2  thread)   20.287 ± 0.024M/s (drops 0.007 ± 0.000M/s mem 4.828 MiB)
> >>   qp-trie lookup   (4  thread)   40.784 ± 0.020M/s (drops 0.015 ± 0.000M/s mem 4.071 MiB)
> >>   qp-trie lookup   (8  thread)   81.165 ± 0.013M/s (drops 0.040 ± 0.000M/s mem 4.045 MiB)
> >>   qp-trie lookup   (16 thread)  159.955 ± 0.014M/s (drops 0.108 ± 0.000M/s mem 4.495 MiB)
> >>
> > Kind of surprised that qp-tire is twice as fast as hashmap. Do you
> > have any idea why hashmap is slower? Was htab pre-allocated? What was
> > it's max_entries?
> The hash table is not pre-allocated and pre-allocation doesn't help for lookup
> performance in the benchmark. max_entries is 16K, and if set max_entries to 4K
> or 128K, the performance win of qp-trie is almost the same (worse when data set
> is bigger).
>
> The better performance is due to two reasons:
> (1) height of qp-trie is low due to the randomly-generated data
> The top-most branch nodes in qp-trie are almost full and have 16 or 17 child
> nodes. If using /proc/kallsyms as input data set, the number of child nodes of
> top-most branch nodes are just about 2~4.
>

right, makes sense, dense data with high branching factor is best
scenario for trie-based data structures

> (2) large difference between used length of key
> The max key size is 256, but the range of valid data length is [1, 255] and the
> unused part is zeroed. For htab-lookup, it has to compare 256 bytes each time
> during list traverse, but qp-trie only needs to compare the used length of key.

yep, makes sense as well

>
> >>   * non-zero drops is due to duplicated keys in generated keys.
> >>
> >> (2) more fine-grained lock in qp-trie
> >> Now qp-trie is divided into 256 sub-trees by using the first character of
> > character -> byte, it's not always strings, right?
> Yes, qp-trie supports arbitrary bytes array as the key.
> >> key and one sub-tree is protected one spinlock. From the data below,
> >> although the update/delete speed of qp-trie is slower compare with hash
> >> table, but it scales similar with hash table. So maybe 256-locks is a
> >> good enough solution ?
> >>
> >>   Strings in /proc/kallsyms
> >>   htab update      (1  thread)    2.850 ± 0.129M/s (drops 0.000 ± 0.000M/s mem 33.564 MiB)
> >>   htab update      (2  thread)    4.363 ± 0.031M/s (drops 0.000 ± 0.000M/s mem 33.563 MiB)
> >>   htab update      (4  thread)    6.306 ± 0.096M/s (drops 0.000 ± 0.000M/s mem 33.718 MiB)
> >>   htab update      (8  thread)    6.611 ± 0.026M/s (drops 0.000 ± 0.000M/s mem 33.627 MiB)
> >>   htab update      (16 thread)    6.390 ± 0.015M/s (drops 0.000 ± 0.000M/s mem 33.564 MiB)
> >>   qp-trie update   (1  thread)    1.157 ± 0.099M/s (drops 0.000 ± 0.000M/s mem 18.333 MiB)
> >>   qp-trie update   (2  thread)    1.920 ± 0.062M/s (drops 0.000 ± 0.000M/s mem 18.293 MiB)
> >>   qp-trie update   (4  thread)    2.630 ± 0.050M/s (drops 0.000 ± 0.000M/s mem 18.472 MiB)
> >>   qp-trie update   (8  thread)    3.171 ± 0.027M/s (drops 0.000 ± 0.000M/s mem 18.301 MiB)
> >>   qp-trie update   (16 thread)    3.782 ± 0.036M/s (drops 0.000 ± 0.000M/s mem 19.040 MiB)
> > qp-trie being slower than htab matches my expectation and what I
> > observed when trying to use qp-trie with libbpf. But I think for a lot
> > of cases memory vs CPU tradeoff, coupled with ability to dynamically
> > grow qp-trie will make this data structure worthwhile.
> >
> >
> >> (3) Improve memory efficiency further
> >> When using strings in BTF string section as a data set for qp-trie, the
> >> slab memory usage showed in cgroup memory.stats file is about 11MB for
> >> qp-trie and 15MB for hash table as shown below. However the theoretical
> >> memory usage for qp-trie is ~6.8MB (is ~4.9MB if removing "parent" & "rcu"
> >> fields from qp_trie_branch) and the extra memory usage (about 38% of total
> >> usage) mainly comes from internal fragment in slab (namely 2^n alignment
> >> for allocation) and overhead in kmem-cgroup accounting. We can reduce the
> >> internal fragment by creating separated kmem_cache for qp_trie_branch with
> >> different child nodes, but not sure whether it is worthy or not.
> >>
> > Please CC Paul McKenney (paulmck@kernel.org) for your non-RFC patch
> > set and maybe he has some good idea how to avoid having rcu_head in
> > each leaf node. Maybe some sort of per-CPU queue of to-be-rcu-freed
> > elements, so that we don't have to keep 16 bytes in each leaf and
> > branch node?
> Will cc Paul. I found a head-less kfree_rcu() which only needs a pointer, but
> its downside is the calling context needs to sleepable, so it doesn't fit. And
> it seems that BPF specific memory allocator from Alexei can also help to remove
> rcu_head in bpf map element, right ?

Hm.. don't know, maybe?

> >> And in order to prevent allocating a rcu_head for each leaf node, now only
> >> branch node is RCU-freed, so when replacing a leaf node, a new branch node
> >> and a new leaf node will be allocated instead of replacing the old leaf
> >> node and RCU-freed the old leaf node. Also not sure whether or not it is
> >> worthy.
> >>
> >>   Strings in BTF string section (entries=115980):
> >>   htab lookup      (1  thread)    9.889 ± 0.006M/s (drops 0.000 ± 0.000M/s mem 15.069 MiB)
> >>   qp-trie lookup   (1  thread)    5.132 ± 0.002M/s (drops 0.000 ± 0.000M/s mem 10.721 MiB)
> >>
> >>   All files under linux kernel source directory (entries=74359):
> >>   htab lookup      (1  thread)    8.418 ± 0.077M/s (drops 0.000 ± 0.000M/s mem 14.207 MiB)
> >>   qp-trie lookup   (1  thread)    4.966 ± 0.003M/s (drops 0.000 ± 0.000M/s mem 9.355 MiB)
> >>
> >>   Domain names for Alexa top million web site (entries=1000000):
> >>   htab lookup      (1  thread)    4.551 ± 0.043M/s (drops 0.000 ± 0.000M/s mem 190.761 MiB)
> >>   qp-trie lookup   (1  thread)    2.804 ± 0.017M/s (drops 0.000 ± 0.000M/s mem 83.194 MiB)
> >>
> >> Comments and suggestions are always welcome.
> >>
> >> Regards,
> >> Tao
> >>
> >> [0]: https://lore.kernel.org/bpf/CAEf4Bzb7keBS8vXgV5JZzwgNGgMV0X3_guQ_m9JW3X6fJBDpPQ@mail.gmail.com/
> >> [1]: https://github.com/cilium/cilium/blob/5145e31cd65db3361f6538d5f5f899440b769070/pkg/datapath/prefilter/prefilter.go#L123
> >>
> >> Hou Tao (3):
> >>   bpf: Add support for qp-trie map
> >>   selftests/bpf: add a simple test for qp-trie
> >>   selftests/bpf: add benchmark for qp-trie map
> >>
> >
> >>  include/linux/bpf_types.h                     |    1 +
> >>  include/uapi/linux/bpf.h                      |    8 +
> >>  kernel/bpf/Makefile                           |    1 +
> >>  kernel/bpf/bpf_qp_trie.c                      | 1064 +++++++++++++++++
> >>  tools/include/uapi/linux/bpf.h                |    8 +
> >>  tools/testing/selftests/bpf/Makefile          |    5 +-
> >>  tools/testing/selftests/bpf/bench.c           |   10 +
> >>  .../selftests/bpf/benchs/bench_qp_trie.c      |  499 ++++++++
> >>  .../selftests/bpf/benchs/run_bench_qp_trie.sh |   55 +
> >>  .../selftests/bpf/prog_tests/str_key.c        |   69 ++
> >>  .../selftests/bpf/progs/qp_trie_bench.c       |  218 ++++
> >>  tools/testing/selftests/bpf/progs/str_key.c   |   85 ++
> >>  12 files changed, 2022 insertions(+), 1 deletion(-)
> >>  create mode 100644 kernel/bpf/bpf_qp_trie.c
> >>  create mode 100644 tools/testing/selftests/bpf/benchs/bench_qp_trie.c
> >>  create mode 100755 tools/testing/selftests/bpf/benchs/run_bench_qp_trie.sh
> >>  create mode 100644 tools/testing/selftests/bpf/prog_tests/str_key.c
> >>  create mode 100644 tools/testing/selftests/bpf/progs/qp_trie_bench.c
> >>  create mode 100644 tools/testing/selftests/bpf/progs/str_key.c
> >>
> >> --
> >> 2.29.2
> >>
> > .
>