[PATCHv17,bpf-next,0/6] xdp: add a new helper for dev map multicast support

Message ID 20210125124516.3098129-1-liuhangbin@gmail.com (mailing list archive)
Series: xdp: add a new helper for dev map multicast support

Message

Hangbin Liu Jan. 25, 2021, 12:45 p.m. UTC
This patch set is for XDP multicast support, which has been discussed before[0].
The goal is to be able to implement an OVS-like data plane in XDP, i.e.,
a software switch that can forward XDP frames to multiple ports.

To achieve this, an application needs to specify a group of interfaces
to forward a packet to. It is also common to want to exclude one or more
physical interfaces from the forwarding operation - e.g., to forward a
packet to all interfaces in the multicast group except the interface it
arrived on. While this could be done simply by adding more groups, this
quickly leads to a combinatorial explosion in the number of groups an
application has to maintain.

To avoid the combinatorial explosion, we propose to include the ability
to specify an "exclude group" as part of the forwarding operation. This
needs to be a group (instead of just a single port index), because there
may be multiple interfaces you want to exclude.

Thus, the logical forwarding operation becomes a "set difference"
operation, i.e. "forward to all ports in group A that are not also in
group B". This series implements such an operation using device maps to
represent the groups. This means that the XDP program specifies two
device maps, one containing the list of netdevs to redirect to, and the
other containing the exclude list.
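
As an illustration (this sketch is mine, not code from the patches; the
map names and sizes are made up, and note the exclude map must be a
BPF_MAP_TYPE_DEVMAP_HASH, per the v7 note below), the two groups could
be declared in the XDP program like this:

#include <linux/bpf.h>
#include <bpf/bpf_helpers.h>

struct {
    __uint(type, BPF_MAP_TYPE_DEVMAP);
    __uint(key_size, sizeof(int));
    __uint(value_size, sizeof(int));
    __uint(max_entries, 32);
} forward_map SEC(".maps");

struct {
    __uint(type, BPF_MAP_TYPE_DEVMAP_HASH);
    __uint(key_size, sizeof(int));
    __uint(value_size, sizeof(int));
    __uint(max_entries, 32);
} exclude_map SEC(".maps");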

To achieve this, I implement a new helper, bpf_redirect_map_multi(),
which accepts two maps: the forwarding map and the exclude map. If the
user doesn't want an exclude map and simply wants to stop redirecting
back to the ingress device, they can use the flag BPF_F_EXCLUDE_INGRESS.
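
Using the maps sketched above, a minimal XDP program calling the new
helper could look like this (section and function names are
illustrative):

SEC("xdp_redirect_multi")
int xdp_redirect_multi_prog(struct xdp_md *ctx)
{
    /* Forward to every device in forward_map that is not also in
     * exclude_map; BPF_F_EXCLUDE_INGRESS additionally skips the
     * ingress device. The exclude map may be NULL if the flag
     * alone is enough.
     */
    return bpf_redirect_map_multi(&forward_map, &exclude_map,
                                  BPF_F_EXCLUDE_INGRESS);
}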

The 1st patch is Jesper's change to run the devmap xdp_prog later, in the
bulking step.
The 2nd patch adds a new bpf arg to allow a NULL map pointer.
The 3rd patch adds the new bpf_redirect_map_multi() helper.
Patches 4-6 are the usage sample and tests.
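
For reference, the helper proto ties patch 2 and patch 3 together by
using the new argument type for the exclude map. This is a rough
reconstruction from the changelog below, not a copy of the patch:

static const struct bpf_func_proto bpf_xdp_redirect_map_multi_proto = {
    .func      = bpf_xdp_redirect_map_multi,
    .gpl_only  = false,
    .ret_type  = RET_INTEGER,
    .arg1_type = ARG_CONST_MAP_PTR,
    .arg2_type = ARG_CONST_MAP_PTR_OR_NULL, /* ex_map may be NULL */
    .arg3_type = ARG_ANYTHING,
};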

I did the same perf tests as before with the following topology:

---------------------             ---------------------
| Host A (i40e 10G) |  ---------- | eno1(i40e 10G)    |
---------------------             |                   |
                                  |   Host B          |
---------------------             |                   |
| Host C (i40e 10G) |  ---------- | eno2(i40e 10G)    |
---------------------    vlan2    |          -------- |
                                  | veth1 -- | veth0| |
                                  |          -------- |
                                  ---------------------
On Host A:
# pktgen/pktgen_sample03_burst_single_flow.sh -i eno1 -d $dst_ip -m $dst_mac -s 64

On Host B (Intel(R) Xeon(R) CPU E5-2690 v3 @ 2.60GHz, 128G memory):
use xdp_redirect_map and xdp_redirect_map_multi from samples/bpf for
testing. The veth0 in the netns loads a dummy drop program. The
forward_map max_entries in xdp_redirect_map_multi is modified to 4.
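
For context, the maps are populated from user space following the usual
devmap conventions. A rough sketch (the map fds would come from libbpf
after loading the object; helper names here are mine):

#include <net/if.h>
#include <bpf/bpf.h>

/* DEVMAP forward map: key is an arbitrary slot index, value is the
 * egress ifindex.
 */
static int add_forward_port(int map_fd, unsigned int slot, const char *name)
{
    unsigned int ifindex = if_nametoindex(name);

    return bpf_map_update_elem(map_fd, &slot, &ifindex, 0);
}

/* DEVMAP_HASH exclude map: keyed by the ifindex itself, so the kernel
 * can look a device up directly instead of walking the map (see the
 * v8 note below).
 */
static int add_exclude_port(int map_fd, const char *name)
{
    unsigned int ifindex = if_nametoindex(name);

    return bpf_map_update_elem(map_fd, &ifindex, &ifindex, 0);
}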

Here are the perf results with 5.10 rc6. Numbers are in Mpps (millions
of packets per second); there is about +/- 0.1M deviation for native
testing:
Version             | Test                                    | Generic | Native | Native + 2nd
5.10 rc6            | xdp_redirect_map        i40e->i40e      |    2.0M |   9.1M |  8.0M
5.10 rc6            | xdp_redirect_map        i40e->veth      |    1.7M |  11.0M |  9.7M
5.10 rc6 + patch1   | xdp_redirect_map        i40e->i40e      |    2.0M |   9.5M |  7.5M
5.10 rc6 + patch1   | xdp_redirect_map        i40e->veth      |    1.7M |  11.6M |  9.1M
5.10 rc6 + patch1-6 | xdp_redirect_map        i40e->i40e      |    2.0M |   9.5M |  7.5M
5.10 rc6 + patch1-6 | xdp_redirect_map        i40e->veth      |    1.7M |  11.6M |  9.1M
5.10 rc6 + patch1-6 | xdp_redirect_map_multi  i40e->i40e      |    1.7M |   7.8M |  6.4M
5.10 rc6 + patch1-6 | xdp_redirect_map_multi  i40e->veth      |    1.4M |   9.3M |  7.5M
5.10 rc6 + patch1-6 | xdp_redirect_map_multi  i40e->i40e+veth |    1.0M |   3.2M |  2.7M

Last but not least, thanks a lot to Toke, Jesper, Jiri and Eelco for
suggestions and help on implementation.

[0] https://xdp-project.net/#Handling-multicast

v17:
For patch 01:
a) rename to_sent to to_send.
b) clear bq dev_rx, xdp_prog and flush_node in __dev_flush().

v16:
refactor bq_xmit_all logic and remove error label for patch 01

v15:
Update bq_xmit_all() logic for patch 01.
Add some comments and remove useless variable for patch 03.
Use bpf_object__find_program_by_title() for patches 04 and 06.

v14:
No code update, just rebase the code on latest bpf-next

v13:
Pass in xdp_prog through __xdp_enqueue() for patch 01. Update related
code in patch 03.

v12:
Add Jesper's xdp_prog patch, rebase my work on it and the latest bpf-next.
Add 2nd xdp_prog test on the sample and selftests.

v11:
Fix bpf_redirect_map_multi() helper description typo.
Add loop limit for devmap_get_next_obj() and dev_map_redirect_multi().

v10:
Rebase the code to latest bpf-next.
Update helper bpf_xdp_redirect_map_multi()
- No need to check map pointer as we will do the check in verifier.

v9:
Update helper bpf_xdp_redirect_map_multi()
- Use ARG_CONST_MAP_PTR_OR_NULL for helper arg2

v8:
a) Update function dev_in_exclude_map():
   - remove the duplicate ex_map map_type check
   - look up the element in the dev map by the obj dev index directly
     instead of looping over the whole map

v7:
a) Fix helper flag check
b) Limit the *ex_map* to use DEVMAP_HASH only and update function
   dev_in_exclude_map() to get better performance.

v6: converted helper return types from int to long

v5:
a) Check devmap_get_next_key() return value.
b) Pass through flags to __bpf_tx_xdp_map() instead of bool value.
c) In function dev_map_enqueue_multi(), consume xdpf for the last
   obj instead of the first one.
d) Update helper description and code comments to explain that we
   use NULL target value to distinguish multicast and unicast
   forwarding.
e) Update memory model, memory id and frame_sz in xdpf_clone().
f) Split the tests from sample and add a bpf kernel selftest patch.

v4: Fix bpf_xdp_redirect_map_multi_proto arg2_type typo

v3: Based on Toke's suggestion, do the following update
a) Update bpf_redirect_map_multi() description in bpf.h.
b) Fix exclude_ifindex checking order in dev_in_exclude_map().
c) Fix one more xdpf clone in dev_map_enqueue_multi().
d) Go find the next one in dev_map_enqueue_multi() if the interface is
   not able to forward, instead of aborting the whole loop.
e) Remove READ_ONCE/WRITE_ONCE for ex_map.

v2: Add new syscall bpf_xdp_redirect_map_multi() which could accept
include/exclude maps directly.

Hangbin Liu (5):
  bpf: add a new bpf argument type ARG_CONST_MAP_PTR_OR_NULL
  xdp: add a new helper for dev map multicast support
  sample/bpf: add xdp_redirect_map_multicast test
  selftests/bpf: Add verifier tests for bpf arg
    ARG_CONST_MAP_PTR_OR_NULL
  selftests/bpf: add xdp_redirect_multi test

Jesper Dangaard Brouer (1):
  bpf: run devmap xdp_prog on flush instead of bulk enqueue

 include/linux/bpf.h                           |  21 ++
 include/linux/filter.h                        |   1 +
 include/net/xdp.h                             |   1 +
 include/uapi/linux/bpf.h                      |  28 ++
 kernel/bpf/devmap.c                           | 262 +++++++++++----
 kernel/bpf/verifier.c                         |  16 +-
 net/core/filter.c                             | 124 ++++++-
 net/core/xdp.c                                |  29 ++
 samples/bpf/Makefile                          |   3 +
 samples/bpf/xdp_redirect_map_multi_kern.c     |  87 +++++
 samples/bpf/xdp_redirect_map_multi_user.c     | 302 ++++++++++++++++++
 tools/include/uapi/linux/bpf.h                |  28 ++
 tools/testing/selftests/bpf/Makefile          |   3 +-
 .../bpf/progs/xdp_redirect_multi_kern.c       | 111 +++++++
 tools/testing/selftests/bpf/test_verifier.c   |  22 +-
 .../selftests/bpf/test_xdp_redirect_multi.sh  | 208 ++++++++++++
 .../testing/selftests/bpf/verifier/map_ptr.c  |  70 ++++
 .../selftests/bpf/xdp_redirect_multi.c        | 252 +++++++++++++++
 18 files changed, 1501 insertions(+), 67 deletions(-)
 create mode 100644 samples/bpf/xdp_redirect_map_multi_kern.c
 create mode 100644 samples/bpf/xdp_redirect_map_multi_user.c
 create mode 100644 tools/testing/selftests/bpf/progs/xdp_redirect_multi_kern.c
 create mode 100755 tools/testing/selftests/bpf/test_xdp_redirect_multi.sh
 create mode 100644 tools/testing/selftests/bpf/xdp_redirect_multi.c

Comments

Hangbin Liu Feb. 4, 2021, 12:14 a.m. UTC | #1
Hi Daniel, Alexei,

It has been one week since Maciej, Toke and John's reviews/acks. What
should I do to make progress on this patch set?

Thanks
Hangbin
On Mon, Jan 25, 2021 at 08:45:10PM +0800, Hangbin Liu wrote:
> [full quote of the v17 cover letter snipped]
John Fastabend Feb. 4, 2021, 2:53 a.m. UTC | #2
Hangbin Liu wrote:
> Hi Daniel, Alexei,
> 
> It has been one week since Maciej, Toke and John's reviews/acks. What
> should I do to make progress on this patch set?
> 

Patchwork is usually the first place to check:

 https://patchwork.kernel.org/project/netdevbpf/list/?series=421095&state=*

Looks like it was marked as changes requested. After this it's unlikely
anyone will follow up on it, rightly so given the assumption another
revision is coming.

In this case my guess is it was moved into changes requested because
I asked for a change, but then after some discussion you convinced me
the change was not in fact needed.

Alexei, Daniel can probably tell you if it's easier to just send a v18
or pull in the v17 assuming any final reviews don't kick anything
else up.

Thanks
John
Hangbin Liu Feb. 4, 2021, 3:12 a.m. UTC | #3
On Wed, Feb 03, 2021 at 06:53:20PM -0800, John Fastabend wrote:
> Hangbin Liu wrote:
> > Hi Daniel, Alexei,
> > 
> > It has been one week since Maciej, Toke and John's reviews/acks. What
> > should I do to make progress on this patch set?
> > 
> 
> Patchwork is usually the first place to check:

Thanks John for the link.
> 
>  https://patchwork.kernel.org/project/netdevbpf/list/?series=421095&state=*

Before I sent the email I only checked
https://patchwork.kernel.org/project/netdevbpf/list/ but couldn't find my patch.

How do you get the series number?

> 
> Looks like it was marked as changes requested. After this it's unlikely
> anyone will follow up on it, rightly so given the assumption another
> revision is coming.
> 
> In this case my guess is it was moved into changes requested because
> I asked for a change, but then after some discussion you convinced me
> the change was not in fact needed.
> 
> Alexei, Daniel can probably tell you if it's easier to just send a v18
> or pull in the v17 assuming any final reviews don't kick anything
> else up.

OK, I will wait for Alexei, Daniel and see if I need to do a rebase.

Thanks
Hangbin
Toke Høiland-Jørgensen Feb. 4, 2021, 11 a.m. UTC | #4
Hangbin Liu <liuhangbin@gmail.com> writes:

> On Wed, Feb 03, 2021 at 06:53:20PM -0800, John Fastabend wrote:
>> Hangbin Liu wrote:
>> > Hi Daniel, Alexei,
>> > 
>> > It has been one week since Maciej, Toke and John's reviews/acks. What
>> > should I do to make progress on this patch set?
>> > 
>> 
>> Patchwork is usually the first place to check:
>
> Thanks John for the link.
>> 
>>  https://patchwork.kernel.org/project/netdevbpf/list/?series=421095&state=*
>
> Before I sent the email I only checked link
> https://patchwork.kernel.org/project/netdevbpf/list/ but can't find my patch.
>
> How do you get the series number?

If you click the "show patches with" link at the top you can twiddle the
filtering; state = any + your own name as submitter usually finds
things, I've found.

>> Looks like it was marked as changes requested. After this it's unlikely
>> anyone will follow up on it, rightly so given the assumption another
>> revision is coming.
>> 
>> In this case my guess is it was moved into changes requested because
>> I asked for a change, but then after some discussion you convinced me
>> the change was not in fact needed.
>> 
>> Alexei, Daniel can probably tell you if it's easier to just send a v18
>> or pull in the v17 assuming any final reviews don't kick anything
>> else up.
>
> OK, I will wait for Alexei, Daniel and see if I need to do a rebase.

I think I would just resubmit with a rebase + a note in the changelog
that we concluded no further change was needed :)

-Toke
Fijalkowski, Maciej Feb. 4, 2021, 12:09 p.m. UTC | #5
On Thu, Feb 04, 2021 at 12:00:29PM +0100, Toke Høiland-Jørgensen wrote:
> [snip]
>
> I think I would just resubmit with a rebase + a note in the changelog
> that we concluded no further change was needed :)

I only asked for imperative mood in commit messages, but not sure if
anyone cares ;)

Hangbin Liu Feb. 4, 2021, 1:33 p.m. UTC | #6
On Thu, Feb 04, 2021 at 01:09:22PM +0100, Maciej Fijalkowski wrote:
> > I think I would just resubmit with a rebase + a note in the changelog
> > that we concluded no further change was needed :)
> 
> I only asked for imperative mood in commit messages, but not sure if
> anyone cares ;)

I will try, but I can't guarantee I can fix all the sentences.

Thanks
Hangbin
Hangbin Liu Feb. 4, 2021, 2:03 p.m. UTC | #7
This patch set is for XDP multicast support, which has been discussed before[0].
The goal is to be able to implement an OVS-like data plane in XDP, i.e.,
a software switch that can forward XDP frames to multiple ports.

To achieve this, an application needs to specify a group of interfaces
to forward a packet to. It is also common to want to exclude one or more
physical interfaces from the forwarding operation - e.g., to forward a
packet to all interfaces in the multicast group except the interface it
arrived on. While this could be done simply by adding more groups, this
quickly leads to a combinatorial explosion in the number of groups an
application has to maintain.

To avoid the combinatorial explosion, we propose to include the ability
to specify an "exclude group" as part of the forwarding operation. This
needs to be a group (instead of just a single port index), because there
may be multiple interfaces you want to exclude.

Thus, the logical forwarding operation becomes a "set difference"
operation, i.e. "forward to all ports in group A that are not also in
group B". This series implements such an operation using device maps to
represent the groups. This means that the XDP program specifies two
device maps, one containing the list of netdevs to redirect to, and the
other containing the exclude list.

To achieve this, a new helper, bpf_redirect_map_multi(), is implemented
to accept two maps: the forwarding map and the exclude map. If the user
doesn't want an exclude map and simply wants to stop redirecting back
to the ingress device, they can use the flag BPF_F_EXCLUDE_INGRESS.

The 1st patch is Jesper's change to run the devmap xdp_prog later, in the
bulking step.
The 2nd patch adds a new bpf arg to allow a NULL map pointer.
The 3rd patch adds the new bpf_redirect_map_multi() helper.
Patches 4-6 are the usage sample and tests.

I did the same perf tests as before with the following topology:

---------------------             ---------------------
| Host A (i40e 10G) |  ---------- | eno1(i40e 10G)    |
---------------------             |                   |
                                  |   Host B          |
---------------------             |                   |
| Host C (i40e 10G) |  ---------- | eno2(i40e 10G)    |
---------------------    vlan2    |          -------- |
                                  | veth1 -- | veth0| |
                                  |          -------- |
                                  ---------------------
On Host A:
# pktgen/pktgen_sample03_burst_single_flow.sh -i eno1 -d $dst_ip -m $dst_mac -s 64

On Host B (Intel(R) Xeon(R) CPU E5-2690 v3 @ 2.60GHz, 128G memory):
use xdp_redirect_map and xdp_redirect_map_multi from samples/bpf for
testing. The veth0 in the netns loads a dummy drop program. The
forward_map max_entries in xdp_redirect_map_multi is modified to 4.

Here are the perf results with 5.10 rc6. Numbers are in Mpps (millions
of packets per second); there is about +/- 0.1M deviation for native
testing:
Version             | Test                                    | Generic | Native | Native + 2nd
5.10 rc6            | xdp_redirect_map        i40e->i40e      |    2.0M |   9.1M |  8.0M
5.10 rc6            | xdp_redirect_map        i40e->veth      |    1.7M |  11.0M |  9.7M
5.10 rc6 + patch1   | xdp_redirect_map        i40e->i40e      |    2.0M |   9.5M |  7.5M
5.10 rc6 + patch1   | xdp_redirect_map        i40e->veth      |    1.7M |  11.6M |  9.1M
5.10 rc6 + patch1-6 | xdp_redirect_map        i40e->i40e      |    2.0M |   9.5M |  7.5M
5.10 rc6 + patch1-6 | xdp_redirect_map        i40e->veth      |    1.7M |  11.6M |  9.1M
5.10 rc6 + patch1-6 | xdp_redirect_map_multi  i40e->i40e      |    1.7M |   7.8M |  6.4M
5.10 rc6 + patch1-6 | xdp_redirect_map_multi  i40e->veth      |    1.4M |   9.3M |  7.5M
5.10 rc6 + patch1-6 | xdp_redirect_map_multi  i40e->i40e+veth |    1.0M |   3.2M |  2.7M

Last but not least, thanks a lot to Toke, Jesper, Jiri and Eelco for
suggestions and help on implementation.

[0] https://xdp-project.net/#Handling-multicast

v18: No update, just rebase the code to latest bpf-next

v17:
For patch 01:
a) rename to_sent to to_send.
b) clear bq dev_rx, xdp_prog and flush_node in __dev_flush().

v16:
refactor bq_xmit_all logic and remove error label for patch 01

v15:
Update bq_xmit_all() logic for patch 01.
Add some comments and remove useless variable for patch 03.
Use bpf_object__find_program_by_title() for patches 04 and 06.

v14:
No code update, just rebase the code on latest bpf-next

v13:
Pass in xdp_prog through __xdp_enqueue() for patch 01. Update related
code in patch 03.

v12:
Add Jesper's xdp_prog patch, rebase my work on it and the latest bpf-next.
Add 2nd xdp_prog test on the sample and selftests.

v11:
Fix bpf_redirect_map_multi() helper description typo.
Add loop limit for devmap_get_next_obj() and dev_map_redirect_multi().

v10:
Rebase the code to latest bpf-next.
Update helper bpf_xdp_redirect_map_multi()
- No need to check map pointer as we will do the check in verifier.

v9:
Update helper bpf_xdp_redirect_map_multi()
- Use ARG_CONST_MAP_PTR_OR_NULL for helper arg2

v8:
a) Update function dev_in_exclude_map():
   - remove the duplicate ex_map map_type check
   - look up the element in the dev map by the obj dev index directly
     instead of looping over the whole map

v7:
a) Fix helper flag check
b) Limit the *ex_map* to use DEVMAP_HASH only and update function
   dev_in_exclude_map() to get better performance.

v6: converted helper return types from int to long

v5:
a) Check devmap_get_next_key() return value.
b) Pass through flags to __bpf_tx_xdp_map() instead of bool value.
c) In function dev_map_enqueue_multi(), consume xdpf for the last
   obj instead of the first one.
d) Update helper description and code comments to explain that we
   use NULL target value to distinguish multicast and unicast
   forwarding.
e) Update memory model, memory id and frame_sz in xdpf_clone().
f) Split the tests from sample and add a bpf kernel selftest patch.

v4: Fix bpf_xdp_redirect_map_multi_proto arg2_type typo

v3: Based on Toke's suggestion, do the following update
a) Update bpf_redirect_map_multi() description in bpf.h.
b) Fix exclude_ifindex checking order in dev_in_exclude_map().
c) Fix one more xdpf clone in dev_map_enqueue_multi().
d) Go find the next one in dev_map_enqueue_multi() if the interface is
   not able to forward, instead of aborting the whole loop.
e) Remove READ_ONCE/WRITE_ONCE for ex_map.

v2: Add new syscall bpf_xdp_redirect_map_multi() which could accept
include/exclude maps directly.

Hangbin Liu (5):
  bpf: add a new bpf argument type ARG_CONST_MAP_PTR_OR_NULL
  xdp: add a new helper for dev map multicast support
  sample/bpf: add xdp_redirect_map_multicast test
  selftests/bpf: Add verifier tests for bpf arg
    ARG_CONST_MAP_PTR_OR_NULL
  selftests/bpf: add xdp_redirect_multi test

Jesper Dangaard Brouer (1):
  bpf: run devmap xdp_prog on flush instead of bulk enqueue

 include/linux/bpf.h                           |  21 ++
 include/linux/filter.h                        |   1 +
 include/net/xdp.h                             |   1 +
 include/uapi/linux/bpf.h                      |  28 ++
 kernel/bpf/devmap.c                           | 262 +++++++++++----
 kernel/bpf/verifier.c                         |  16 +-
 net/core/filter.c                             | 124 ++++++-
 net/core/xdp.c                                |  29 ++
 samples/bpf/Makefile                          |   3 +
 samples/bpf/xdp_redirect_map_multi_kern.c     |  87 +++++
 samples/bpf/xdp_redirect_map_multi_user.c     | 302 ++++++++++++++++++
 tools/include/uapi/linux/bpf.h                |  28 ++
 tools/testing/selftests/bpf/Makefile          |   3 +-
 .../bpf/progs/xdp_redirect_multi_kern.c       | 111 +++++++
 tools/testing/selftests/bpf/test_verifier.c   |  22 +-
 .../selftests/bpf/test_xdp_redirect_multi.sh  | 208 ++++++++++++
 .../testing/selftests/bpf/verifier/map_ptr.c  |  70 ++++
 .../selftests/bpf/xdp_redirect_multi.c        | 252 +++++++++++++++
 18 files changed, 1501 insertions(+), 67 deletions(-)
 create mode 100644 samples/bpf/xdp_redirect_map_multi_kern.c
 create mode 100644 samples/bpf/xdp_redirect_map_multi_user.c
 create mode 100644 tools/testing/selftests/bpf/progs/xdp_redirect_multi_kern.c
 create mode 100755 tools/testing/selftests/bpf/test_xdp_redirect_multi.sh
 create mode 100644 tools/testing/selftests/bpf/xdp_redirect_multi.c
Jakub Kicinski Feb. 4, 2021, 5:03 p.m. UTC | #8
On Thu, 04 Feb 2021 12:00:29 +0100 Toke Høiland-Jørgensen wrote:
> >> Patchwork is usually the first place to check:  
> >
> > Thanks John for the link.  
> >> 
> >>  https://patchwork.kernel.org/project/netdevbpf/list/?series=421095&state=*  
> >
> > Before I sent the email I only checked link
> > https://patchwork.kernel.org/project/netdevbpf/list/ but can't find my patch.
> >
> > How do you get the series number?  
> 
> If you click the "show patches with" link at the top you can twiddle the
> filtering; state = any + your own name as submitter usually finds
> things, I've found.

New patchwork can actually find messages by Message-ID header.

Just slap the Message ID of one of the patches at the end of:

https://patchwork.kernel.org/project/netdevbpf/patch/

And there is a link to the entire series there.
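
For example, with the Message ID of this series' cover letter that
becomes:

https://patchwork.kernel.org/project/netdevbpf/patch/20210125124516.3098129-1-liuhangbin@gmail.com/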


Since I'm speaking: Hangbin, I'd discourage posting a new version
as a reply to the previous posting. It pulls in this massive 100+
message thread and breaks the natural ordering of patches to review.
Hangbin Liu Feb. 5, 2021, 3:07 a.m. UTC | #9
Hi Jakub,
On Thu, Feb 04, 2021 at 09:03:23AM -0800, Jakub Kicinski wrote:
> New patchwork can actually find messages by Message-ID header.
> 
> Just slap message ID of one of the patches at the end of:
> 
> https://patchwork.kernel.org/project/netdevbpf/patch/
> 
> And there is a link to entire series there.

Thanks for the tips.

> 
> Since I'm speaking, Hangbin I'd discourage posting new version 
> as a reply to previous posting. It brings out this massive 100+
> message thread and breaks natural ordering of patches to review.

Thanks for the reminder. I will not reply to the previous version and
will only use a link in the future.

Thanks
Hangbin